Thursday, May 16, 2013

Configuring Hadoop for Failover

As of 0.20, Hadoop does not support automatic recovery in the case of a NameNode failure. This is a well-known and recognized single point of failure in Hadoop.
Experience at Yahoo! shows that NameNodes are more likely to fail due to misconfiguration, network issues, and bad behavior amongst clients than actual hardware problems. Out of fifteen grids over a three-year period, only three NameNode failures were related to hardware problems.


There are some preliminary steps that must be in place before a NameNode recovery can be performed. The most important is the dfs.name.dir property, which configures the NameNode to write its metadata to more than one directory. A typical configuration might look something like this:
  <property>
    <name>dfs.name.dir</name>
    <value>/export/hadoop/namedir,/remote/export/hadoop/namedir</value>
  </property>
The first directory is local and the second is an NFS-mounted directory. The NameNode writes to both locations, keeping the HDFS metadata in sync, so a copy of the metadata lives off-machine and there is something left to recover. During startup, the NameNode picks the most recent of the two directories and syncs the other to match.
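As a quick sanity check, you can compare the image files in the two directories. This is only a sketch: the paths come from the example above, and the current/fsimage layout is an assumption about your Hadoop version.

    # Both copies of the metadata should match (paths and layout assumed)
    md5sum /export/hadoop/namedir/current/fsimage \
           /remote/export/hadoop/namedir/current/fsimage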
After we have configured the NameNode to write to two or more directories, we have a working backup of the metadata. In the more common failure scenarios, we can use this data to bring the dead NameNode back from the grave.
When a Failure Occurs
Now the recovery steps:
  1. Make a copy of the data on the remote NFS mount, just to be safe.
  2. Pick a target machine on the same network.
  3. Change the IP address of that machine to match the NameNode's IP address. Using an interface alias to provide this address movement works as well. If this is not an option, be prepared to restart the entire grid to avoid hitting https://issues.apache.org/jira/browse/HADOOP-3988 .
  4. Install Hadoop on it, configured the same way as the original NameNode.
  5. Do not format this node!
  6. Mount the remote NFS directory in the same location.
  7. Start up the NameNode.
  8. The NameNode should start replaying the edits file and updating the image; block reports should come in, and so on.
At this point, your NameNode should be up.
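On the replacement machine, the steps above might look something like the following sketch. The interface name, IP address, and NFS server are placeholders, and the paths are borrowed from the example configuration.

    # Step 1: preserve the NFS copy of the metadata before touching anything
    cp -rp /remote/export/hadoop/namedir /remote/export/hadoop/namedir.backup

    # Step 3: move the dead NameNode's IP onto this machine with an alias
    ifconfig eth0:0 10.0.0.101 netmask 255.255.255.0 up

    # Step 6: mount the remote NFS directory at the same path as before
    mount nfsserver:/export/hadoop/namedir /remote/export/hadoop/namedir

    # Step 7: start the NameNode (the node was deliberately never formatted)
    bin/hadoop-daemon.sh start namenode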
Other Ideas
There are some other ideas to help with NameNode recovery:
  1. Keep in mind that the SecondaryNameNode and/or the CheckpointNode also holds an older copy of the NameNode metadata. If you haven't done the preliminary work above, you might still be able to recover using the data on those systems (see the sketch after this list). Just note that it will only be as fresh as the last checkpoint run, and you will likely experience some data loss.
  2. Instead of using NFS on Linux, it may be worthwhile to look into DRBD. A few sites are using this with great success.
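One way to use that checkpoint data is sketched below, under the assumption that your Hadoop version supports the -importCheckpoint startup option and that fs.checkpoint.dir points at the SecondaryNameNode's checkpoint directory.

    # dfs.name.dir must exist and be empty -- do not format it.
    # The NameNode loads the image from fs.checkpoint.dir and saves
    # it into dfs.name.dir.
    bin/hadoop namenode -importCheckpoint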

NameNode startup fails

The following are some problems encountered in Hadoop and ways to go about solving them. See also NameNodeFailover and ConnectionRefused.

Exception when initializing the filesystem

ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:759)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
    at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
    at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
    at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
This is sometimes encountered if there is a corruption of the edits file in the transaction log. Try using a hex editor or equivalent to open up the edits file and get rid of the last record. In many cases the last record is incomplete, which is why the NameNode will not start. Once you have updated the edits file, start the NameNode and run hadoop fsck / to see whether you have any corrupt files, and fix or delete them.
Take a backup of dfs.name.dir before editing or otherwise playing around with it.
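A minimal sketch of that workflow, assuming the example dfs.name.dir path from earlier and a current/edits layout:

    # Back up the metadata directory before touching anything
    cp -rp /export/hadoop/namedir /export/hadoop/namedir.bak

    # Inspect the tail of the edits log; the last record is often truncated
    xxd /export/hadoop/namedir/current/edits | tail

    # After removing the incomplete record and restarting the NameNode:
    bin/hadoop fsck /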

Client cannot talk to filesystem

Network Error Messages

Error message: Could not get block locations. Aborting...

There are a number of possible causes for this.
  • The NameNode may be overloaded. Check the logs for messages that say "discarding calls..." (see the commands after this list).
  • There may not be enough (or any) DataNodes running for the data to be written. Again, check the logs.
  • Every DataNode on which the blocks were stored might be down (or not connected to the NameNode; it is impossible to distinguish the two from the client side).
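A couple of quick checks for the causes above; the log path here is an assumption, so adjust it for your installation.

    # Look for overload messages in the NameNode log (hypothetical path)
    grep "discarding calls" /var/log/hadoop/hadoop-*-namenode-*.log

    # Ask the NameNode how many DataNodes it considers live vs. dead
    bin/hadoop dfsadmin -report | head -20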

Error message: Could not obtain block

Your logs contain something like
INFO hdfs.DFSClient: Could not obtain block blk_-4157273618194597760_1160 from any node:  
 java.io.IOException: No live nodes contain current block 
There are no live DataNodes containing a copy of the block of the file you are looking for. Bring up any nodes that are down, or skip that block.
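To find out which DataNodes are supposed to hold the file's blocks, fsck can help; /path/to/file is a placeholder.

    # List the file's blocks and the DataNodes that should hold them
    bin/hadoop fsck /path/to/file -files -blocks -locations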

Reduce hangs

This can be a DNS issue. Two problems which have been encountered in practice are:
  • Machines with multiple NICs. In this case, set dfs.datanode.dns.interface (in hdfs-site.xml) and mapred.datanode.dns.interface (in mapred-site.xml) to the name of the network interface used by Hadoop (something like eth0 under Linux). A sketch follows this list.
  • Badly formatted or incorrect hosts and DNS files (/etc/hosts and /etc/resolv.conf under Linux) can wreak havoc. Any DNS problem will hobble Hadoop, so ensure that names can be resolved correctly.
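A minimal sketch of the first fix, using the property names mentioned above and eth0 as a placeholder interface:

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.datanode.dns.interface</name>
      <value>eth0</value>
    </property>

    <!-- mapred-site.xml -->
    <property>
      <name>mapred.datanode.dns.interface</name>
      <value>eth0</value>
    </property>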

Error message saying a file "Could only be replicated to 0 nodes instead of 1"

(or any similar number such as "2 nodes instead of 3")

Client unable to connect to server, "Server not available"

Error message: Too many open files on client or server

See TooManyOpenFiles