Monday, August 16, 2010

Hadoop Configurations and Tuning

To achieve maximum results from Hadoop implementations, there are some key considerations for configuring the Hadoop environment itself, which are described in this section. As with the system hardware and software recommendations given above, these settings must be tailored to the needs of the individual implementation, and users are encouraged to experiment with their own systems and environments. Nevertheless, there are factors common to most Hadoop deployments that provide general guidance.



5.1 General Configurations

• The numbers of NameNode and JobTracker server threads that handle remote procedure calls (RPCs), specified by dfs.namenode.handler.count and mapred.job.tracker.handler.count, respectively, both default to 10 and should be set to a larger number (for example, 64) for large clusters.


• The number of DataNode server threads that handle RPCs, as specified by dfs.datanode.handler.count, defaults to three and should be set to a larger number (for example, eight) if there are a large number of HDFS clients. (Note: Every additional thread consumes more memory.)


• The number of worker threads on the HTTP server that runs on each TaskTracker to handle the output of map tasks on that server, as specified by tasktracker.http.threads, should be set in the range of 40 to 50 for large clusters. A sample configuration fragment for these settings follows this list.
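As a rough sketch of how these settings are expressed (assuming a Hadoop 0.20-era deployment, where the dfs.* properties live in hdfs-site.xml and the JobTracker/TaskTracker properties in mapred-site.xml), the fragment below uses the example values mentioned above; treat them as starting points to be validated on your own cluster:

  <!-- hdfs-site.xml -->
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>   <!-- default 10; raise for large clusters -->
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>8</value>    <!-- default 3; each extra thread uses more memory -->
  </property>

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.job.tracker.handler.count</name>
    <value>64</value>   <!-- default 10 -->
  </property>
  <property>
    <name>tasktracker.http.threads</name>
    <value>48</value>   <!-- 40 to 50 recommended for large clusters -->
  </property>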



5.2 HDFS-Specific Configurations

• The replication factor for each block of an HDFS file, as specified by dfs.replication, is typically set to three for fault-tolerance; setting it to a smaller value is not recommended.


• The HDFS block size, as specified by dfs.block.size, defaults to 64 MB, and it is usually desirable to use a larger block size (such as 128 MB) for large file systems; see the sample fragment below.
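For reference, a minimal hdfs-site.xml fragment reflecting both settings might look as follows (note that in this Hadoop generation dfs.block.size is given in bytes, so 128 MB is written as 134217728):

  <property>
    <name>dfs.replication</name>
    <value>3</value>           <!-- keep at 3 for fault-tolerance -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>   <!-- 128 MB; the default is 67108864 (64 MB) -->
  </property>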


5.3 Map/Reduce-Specific Configurations

• The maximum number of map and reduce tasks that run simultaneously on a TaskTracker, as specified by mapred.tasktracker.{map|reduce}.tasks.maximum, should usually be set in the range of (cores_per_node)/2 to 2x(cores_per_node), especially for large clusters.

• The number of input streams (files) to be merged at once in the map/reduce tasks, as specified by io.sort.factor, should be set to a sufficiently large value (for example, 100) to minimize disk accesses.

• The JVM settings should include the parameter java.net.preferIPv4Stack set to true, to avoid timeouts in cases where the OS/JVM picks up an IPv6 address and must resolve the hostname. A sample mapred-site.xml fragment for these settings follows this list.
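The sketch below assumes, for illustration, an 8-core node configured with one map or reduce slot per core, and passes the IPv4 flag to the task JVMs through mapred.child.java.opts (the heap size shown is illustrative; the Hadoop daemons themselves would need the same flag set separately, for example via HADOOP_OPTS in hadoop-env.sh):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>    <!-- between (cores_per_node)/2 and 2x(cores_per_node) -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>100</value>  <!-- default 10; fewer merge passes means fewer disk accesses -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m -Djava.net.preferIPv4Stack=true</value>
  </property>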


5.4 Map Task-Specific Configurations


• The total size of the result and metadata buffers associated with a map task, as specified by io.sort.mb, defaults to 100 MB and can be set to a higher value, such as 200 MB.


• The percentage of total buffer size that is dedicated to the metadata buffer, as specified by io.sort.record.percent, defaults to 0.05 and should be adjusted according to the key-value pair size of the particular Hadoop job; a sample fragment follows.
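A matching mapred-site.xml fragment is sketched below; the values are illustrative, and io.sort.mb must fit comfortably inside the task JVM heap set by mapred.child.java.opts:

  <property>
    <name>io.sort.mb</name>
    <value>200</value>    <!-- default 100 -->
  </property>
  <property>
    <name>io.sort.record.percent</name>
    <value>0.10</value>   <!-- illustrative value for a job with many small
                               key-value pairs; the default is 0.05 -->
  </property>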
