1.1 Selecting the Operating System and JVM
Using a Linux* distribution based on kernel version 2.6.30 or later is recommended when deploying Hadoop on current generation servers because of the optimizations included for energy and threading efficiency. For example, Intel has observed that energy consumption can be up to 60 percent (42 watts) higher at idle for each server using older versions of Linux.6 Such power inefficiency, multiplied over a large Hadoop cluster, could amount to significant additional energy costs. For better performance, the local file systems (for example, ext3 or xfs) are usually mounted with the noatime option, which stops the kernel from updating a file's access time on every read. In addition, Sun Java* 6 is required to run Hadoop, and the latest version (Java 6u14 or later) is recommended to take advantage of optimizations such as compressed ordinary object pointers.
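As an illustration of the kernel and mount recommendations above, the following Python sketch checks whether a kernel release string meets the 2.6.30 minimum and lists ext3 or xfs mount points that lack noatime. The function names are hypothetical, and the sketch assumes the mount table is supplied in the /proc/mounts layout; it is a rough aid for auditing a node, not part of Hadoop.

```python
import os
import platform
import re


def kernel_at_least(release, minimum=(2, 6, 30)):
    """Return True if a release string such as '2.6.32-5-amd64' is >= minimum."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    # Missing components (for example '3.10') are treated as zero.
    version = tuple(int(part or 0) for part in match.groups())
    return version >= minimum


def mounts_missing_noatime(mounts_text, fs_types=("ext3", "xfs")):
    """Return mount points of the given types that are mounted without noatime.

    mounts_text is assumed to follow the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        options = fields[3].split(",")
        if fs_type in fs_types and "noatime" not in options:
            missing.append(mount_point)
    return missing


if __name__ == "__main__":
    print("kernel >= 2.6.30:", kernel_at_least(platform.release()))
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as handle:
            print("missing noatime:", mounts_missing_noatime(handle.read()))
```

Any mount point reported by the second function is a candidate for adding noatime to its options field in /etc/fstab.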
The default Linux open file descriptor limit is set to 1024, which is usually too low for Hadoop daemons. This setting should be increased to approximately 64,000 using the /etc/security/limits.conf file or alternate means. If the Linux kernel 2.6.28 is used, the default open epoll file descriptor limit is 128, which is too low for Hadoop and should be increased to approximately 4096 using the /etc/sysctl.conf file or alternate means.
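A minimal Python sketch for verifying these limits on a node follows. It compares the current process's open file descriptor limit with the approximate 64,000 target and reads the epoll limit from /proc/sys/fs/epoll/max_user_instances. That path, and the matching sysctl key fs.epoll.max_user_instances, are assumptions about how 2.6.28 kernels expose the setting; the function names are hypothetical. An /etc/security/limits.conf entry would typically take the form `hadoop - nofile 64000`, where the user name `hadoop` is also an assumption.

```python
import resource

RECOMMENDED_NOFILE = 64000  # approximate target from the text
RECOMMENDED_EPOLL = 4096    # approximate target from the text
# Assumed location of the epoll instance limit on 2.6.28 kernels.
EPOLL_PATH = "/proc/sys/fs/epoll/max_user_instances"


def nofile_limit_ok(soft_limit, recommended=RECOMMENDED_NOFILE):
    """Return True if the soft open-file limit meets the recommended value."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # unlimited always satisfies the recommendation
    return soft_limit >= recommended


def read_epoll_limit(path=EPOLL_PATH):
    """Return the epoll instance limit, or None if the kernel does not expose it."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (IOError, OSError, ValueError):
        return None


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open files: soft=%s hard=%s ok=%s" % (soft, hard, nofile_limit_ok(soft)))
    epoll = read_epoll_limit()
    if epoll is None:
        print("epoll limit not exposed by this kernel")
    else:
        print("epoll instances: %d ok=%s" % (epoll, epoll >= RECOMMENDED_EPOLL))
```

The limit reported is that of the process running the script, so it should be run as the same user, and through the same login path, as the Hadoop daemons.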
1.2 Choosing Hadoop Versions and Distributions
When selecting a version of Hadoop for the implementation, organizations must balance the enhancements offered by the most recent release against the stability of more mature versions.
For example, at the time of this writing, the most recent stable version of Hadoop is 0.18.3, while the latest release of Hadoop, version 0.20.0, contains important enhancements, including a pluggable scheduling API, the capacity scheduler, the fair scheduler, and multiple task assignment. Another potential advantage of Hadoop 0.20.0 is performance. Intel’s lab testing shows that some workloads within Hadoop can benefit from the multi-task assignment features in 0.20.0. Although the Map stage in v0.20.0 is slower and uses more memory than in v0.19.1, the overall job runs at about the same speed, or up to 8 percent faster in the case of Hadoop Sort.
The primary source for securing the latest distribution is the Apache Software Foundation Web site (www.apache.org). For companies planning Hadoop installations, it may be worthwhile to evaluate the Cloudera distribution, which includes RPM and Debian* packaging and tools for configuration. Intel has used Cloudera’s distribution on some of its lab systems for performance testing.