One of the most important decisions in planning a Hadoop infrastructure deployment is the number, type, and configuration of the servers specified. As with other workloads, computation may be bound by I/O, memory, or processor resources, depending on the specific Hadoop application. Those requirements call for system-level hardware to be tuned on a case-by-case basis, but the general guidelines suggested in this section provide a point of departure for that fine-tuning.
1.1 Choosing a Server Platform
Typically, dual-socket servers are optimal for Hadoop deployments. From a per-node, cost-benefit perspective, servers of this type are generally more efficient than large-scale multiprocessor platforms for massively distributed implementations such as Hadoop. At the same time, dual-socket servers more than offset their added per-node hardware cost relative to entry-level servers through lower load-balancing and parallelization overheads. Choosing hardware based on the most current platform technologies available helps to ensure optimal intra-server throughput and energy efficiency.
1.2 Selecting and Configuring the Hard Drive
A relatively large number of hard drives per server (typically four to six) is recommended. While it is possible to use RAID 0 to concatenate smaller drives into a larger pool, using RAID on Hadoop servers is generally not recommended, because Hadoop itself orchestrates data provisioning and redundancy across individual nodes. Presenting each drive to Hadoop individually provides good results across a wide spectrum of workloads, because Hadoop spreads its I/O across all of the drives it is given.
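As a concrete sketch (the mount points shown are hypothetical), each physical drive can be listed as its own data directory in hdfs-site.xml, so the DataNode distributes blocks across the drives directly with no RAID layer; the property is named dfs.data.dir in older Hadoop releases and dfs.datanode.data.dir in more recent ones.

    <!-- hdfs-site.xml: one data directory per physical drive (example mount points) -->
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data,/mnt/disk4/hdfs/data</value>
    </property>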
The optimal balance between cost and performance is generally achieved with 7200 RPM SATA drives. That balance is likely to shift as drive technologies evolve, but it is a useful rule of thumb at the time of this writing. Hard drives should run in AHCI (Advanced Host Controller Interface) mode with NCQ (Native Command Queuing) enabled, to improve performance when multiple simultaneous read/write requests are outstanding.
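On a Linux host, it is possible to confirm that NCQ is actually in effect once AHCI mode has been enabled in the BIOS; the device name /dev/sda below is only an example and will differ from system to system.

    # Check whether the drive reports NCQ support (requires the hdparm utility)
    hdparm -I /dev/sda | grep -i "queue"
    # A queue depth greater than 1 indicates that NCQ is in use
    cat /sys/block/sda/device/queue_depth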
1.3 Memory Sizing
Sufficient memory capacity is critical for efficient operation of servers in a Hadoop cluster, supporting high throughput by allowing large numbers of map/reduce tasks to be carried out simultaneously. Typical Hadoop applications require approximately 1–2 GB of RAM per processor core, which corresponds to 8–16 GB for a dual-socket server using quad-core processors. When deploying servers based on the Intel® Xeon® processor 5500 series, it is recommended that DIMMs (dual in-line memory modules) be populated in multiples of six to balance across available memory channels (that is, system configurations of 12 GB, 24 GB, and so on). As a final consideration, ECC (error-correcting code) memory is highly recommended for Hadoop, to detect and correct errors introduced during storage and transmission of data.
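To illustrate how that guideline can translate into Hadoop settings, the sketch below shows per-node task-slot and heap values for a hypothetical dual-socket, quad-core server (8 cores) with 16 GB of RAM; the values are assumptions for illustration only, using the mapred-site.xml property names of the MapReduce framework current at the time of this writing.

    <!-- mapred-site.xml: illustrative sizing for an 8-core, 16 GB node -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>   <!-- concurrent map slots per node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>   <!-- concurrent reduce slots per node -->
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>   <!-- ~1 GB heap per task, within the 1-2 GB per core guideline -->
    </property>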
1.4 Selecting a Motherboard
To maximize the energy efficiency and performance of a Hadoop cluster, it is important to select the server motherboard carefully. Hadoop deployments do not require many of the features typically found in an enterprise data center server, and the motherboard selected should use high-efficiency voltage regulators and be optimized for airflow. Many vendors have designed systems based on Intel Xeon processors with those characteristics; they are typically marketed to cloud computing or Internet data center providers. One such product is the Intel® Server Board S5500WB (formerly code-named Willowbrook), which has been specifically designed for high-density computing environments. Selecting a server with the right motherboard can have a positive impact on the bottom line compared to using enterprise-focused systems that lack similar optimizations.
1.5 Specifying a Power Supply
As a key means of reducing overall cost of ownership, organizations should specify, as part of the design and planning process, their energy-efficiency requirements for server power supplies. Power supplies certified by the 80 PLUS* Program (www.80plus.org) at various levels, including bronze, silver, and gold (with gold being the most efficient), provide organizations with objective standards to use during the procurement process.
1.6 Choosing Processors
The processor plays an important role in determining the speed, throughput, and efficiency of Hadoop clusters. The Intel Xeon processor 5500 series provides excellent performance for highly distributed workloads such as those associated with Hadoop applications. Lab testing was performed to establish the performance benefits of the Intel Xeon processor 5500 series relative to previous-generation Intel processors.