The central idea behind Map/Reduce is distributed processing. You have a cluster of machines that hosts a shared file system, where the data is stored, and that provides distributed job management.
Processes that run across the cluster are called jobs. A job has two phases: a map phase, where the relevant data is collected from across the cluster, and a reduce phase, where that data is further processed to produce the final result set.
Let’s take the example of finding a soul mate on a dating site. All the profiles for millions of users are stored in the shared file system, which is spread across a cluster of machines.
We submit a job to the cluster with the profile pattern we’re seeking. In the map phase, the cluster finds all the profiles that match our requirements. In the reduce phase, those profiles are sorted by match quality and only the top ten are returned.
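To make the two phases concrete, here is a minimal, single-machine sketch of that search in plain Java (not Hadoop code). The Profile record, the minScore threshold, and the scoring field are hypothetical names invented for illustration, and the sketch assumes Java 16 or later for records.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical profile with a precomputed compatibility score.
record Profile(String name, int score) {}

class SoulMateSearch {

    // Map phase: scan every profile and keep only those that
    // match our requirements (here, a minimum compatibility score).
    static List<Profile> map(List<Profile> allProfiles, int minScore) {
        return allProfiles.stream()
                .filter(p -> p.score() >= minScore)
                .collect(Collectors.toList());
    }

    // Reduce phase: sort the matches by score and return only the top ten.
    static List<Profile> reduce(List<Profile> matches) {
        return matches.stream()
                .sorted(Comparator.comparingInt(Profile::score).reversed())
                .limit(10)
                .collect(Collectors.toList());
    }
}
```

In a real cluster the map function runs in parallel on the machines that hold the profile data, and the reduce function combines their partial results; the sketch only shows the shape of the two steps.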
Hadoop is a Map/Reduce framework that’s broken into two large pieces. The first is the Hadoop Distributed File System (HDFS), a distributed file system written in Java that works much like a standard Unix file system. On top of that sits the Hadoop job execution system, which coordinates jobs, prioritizes and monitors them, and provides a framework that you can use to develop your own types of jobs. It even has a handy web page where you can monitor the progress of your jobs.
At a high level, the Hadoop cluster looks like Figure 1, "The Hadoop cluster architecture".
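To show what developing your own job looks like, here is a sketch modeled on the classic word-count example from the Hadoop documentation, using the org.apache.hadoop.mapreduce API. Treat it as an illustration of the mapper/reducer/driver structure rather than production code.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configure the job and submit it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this as a jar and submit it with something like `hadoop jar wordcount.jar WordCount <input> <output>`, then follow its progress on the web page mentioned above.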