Monday, August 16, 2010

Fundamental Assumptions for Hadoop

When Google began ingesting and processing the entire web on a regular basis, no existing system was up to the task. Managing and processing data at that scale had simply never been attempted before.
To address the massive scale of data first introduced by the web, but now commonplace in many industries, Google built systems from the ground up to reliably store and process petabytes of data on large clusters of commodity hardware. Hadoop is the open-source descendant of those systems, and understanding it starts with letting go of a few assumptions that traditional systems take for granted.


Assumption 1: Hardware can be reliable.
It is true that you can pay a significant premium for hardware with a mean time to failure (MTTF) that exceeds its expected lifespan. However, working with web-scale data requires thousands of disks and servers. Even an MTTF of 4 years results in nearly 5 failures per week in a cluster of 1,000 nodes. For a fraction of the cost, using commodity hardware with an MTTF of 2 years, you can expect just shy of 10 failures per week. In the grand scheme of things, these scenarios are nearly identical, and both require fundamentally rethinking fault tolerance. To provide reliable storage and computation at scale, fault tolerance must be built into the software. Once it is, the economics of “reliable” hardware quickly fall apart.
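To make the failure arithmetic concrete, here is a minimal sketch; the 1,000-node cluster and the 2- and 4-year MTTF figures come from the paragraph above, and treating failures as spread evenly over the MTTF is a simplifying assumption.

# Expected weekly failures in a cluster, assuming failures are spread
# evenly over each machine's mean time to failure (a simplification).
WEEKS_PER_YEAR = 52

def failures_per_week(nodes, mttf_years):
    return nodes / (mttf_years * WEEKS_PER_YEAR)

print(failures_per_week(1000, 4))  # ~4.8 per week with premium "reliable" hardware
print(failures_per_week(1000, 2))  # ~9.6 per week with commodity hardware

Either way, a failure is a routine event rather than an emergency, which is why the software layer has to absorb it.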
Assumption 2: Machines have identities.
Once you accept that all machines will eventually fail, you need to stop thinking of them as individuals with identities, or you will quickly find yourself trying to identify a machine that no longer exists. Obviously, if you want to leverage many machines to accomplish a task, they must be able to communicate with each other. This is true, but to deal effectively with the reality of unreliable hardware, communication must be implicit. It must not depend on machine X sending some data Y to machine Z, but rather on the statement that some data Y must be processed by some machine. If you insist on explicit communication at scale, you face a verification problem at least as large as your data processing problem. The shift from explicit to implicit communication allows the underlying software system to reliably ensure that data is stored and processed without requiring the programmer to verify every exchange, and, more importantly, without giving them the chance to get that verification wrong.
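As a toy sketch of what implicit communication looks like (this is not Hadoop's actual scheduler; the queue and retry below are hypothetical stand-ins for the bookkeeping the framework does internally), work is described as "block B needs processing" rather than "machine X must send B to machine Z":

import queue

# Work is expressed as "this block needs processing", with no machine named.
tasks = queue.Queue()
for block_id in range(8):
    tasks.put(block_id)

def worker(name, flaky=False):
    # Any available "machine" can claim the next block of work.
    while not tasks.empty():
        block = tasks.get()
        try:
            if flaky:
                raise IOError(f"{name} died")  # simulate a hardware failure
            print(f"{name} processed block {block}")
        except IOError:
            tasks.put(block)  # the framework simply re-queues the block
            return            # this machine is gone; the others carry on

worker("node-1", flaky=True)  # fails immediately; its block is re-queued
worker("node-2")              # finishes the job without anyone naming node-1

No code here ever verifies that a particular machine received a particular block; the only guarantee that matters is that every block eventually gets processed.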
Assumption 3: A data set can be stored on a single machine.
When working with large amounts of data, we are quickly confronted with data sets that exceed the capacity of a single disk and are intractable for a single processor. This forces us to change our assumptions about how data is stored and processed. A large data set can instead be stored in many pieces across many machines, which also sets it up for parallel computation. If each machine in your cluster stores a small piece of each data set, any machine can process part of any data set by reading from its local disk. With many machines working in parallel, you can process the entire data set by pushing the computation to the data, conserving precious network bandwidth.
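Here is a minimal single-process sketch of that idea, assuming a word-count-style job; the block placement and the two pretend machines are arbitrary illustrations, not how HDFS actually assigns blocks or replicas:

from collections import Counter

# The data set is split into blocks that live on different machines.
data_set = ["the quick brown fox", "jumps over the lazy dog",
            "the dog barks", "the fox runs"]

# Pretend placement: block i lives on machine i % 2. A real distributed
# file system decides placement and keeps replicas; both are omitted here.
machines = {0: [], 1: []}
for i, block in enumerate(data_set):
    machines[i % 2].append(block)

def process_local_blocks(blocks):
    # Runs on the machine that already holds the blocks, so no bulk data
    # ever crosses the network.
    counts = Counter()
    for block in blocks:
        counts.update(block.split())
    return counts

partials = [process_local_blocks(blocks) for blocks in machines.values()]
total = sum(partials, Counter())  # only the small per-machine summaries are merged
print(total.most_common(3))

Only the per-machine summaries travel over the network; the raw blocks stay where they were written, which is the whole point of pushing computation to the data.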
