Monday, August 16, 2010

RDBMS and Hadoop

Here is a comparison of the main differences between a traditional RDBMS and MapReduce-based systems such as Hadoop:

                RDBMS                   MapReduce
Data size       Gigabytes               Petabytes
Access          Interactive and batch   Batch
Structure       Fixed schema            Unstructured schema
Language        SQL                     Procedural (Java, C++, Ruby, etc.)
Integrity       High                    Low
Scaling         Nonlinear               Linear
Updates         Read and write          Write once, read many times
Latency         Low                     High


From this it is clear that the MapReduce model cannot replace the traditional enterprise RDBMS. It can, however, enable a number of interesting scenarios that considerably increase flexibility, shorten turnaround times, and make it possible to tackle problems that were previously out of reach.
The key point is scaling: SQL-based processing tends to stop scaling linearly beyond a certain ceiling, usually just a handful of nodes in a cluster. With MapReduce, you can keep getting performance gains by growing the cluster. In other words, double the size of a Hadoop cluster and a job runs roughly twice as fast, triple it and it runs roughly three times as fast, and so on.
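The reason this scales is that the map phase processes each input record independently, so the work can be split evenly across any number of machines. To make the procedural programming model concrete, here is a minimal sketch of map and reduce for the classic word-count problem, written in plain Python with no Hadoop installed; the function names and the in-memory "shuffle" step are illustrative stand-ins for what the framework does across a cluster:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input line.
    # Each record is independent, which is what lets Hadoop spread
    # this phase across many machines.
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group emitted values by key, as the framework would
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop scales", "hadoop is batch"]
print(reduce_phase(map_phase(lines)))
# {'hadoop': 2, 'scales': 1, 'is': 1, 'batch': 1}
```

In real Hadoop the same logic would be written as a Mapper and Reducer class in Java, and the input lines would come from files split across the cluster rather than from an in-memory list.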
