Monday, August 16, 2010

Ten Ways To Improve the RDBMS with Hadoop

1. Accelerating nightly batch business processes. Many organizations have production transaction systems that require nightly processing and have narrow windows in which to perform their calculations and analysis before the start of the next day. Because Hadoop scales linearly, internal or external on-demand cloud farms can dynamically handle shrinking performance windows and take on volumes that an RDBMS just can't easily deal with. This doesn't eliminate the import/export challenges, which vary by application, but it can certainly compress the windows between them.
2. Storage of extremely high volumes of enterprise data. The Hadoop Distributed File System (HDFS) is a marvel in itself and can be used to hold extremely large data sets safely on commodity hardware for the long term, data that otherwise couldn't be stored or handled easily in a relational database. I am specifically talking about volumes that today's RDBMSs would still have trouble with, such as dozens or hundreds of petabytes, which are common in genetics, physics, aerospace, counterintelligence, and other scientific, medical, and government applications. (A small HDFS loading sketch appears after this list.)
3. Creation of automatic, redundant backups. Hadoop can retain the data that it processes, even after it has been imported into other enterprise systems. HDFS creates a natural, reliable, and easy-to-use backup environment for almost any amount of data at a reasonable price, considering that it's essentially high-speed online storage. (See the backup sketch after this list.)
4. Improving the scalability of applications. Very low-cost commodity hardware can be used to power Hadoop clusters because redundancy and fault resistance are built into the software rather than purchased in the form of expensive, proprietary enterprise hardware or software. This makes adding capacity (and therefore scale) easier to achieve, and Hadoop is an affordable and very granular way to scale out instead of up. While there can be costs in converting existing applications to Hadoop, for new applications it should be a standard option in the software selection decision tree. Note: Hadoop's fault tolerance is acceptable, not best-of-breed, so check it against your application's requirements.
5. Use of Java for data processing instead of SQL. Hadoop is a Java platform and can be used by just about anyone fluent in the language (other language options are becoming available via additional APIs). While this won't help shops that have plenty of database developers, Hadoop can be a boon to organizations with strong Java environments and good architecture, development, and testing skills. And while yes, it's possible to use languages such as Java and C++ to write stored procedures for an RDBMS, it's not a widespread activity. (A minimal MapReduce example, the Java equivalent of a SQL GROUP BY, follows this list.)
6. Producing just-in-time feeds for dashboards and business intelligence. Hadoop excels at churning through enormous amounts of business data and producing detailed analysis that an RDBMS would often take too long, or cost too much, to deliver. Facebook, for example, uses Hadoop to produce daily and hourly summaries of activity from its 150 million+ monthly visitors. The resulting information can be quickly transferred to BI, dashboard, or mashup platforms.
7. Handling urgent, ad hoc requests for data. While expensive enterprise data warehousing software can certainly do this, Hadoop is a strong performer when it comes to quickly getting answers to urgent questions involving extremely large datasets.
8. Turning unstructured data into relational data. While ETL tools and bulk load applications work well with smaller datasets, few can approach the data volumes and performance that Hadoop can handle, especially at a similar price/performance point. The ability to take mountains of inbound or existing business data, spread the work over a large distributed cloud, add structure, and import the result into an RDBMS makes Hadoop one of the most powerful database import tools around. (A sketch of this pattern appears after the list.)
9. Taking on tasks that require massive parallelism. Hadoop has been known to scale out to thousands of nodes in production environments. Even better, exploiting that parallelism requires relatively little specialized programming skill, since it is an intrinsic property of the platform. While you can do the same with SQL, it takes skill and experience with the techniques; in other words, you have to know what you're doing. Organizations that are hitting ceilings with their current RDBMS can look to Hadoop to help break through them.
10. Moving existing algorithms, code, frameworks, and components to a highly distributed computing environment. Done right -- and there are challenges depending on what your legacy code wants to do -- Hadoop can be used to migrate old, single-core code into a highly distributed environment that provides efficient, parallel access to ultra-large datasets. Many organizations already have proven code that is tested, hardened, and ready to use but is limited without an enabling framework. Hadoop adds the mature distributed computing layer that can transition these assets to a much larger and more powerful modern environment. (See the legacy-wrapper sketch after this list.)
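
A few simplified Java sketches follow for the points above that reference them; all class names, paths, and parameters are illustrative assumptions rather than production code. First, for point 2: a minimal sketch of loading a large local export into HDFS through Hadoop's FileSystem API. The namenode address and file paths are hypothetical and error handling is omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ArchiveToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000"); // assumed cluster address
            FileSystem fs = FileSystem.get(conf);
            // Copy a large local data set into the cluster; HDFS splits it into
            // blocks and spreads replicated copies across commodity machines.
            fs.copyFromLocalFile(new Path("/exports/readings-2010-08.dat"),
                                 new Path("/archive/readings/2010-08.dat"));
            fs.close();
        }
    }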
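
For point 3, a small sketch of treating data already sitting in HDFS as a redundant online backup by raising its replication factor. The path and the factor of three are illustrative choices, not recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RetainAsBackup {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Data already processed by a job stays in HDFS as an online backup.
            Path processed = new Path("/processed/orders/2010-08-16/part-r-00000");
            // Ask HDFS to keep three copies of the file's blocks on different nodes.
            fs.setReplication(processed, (short) 3);
            System.out.println("replicas: " + fs.getFileStatus(processed).getReplication());
            fs.close();
        }
    }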
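
For point 5, a minimal MapReduce sketch, roughly the Java equivalent of a SQL "SELECT customer, SUM(amount) ... GROUP BY customer". The tab-delimited input layout and the paths are assumptions made for illustration.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SalesByCustomer {

        public static class ParseMapper
                extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Assumed input: one "customer<TAB>amount" record per line.
                String[] fields = value.toString().split("\t");
                context.write(new Text(fields[0]),
                              new DoubleWritable(Double.parseDouble(fields[1])));
            }
        }

        public static class SumReducer
                extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                    throws IOException, InterruptedException {
                double total = 0;
                for (DoubleWritable v : values) {
                    total += v.get();
                }
                context.write(key, new DoubleWritable(total)); // one total per customer
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "sales by customer");
            job.setJarByClass(SalesByCustomer.class);
            job.setMapperClass(ParseMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path("/sales/raw"));
            FileOutputFormat.setOutputPath(job, new Path("/sales/summary"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar and launched with the hadoop jar command, a job like this would typically produce the kind of periodic summary output a BI or dashboard feed could pick up, which also covers the scenario in point 6.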
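
For point 8, a map-only sketch that turns raw web-server log lines into tab-delimited rows that a standard RDBMS bulk loader could then ingest. The log layout, the regular expression, and the paths are assumptions.

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogsToRows {

        public static class ExtractMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            // Assumed log format: "<ip> <timestamp> <url> <status>"
            private static final Pattern LINE =
                    Pattern.compile("^(\\S+) (\\S+) (\\S+) (\\d{3})$");

            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                Matcher m = LINE.matcher(value.toString());
                if (m.matches()) {
                    // Emit one tab-delimited row per parseable line; skip the rest.
                    String row = m.group(1) + "\t" + m.group(2) + "\t"
                               + m.group(3) + "\t" + m.group(4);
                    context.write(new Text(row), NullWritable.get());
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "logs to rows");
            job.setJarByClass(LogsToRows.class);
            job.setMapperClass(ExtractMapper.class);
            job.setNumReduceTasks(0); // map-only: just add structure, no aggregation
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path("/logs/raw"));
            FileOutputFormat.setOutputPath(job, new Path("/logs/structured"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With zero reducers the job does no aggregation at all; it simply reformats at scale, and the structured output files can then be handed to the database's own bulk-load tooling.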
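
For point 10, a sketch of wrapping existing single-threaded business logic in a mapper so the same tested code runs in parallel across the cluster, one record at a time. The nested LegacyRiskModel class here is only a stand-in for whatever proven code an organization already has.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LegacyScoringMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Stand-in for existing, already-tested single-threaded business logic.
        public static class LegacyRiskModel {
            public String score(String record) {
                return String.valueOf(record.length() % 100); // placeholder calculation
            }
        }

        private LegacyRiskModel model;

        protected void setup(Context context) {
            model = new LegacyRiskModel(); // load the legacy model once per task
        }

        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String record = value.toString();
            // The legacy routine is unchanged; Hadoop supplies the parallelism by
            // running many copies of this mapper over different input splits.
            context.write(new Text(record), new Text(model.score(record)));
        }
    }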
