Monday, August 16, 2010

5 Questions - Hadoop

8.2 Does Hadoop replace databases or other existing systems?

No.
Hadoop is not a database, nor does it need to replace any existing data systems you may have. Hadoop is a massively scalable storage and batch data processing system. It provides an integrated storage and processing fabric that scales horizontally on commodity hardware and provides fault tolerance through software. Rather than replace existing systems, Hadoop augments them by offloading the particularly difficult problem of simultaneously ingesting, processing and delivering/exporting large volumes of data, so existing systems can focus on what they were designed to do: whether that is serving real-time transactional data or providing interactive business intelligence. Furthermore, Hadoop can absorb any type of data, structured or not, from any number of sources. Data from many sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide. Lastly, these results can be delivered to any existing enterprise system for further use, independent of Hadoop.
For example, consider an RDBMS used for serving real-time data and ensuring transactional consistency. Asking that same database to generate complex analytical reports over large volumes of data is performance intensive and detracts from its ability to do what it does well. Hadoop is designed to store vast volumes of data, process that data in arbitrary ways, and deliver that data to wherever it is needed. By regularly exporting data from an RDBMS, you can tune your database for its interactive workload and use Hadoop to conduct arbitrarily complex analysis offline without impacting your real-time systems.
8.3 How does MapReduce relate to Hadoop and other systems?

Hadoop is an open source implementation of both the MapReduce programming model and the underlying distributed file system that Google developed to support web-scale data.
The MapReduce programming model was designed by Google to enable a clean abstraction between large-scale data analysis tasks and the underlying systems challenges involved in ensuring reliable large-scale computation. By adhering to the MapReduce model, a data processing job can be easily parallelized, and the programmer doesn’t have to think about system-level details such as synchronization, concurrency, and hardware failure.
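To make the abstraction concrete, here is a minimal sketch of the canonical word count job written against Hadoop's Java MapReduce API; the class names and paths are illustrative. The mapper and reducer contain only the analysis logic, while Hadoop handles splitting the input, shuffling and sorting intermediate pairs, and re-running failed tasks.

    // A minimal sketch of the MapReduce model using Hadoop's Java API: the
    // classic word count. The mapper emits (word, 1) pairs, the reducer sums
    // them; Hadoop handles input splitting, shuffling, sorting and retries.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      public static class TokenMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);               // emit (word, 1)
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          context.write(word, new IntWritable(sum)); // emit (word, total)
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);      // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }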
What limits the scalability of any MapReduce implementation is the underlying system storing your data and feeding it to MapReduce.
Google chose not to use an RDBMS for this because the overhead incurred from maintaining indexes, relationships and transactional guarantees isn’t necessary for batch processing. Rather, they designed a new distributed file system, from the ground up, that worked well with “shared nothing” architecture principles. They purposefully abandoned indexes, relationships and transactional guarantees because those features limit horizontal scalability and slow down loading and batch processing of semi-structured and unstructured data.
Some RDBMSs provide MapReduce functionality. In doing so, they give programmers an easy way to create queries more expressive than SQL allows, without imposing any additional scalability constraints on the database. However, MapReduce alone does not address the fundamental challenge of scaling an RDBMS out horizontally.
If you need indexes, relationships and transactional guarantees, you need a database. If you need a database, one that supports MapReduce will allow for more expressive analysis than one that does not.
However, if your primary need is a highly scalable storage and batch data processing system, you will often find that Hadoop utilizes commodity resources effectively to deliver a lower cost per TB for data storage and processing.
8.4 How do existing systems interact with Hadoop?

Hadoop often serves as a sink for many sources of data because Hadoop allows you to store data cost effectively and process that data in arbitrary ways at a later time. Because Hadoop doesn’t maintain indexes or relationships, you don’t need to decide how you want to analyze your data in advance. Let’s look at how various systems get data into Hadoop.
Databases: Hadoop has native support for extracting data over JDBC. Many databases also have bulk dump / load functionality. In either case, depending on the type of data, it is easy to dump the entire database to Hadoop on a regular basis, or just export the updates since the last dump. Often, you will find that by dumping data to Hadoop regularly, you can store less data in your interactive systems and lower your licensing costs going forward.
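As a rough illustration of this kind of export, the map-only job below uses Hadoop's built-in JDBC support (DBInputFormat) to copy a hypothetical orders table into a dated HDFS directory. The JDBC driver, connection string, credentials, table and column names are all placeholders for your own environment, and in practice a WHERE clause against a timestamp column would limit the dump to updates since the last run.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DumpOrdersToHdfs {

      // One row of the (hypothetical) "orders" table.
      public static class OrderRecord implements Writable, DBWritable {
        long id;
        double total;

        public void readFields(DataInput in) throws IOException {
          id = in.readLong();
          total = in.readDouble();
        }
        public void write(DataOutput out) throws IOException {
          out.writeLong(id);
          out.writeDouble(total);
        }
        public void readFields(ResultSet rs) throws SQLException {
          id = rs.getLong("id");
          total = rs.getDouble("total");
        }
        public void write(PreparedStatement stmt) throws SQLException {
          stmt.setLong(1, id);
          stmt.setDouble(2, total);
        }
      }

      // Map-only job: turn each database row into one tab-separated line in HDFS.
      public static class RowToTextMapper
          extends Mapper<LongWritable, OrderRecord, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, OrderRecord row, Context context)
            throws IOException, InterruptedException {
          context.write(NullWritable.get(), new Text(row.id + "\t" + row.total));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JDBC driver, connection URL and credentials for the source database.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost/shop", "reporting", "secret");

        Job job = Job.getInstance(conf, "dump-orders");
        job.setJarByClass(DumpOrdersToHdfs.class);
        job.setMapperClass(RowToTextMapper.class);
        job.setNumReduceTasks(0);                    // plain export, no aggregation
        job.setInputFormatClass(DBInputFormat.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        // Read the whole table; the 4th argument is an optional WHERE clause
        // that could restrict the dump to rows changed since the last run.
        DBInputFormat.setInput(job, OrderRecord.class,
            "orders", null, "id", "id", "total");

        FileOutputFormat.setOutputPath(job, new Path("/data/orders/2010-08-16"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }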
Log Generators: Many systems, from web servers to sensors, generate log data and some do so at astonishing rates. These log records often have a semi-structured nature and change over time. Analyzing these logs is often difficult because they don’t “fit” nicely in relational databases and take too long to process on a single machine. Hadoop makes it easy for any number of systems to reliably stream any volume of logs to a central repository for later analysis. You often see users dump each day’s logs into a new directory so they can easily run analysis over any arbitrary time-frame’s worth of logs.
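As a simple sketch of that pattern, the snippet below copies a web server's local log file into a per-day directory in HDFS. The NameNode address and paths are assumptions, and in practice a cron job or a log collection agent would do this continuously.

    import java.net.InetAddress;
    import java.net.URI;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShipDailyLogs {
      public static void main(String[] args) throws Exception {
        // Connect to the cluster; the NameNode address is illustrative.
        FileSystem fs = FileSystem.get(
            new URI("hdfs://namenode:8020"), new Configuration());

        // One directory per day makes it easy to analyze arbitrary date ranges.
        String day = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
        Path hdfsDir = new Path("/logs/webserver/" + day);
        fs.mkdirs(hdfsDir);

        // Name the file after this host so many servers can ship logs
        // into the same directory without collisions.
        String host = InetAddress.getLocalHost().getHostName();
        Path localLog = new Path("/var/log/webserver/access.log");
        fs.copyFromLocalFile(localLog, new Path(hdfsDir, host + ".log"));
        fs.close();
      }
    }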
Scientific Equipment: As sensor technology improves, we see many scientific devices, ranging from imagery systems (medical, satellite, etc.) to DNA sequencers to high-energy physics detectors, generating data at rates that vastly exceed both the write speed and capacity of a single disk. These systems can write data directly to Hadoop, and as the data generation rate and processing demands increase, they can be addressed simply by adding more commodity hardware to your Hadoop cluster.
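A minimal sketch of writing instrument data straight into HDFS is below; the NameNode address, the output path, and the readNextSample() acquisition call are all hypothetical stand-ins for whatever the device's capture software actually provides.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamSamplesToHdfs {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
            new URI("hdfs://namenode:8020"), new Configuration());

        // Each capture run gets its own file; HDFS splits it into blocks and
        // replicates them across the cluster as the data is written.
        Path out = new Path("/instruments/sequencer-01/run-42.dat");
        FSDataOutputStream stream = fs.create(out);
        try {
          byte[] sample;
          while ((sample = readNextSample()) != null) {  // hypothetical data source
            stream.write(sample);
          }
        } finally {
          stream.close();
          fs.close();
        }
      }

      // Placeholder for whatever acquisition API the instrument exposes.
      private static byte[] readNextSample() {
        return null;
      }
    }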
Hadoop is agnostic to what type of data you store. It breaks data into manageable chunks, replicates them, and distributes multiple copies across all the nodes in your cluster so you can process your data quickly and reliably later. You can now conduct analysis that consumes all of your data. These aggregates, summaries, and reports can be exported to any other system using raw files, JDBC, or custom connectors.
8.5 How can users across an organization interact with Hadoop?

One of the great things about Hadoop is that it exposes massive amounts of data to a diverse set of users across your organization. It creates and drives a culture of data, which in turn empowers individuals at every level to make better business decisions.
When a DBA designs and optimizes a database, she considers many factors. First and foremost are the structure of the data, the access patterns for the data, and the views / reports required of the data. These early decisions limit the types of queries the database can respond to efficiently. As business users request more views on the data, it becomes a constant struggle to maintain performance and deliver new reports in a timely manner. Hadoop enables a DBA to optimize the database for its primary workload and regularly export the data for analytical purposes.
Once data previously locked up in database systems is available for easy processing, programmers can transform this data in any number of ways. They can craft more expressive queries and generate more data- and CPU-intensive reports without impacting production database performance. They can build pipelines that leverage data from many sources for research, development, and business processes. Programmers may find our online Hadoop training useful, especially Programming with Hadoop.
But working closely with data doesn’t stop with DBAs and programmers. By providing simple, high-level interfaces, Hadoop enables less technical users to ask quick, ad-hoc questions about any data in the enterprise. This enables everyone from product managers to analysts to executives to participate in, and drive, a culture focused on data. For more details, check out our online training for Hive (a data warehouse for Hadoop with an SQL interface) and Pig (a high-level language for ad-hoc analysis).
8.6 How do I understand and predict the cost of running Hadoop?

One of the nice things about Hadoop is that understanding your costs in advance is relatively simple. Hadoop is free software and it runs on commodity hardware, which includes cloud providers like Amazon. It has been demonstrated to scale beyond tens of petabytes (PB). More importantly, it does so with linear performance characteristics and cost.
Hadoop uses commodity hardware, so every month your costs decrease, or the same price point buys more capacity. You probably have a vendor you like, and they probably sell a dual quad-core (8 cores total) machine with 4×1TB SATA disks (you specifically don’t want RAID). Your budget and workload will decide whether you want 8 or 16GB of RAM per machine. Hadoop uses three-way replication for data durability, so your “usable” capacity will roughly be your raw disk capacity divided by 3. For machines with 4×1TB disks, 1 TB is a good estimate for usable space, as it leaves some overhead for intermediate data and the like. It also makes the math really easy. If you use EC2, two extra large instances provide about 1 usable TB.
Armed with your initial data size, the growth rate of that data, and the cost per usable TB, you should be able to estimate the cost of your Hadoop cluster. You will also incur operational costs, but, because the software is common across all the machines, and requires little per-machine tuning, the operational cost scales sub-linearly.
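As a back-of-the-envelope example of that arithmetic, the snippet below sizes a cluster using the rule of thumb above (roughly 1 usable TB per 4×1TB-disk node after three-way replication and overhead); every input figure is an assumption chosen purely to illustrate the calculation, not a quote.

    public class ClusterCostEstimate {
      public static void main(String[] args) {
        double initialDataTb   = 20.0;    // data on hand today (assumed)
        double monthlyGrowthTb = 2.0;     // new data per month (assumed)
        double months          = 12.0;    // planning horizon
        double usableTbPerNode = 1.0;     // 4x1TB disks, 3-way replication + overhead
        double costPerNode     = 4000.0;  // assumed commodity server price, USD

        double dataAtHorizonTb = initialDataTb + monthlyGrowthTb * months;
        int nodes = (int) Math.ceil(dataAtHorizonTb / usableTbPerNode);

        System.out.printf("Data after %.0f months: %.0f TB%n", months, dataAtHorizonTb);
        System.out.printf("Nodes needed: %d%n", nodes);
        System.out.printf("Hardware cost: $%.0f%n", nodes * costPerNode);
      }
    }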
To make things even easier, Cloudera provides our distribution for both local deployments and EC2 free of charge under the Apache 2.0 license. You can learn more at http://www.cloudera.com/hadoop
