Monday, June 24, 2013

Hadoop?


What is Hadoop?

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
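To make the fault-tolerance claim concrete, here is a minimal toy model of block replication. It is a sketch under stated assumptions, not how HDFS is implemented: the node names, the round-robin placement and the helper functions are all hypothetical, and real HDFS placement is rack-aware. The only value taken from HDFS is the default replication factor of 3.

```python
import itertools

REPLICATION = 3  # HDFS's default replication factor; everything else here is a toy model


def place_blocks(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin (hypothetical policy)."""
    placement = {}
    for block in range(num_blocks):
        placement[block] = {nodes[(block + i) % len(nodes)] for i in range(replication)}
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # made-up node names
placement = place_blocks(num_blocks=10, nodes=nodes)

# With three replicas per block, check every possible pair of simultaneous node failures.
survives_two_failures = all(
    len(readable_blocks(placement, failed)) == 10
    for failed in itertools.combinations(nodes, 2)
)
```

In this toy setup every block stays readable when any two of the five nodes fail, because each block lives on three different nodes. Losing three nodes at once can make some blocks unreadable, which is why a real cluster re-replicates blocks after a failure.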

Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. The current Apache Hadoop ecosystem consists of Hadoop Common (the core libraries, sometimes called the Hadoop kernel), MapReduce, the Hadoop Distributed File System (HDFS) and a number of related projects such as Apache Hive, HBase and ZooKeeper. The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising. Linux is the preferred operating system, but Hadoop can also work with Windows, BSD and OS X.
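The idea of breaking a job into fragments that can run anywhere is easiest to see in the classic word-count example. What follows is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python. It does not use the Hadoop API, and the function names and sample fragments are made up for illustration.

```python
from collections import defaultdict


def map_phase(fragment):
    """Emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Group all emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment is mapped independently of the others, so on a cluster
    # any node could process any fragment.
    emitted = [pair for fragment in fragments for pair in map_phase(fragment)]
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())


fragments = ["big data big cluster", "data node", "cluster node node"]
counts = word_count(fragments)
print(counts)
```

Because `map_phase` looks at only one fragment and `reduce_phase` at only one key, neither needs to know where the other pieces are running. That independence is what lets a framework like Hadoop spread the work across many machines.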

Why Hadoop? What is Big Data?

Big data is a general term used to describe the voluminous amount of unstructured and semi-structured data a company creates, data that would take too much time and cost too much money to load into a relational database for analysis. (Big data doesn't refer to any specific quantity, but the term is often used when speaking about petabytes and exabytes of data.) A primary goal of analyzing big data is to discover repeatable business patterns. It’s generally accepted that unstructured data, most of it located in text files, accounts for at least 80% of an organization’s data. If left unmanaged, the sheer volume of unstructured data that’s generated each year within an enterprise can be costly in terms of storage. Unmanaged data can also pose a liability if information cannot be located in the event of a compliance audit or lawsuit. Big data analytics is often associated with cloud computing because the analysis of large data sets in real time requires a framework like MapReduce to distribute the work among tens, hundreds or even thousands of computers.