Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
A typical Hadoop cluster can have a few hundred to a few thousand nodes (physical servers). Setting up plain-vanilla Apache Hadoop and monitoring it can be a tedious task, and hence there are several distributions that provide tools for setting up and managing clusters.
Major Hadoop Distributions
There are several distributions of Hadoop, each supported by its respective vendor.
- Amazon EMR
- Cloudera
- Hortonworks
These vendors provide training, support, and services for their clients. As part of their distributions, these vendors provide tools that simplify both the cluster setup process and day-to-day operations.
To do: a video will be provided to explain the differences in detail
Hadoop ecosystem
The Hadoop ecosystem can be divided into core components and other tools. HDFS (Hadoop Distributed File System) is the foundation of the Hadoop ecosystem, and MapReduce is the distributed computing framework developed in tandem with HDFS. The other tools can be categorized into MapReduce-based tools and non-MapReduce-based tools.
- Hadoop core components
  - HDFS – Hadoop Distributed File System, the distributed storage layer
  - MapReduce – distributed computing framework developed in tandem with HDFS
- MapReduce-based tools
  - Hive – logical database on top of HDFS with a SQL-based interface; queries are executed as MapReduce jobs to process the data
  - Pig – data-flow-language-based interface on top of MapReduce to process data in HDFS
  - Sqoop – generic data movement tool to copy data between relational databases and HDFS, using MapReduce to leverage its distributed processing capabilities
  - Mahout – machine learning library that uses the MapReduce framework to process data
  - Oozie – MapReduce-based workflow tool
- Non-MapReduce-based tools
  - Flume – data integration tool whose agents collect streaming data from sources such as web logs and load it into targets such as HDFS
  - Spark – in-memory data processing tool that can accelerate data processing
  - Impala – alternative to Hive for processing lower volumes of data in a quicker, interactive fashion
  - HBase – a NoSQL database for building operational applications at scale
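To make the MapReduce model above concrete, here is a minimal, in-memory sketch of the classic word-count example in Python. This is an illustration only, not Hadoop's actual Java API: a real cluster runs the map and reduce tasks in parallel across many nodes with HDFS as storage, whereas here the map, shuffle/sort, and reduce phases are simulated locally in one process.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum all the counts emitted for a single word.
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle/sort: sort intermediate pairs so they can be grouped by key,
    # mimicking what the framework does between the map and reduce phases.
    intermediate = sorted(kv for line in lines for kv in mapper(line))
    return [
        reducer(word, (count for _, count in pairs))
        for word, pairs in groupby(intermediate, key=itemgetter(0))
    ]

lines = ["hadoop stores data in hdfs",
         "mapreduce processes data in hadoop"]
print(map_reduce(lines))
```

Because map emits independent key/value pairs and reduce only sees one key's values at a time, both phases can be distributed across a cluster; that independence is what lets the tools above (Hive, Pig, Sqoop, etc.) compile their work down to MapReduce jobs.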
Vendors such as Cloudera and Hortonworks package all of these tools as part of their distributions and provide wizards and tools to set up and maintain larger clusters.
This video covers an introduction to itversity and the course.