Hadoop and Spark Introduction
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
A typical Hadoop cluster can have from a few hundred to a few thousand nodes (physical servers). Setting up plain vanilla Apache Hadoop and monitoring it can be a tedious task, and hence there are several distributions that provide tools for setting up and managing clusters.
HDFS, the foundation of Hadoop, provides the solution for distributed storage, while Map Reduce provides the solution for distributed computing. Map Reduce works effectively for processing large batches at scale, but it does not work very well for interactive applications or micro batches. Spark, an in-memory distributed computing framework, can process micro batches as well as interactive applications effectively at scale. Lately Spark has been gaining a lot of momentum in large enterprises where enterprise hardware is used for infrastructure.
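To make the contrast concrete, here is a minimal PySpark word count sketch; the HDFS path is a hypothetical placeholder, not from any specific cluster. Because intermediate results can stay in memory across stages, the same pattern extends naturally to the interactive and iterative workloads that would otherwise require chained Map Reduce jobs.

```python
from pyspark.sql import SparkSession

# Hypothetical input path; any text file on HDFS or the local FS works
spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.read.text("hdfs:///data/weblogs/access.log").rdd.map(lambda r: r[0])

# Classic word count expressed as in-memory RDD transformations
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# take() pulls a small sample back to the driver for interactive inspection
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```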
There are several distributions of Hadoop, each supported by its respective vendor, including:
- Cloudera CDH
- Hortonworks HDP
- Amazon EMR
These vendors provide training, support, and services for their clients. As part of their distributions, they provide tools that simplify both cluster setup and day-to-day operations.
The Hadoop ecosystem can be divided into core components and other tools. HDFS (Hadoop Distributed File System) is the foundation of the ecosystem, and Map Reduce is the distributed computing framework developed in tandem with HDFS; the remaining tools can be categorized into Map Reduce based tools and non Map Reduce based tools:
- Hadoop core components
  - HDFS – Hadoop Distributed File System, the distributed storage layer
  - Map Reduce – Distributed computing framework developed in tandem with HDFS
- Map Reduce based tools
  - Hive – Logical database on top of HDFS with a SQL based interface that compiles queries into Map Reduce jobs to process the data (see the HiveQL sketch after this list)
  - Pig – Data flow language based interface on top of Map Reduce to process the data in HDFS
  - Sqoop – Generic data movement tool that copies data between relational databases and HDFS using Map Reduce, leveraging its distributed processing capabilities (see the import command after this list)
  - Mahout – Machine learning library which uses the Map Reduce framework to process the data
  - Oozie – Map Reduce based workflow tool
- Non Map Reduce based tools
  - Flume – Data integration tool whose agents collect streaming data from sources such as weblogs and load it into targets such as HDFS (see the sample agent configuration after this list)
  - Spark – In-memory data processing tool which can accelerate data processing
  - Impala – Alternative to Hive for processing lower volumes of data in a quicker, interactive fashion
  - HBase – A NoSQL database for building applications which are operational in nature at scale (see the shell sketch after this list)
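To make the Hive entry above concrete, here is a minimal HiveQL sketch, assuming a hypothetical comma-delimited `orders` dataset already in HDFS; the table name, columns, and path are illustrative only.

```sql
-- Map an existing HDFS directory to a logical table (path and schema are hypothetical)
CREATE EXTERNAL TABLE orders (
  order_id    INT,
  customer_id INT,
  amount      DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/orders';

-- Hive compiles this SQL into one or more Map Reduce jobs behind the scenes
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;
```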
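The Sqoop entry is easiest to picture as a command line. The sketch below shows a typical import, assuming a hypothetical MySQL database `sales` and table `orders`; the connection string, credentials, and target directory are placeholders.

```
# Copy the "orders" table from MySQL into HDFS using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```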
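The Flume entry describes an agent pipeline of source, channel, and sink. Below is a sketch of a Flume properties file, assuming a hypothetical agent named `a1` tailing a web server log into HDFS; the log path and HDFS URL are placeholders.

```
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail a web server access log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory

# Sink: write events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/weblogs

# Wire source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```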
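Finally, the HBase entry is clearest from its shell. These are standard HBase shell commands, with a hypothetical `orders` table and a single column family `cf`:

```
create 'orders', 'cf'                       # table with one column family
put 'orders', 'row1', 'cf:amount', '19.99'  # write a single cell
get 'orders', 'row1'                        # random read by row key
scan 'orders', {LIMIT => 10}                # range scan over row keys
```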
Vendors such as Cloudera and Hortonworks package all these tools as part of their distributions and provide wizards and tools to set up and maintain large clusters.