Spark is an in-memory distributed computing engine. As part of this lesson we will see how to get started with Spark, covering the following topics (a minimal example follows the list below):
- Setting up the development environment – we will set up a development environment on our PC
- Using Big Data labs or virtual machine images – it is better to use labs or virtual machines to explore all the modules in Spark
- Architecture of Spark – we will go through the architecture of Spark
- Spark Modules – on top of core Spark we have other modules such as Data Frames
- Execution modes – the different execution modes of Spark
- Revision of Scala – as the course builds applications using Scala, we will revise the important Scala concepts needed to get started with Spark
- Revision of File Systems – running Spark on top of different file systems
- Overview of YARN – one of the execution frameworks Spark can run on; others include Mesos
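To give a feel for what working with Spark looks like before we dive into the topics above, here is a minimal word count sketch for the Scala spark-shell, where a SparkContext is already available as `sc`. The input path is a hypothetical placeholder, not a file from this course.

```scala
// Run inside spark-shell; sc (SparkContext) is pre-created there.
// The input path below is a hypothetical placeholder.
val lines = sc.textFile("/public/data/sample.txt")

val wordCounts = lines.
  flatMap(line => line.split(" ")). // split each line into words
  map(word => (word, 1)).           // pair each word with a count of 1
  reduceByKey(_ + _)                // add up the counts per word

// take(10) pulls a small sample back to the driver for inspection
wordCounts.take(10).foreach(println)
```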
Difference between Spark and Hadoop Map Reduce
- Spark is also a distributed computing or processing engine, like Hadoop Map Reduce
- Unlike Hadoop Map Reduce, which is I/O bound (intermediate results are written to disk), Spark keeps intermediate data in memory (see the sketch after this list)
- Spark can process data sets ranging from small to large much faster than Hadoop Map Reduce, as long as the data being processed fits in memory
- Even though Spark is gaining prominence at the cost of Map Reduce, Map Reduce based tools can still be used to process very large data sets that do not fit in memory
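To make the in-memory point concrete, here is a sketch using `cache()`: once an RDD is cached, later actions reuse the in-memory data instead of re-reading the input from disk, which is exactly where Map Reduce pays an I/O cost on every pass. The input path and the "COMPLETE" filter are hypothetical placeholders.

```scala
// Sketch for spark-shell (sc is the SparkContext); path is a placeholder.
val orders = sc.textFile("/public/data/orders.csv")

// cache() marks the RDD to be kept in memory after it is first computed
orders.cache()

// The first action reads from disk and populates the cache
val total = orders.count()

// Subsequent actions reuse the in-memory data instead of re-reading from
// disk - this is where Spark gains over Map Reduce for repeated passes
val completed = orders.filter(_.contains("COMPLETE")).count()
```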
Pre-requisites
- 4 GB RAM (to explore in local mode or using labs.itversity.com; see the local mode sketch after this list)
- With 4 GB RAM, it might be a challenge to explore modules such as Data Frames
- 16 GB RAM for using the free virtual machine images from Cloudera, Hortonworks, etc. (highly recommended)
- 64-bit operating system
- Learn Scala for Spark
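For the 4 GB local mode setup mentioned above, here is a minimal sketch of a standalone application that creates a SparkContext running locally on your PC. The object name and app name are arbitrary placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of a Spark application running in local mode.
// "local[*]" uses all available cores on the PC; the app name is a placeholder.
object GettingStarted {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().
      setMaster("local[*]").
      setAppName("Getting Started with Spark")
    val sc = new SparkContext(conf)

    // Parallelize a small in-memory collection just to verify the setup
    val rdd = sc.parallelize(1 to 100)
    println(rdd.sum())

    sc.stop()
  }
}
```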