Apache Spark is an open-source cluster computing framework. This lesson covers the topics related to Spark using Scala.
- It works with many file systems (S3, HDFS, etc.)
- Processing is done in memory
- It is effective at processing streaming data loads
- It is primarily backed by Databricks
- There are many components in the Spark ecosystem, such as Core Spark (transformations and actions), Streaming, MLlib, GraphX, Data Frames (from 1.3.x), etc.
For the CCA Spark and Hadoop Developer certification, we just need to focus on Core Spark, which means the core set of transformations and actions. While other components change across versions, core Spark syntax has remained almost the same.
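To give a feel for what "core transformations and actions" means, here is a minimal sketch in the Scala spark-shell, where `sc` (the SparkContext) is already available; the input path is hypothetical:

```scala
// Read a text file from HDFS; the path is just an example.
val lines = sc.textFile("hdfs:///user/cloudera/wordcount/input")

// Transformations are lazy: they only describe the computation.
val counts = lines.
  flatMap(line => line.split(" ")).   // split each line into words
  map(word => (word, 1)).             // pair each word with a count of 1
  reduceByKey(_ + _)                  // sum the counts per word

// Actions trigger the actual execution on the cluster.
counts.take(10).foreach(println)
```

Note that nothing runs until the action (`take`) is invoked; this lazy-evaluation model is central to how Spark builds and optimizes its execution plan.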
Here is the documentation that will be provided at the time of the certification exam. One needs to use the Spark programming guide.

Here is the video introducing transformations, stages, and storage. One needs to be knowledgeable enough to use the Spark APIs with both programming languages, Scala and Python.

Optionally, one can set up Spark 1.2.1 on the Cloudera QuickStart VM. It is not required unless one wants to invoke Hive queries as part of the certification.
- Spark with Scala - getting started
- Moving data between HDFS and Spark - Scala
- Word count using Spark with Scala
- Joining disparate data sets using Scala
- Aggregating data sets using Scala - totals
- Aggregating data sets using Scala - by key
- Filtering data using Scala
- Sorting and ranking using Scala - global
- Sorting and ranking using Scala - by key
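As a taste of several of the topics above (joining, aggregating by key, and global sorting), here is a small sketch for the Scala spark-shell, assuming `sc` is available; the data sets are made up for illustration:

```scala
// Hypothetical data: (orderId, date) and (orderId, subtotal).
val orders = sc.parallelize(Seq(
  (1, "2014-07-25"), (2, "2014-07-25"), (3, "2014-07-26")))
val orderItems = sc.parallelize(Seq(
  (1, 299.98), (1, 199.99), (2, 250.0), (3, 129.99)))

// Join the two data sets on orderId, then total revenue per date.
val revenuePerDate = orders.join(orderItems).          // (orderId, (date, subtotal))
  map { case (_, (date, subtotal)) => (date, subtotal) }.
  reduceByKey(_ + _).                                  // aggregate by key
  sortByKey()                                          // global sort by date

revenuePerDate.collect().foreach(println)
```

Each of these operations (`join`, `reduceByKey`, `sortByKey`) is covered in its own lesson below; this sketch only shows how they compose.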