Apache Spark is an open-source cluster computing framework. This lesson covers all the Spark topics needed, using Python.
- It works with any Hadoop-compatible file system (S3, HDFS, etc.)
- Processing is done in-memory wherever possible
- It is effective at processing streaming data loads
- It is developed under the Apache Software Foundation, with Databricks as its primary commercial backer
- There are many components in the Spark ecosystem, such as Spark Core (the core set of Transformations and Actions), Streaming, MLlib, GraphX, Data Frames (from 1.3.x), etc.
For the CCA Spark and Hadoop Developer certification, we only need to focus on Spark Core, which means the core set of Transformations and Actions. While the other components change between versions, the core Spark syntax has remained almost the same.
Here is the documentation that will be provided at the time of taking the certification exam. One needs to use the Spark programming guide.
Here is the video introducing Transformation, Stage and Store. One needs to be comfortable using the Spark APIs in both programming languages – Scala and Python.
Optionally, one can set up Spark 1.2.1 on the Cloudera Quickstart VM. It is not required unless one wants to invoke Hive queries as part of the certification.
Here are the topics that are covered as part of this lesson using pyspark:
- Load data from HDFS and store results back to HDFS using Spark
- Join disparate datasets together using Spark
- Calculate aggregate statistics (e.g., average or sum) using Spark
- Filter data into a smaller dataset using Spark
- Write a query that produces ranked or sorted data using Spark
- Pyspark - Getting Started
- Move data between HDFS and Spark - pyspark
- Word count using pyspark
- Joining disparate data sets using pyspark
- Aggregating data sets using pyspark - totals
- Aggregating data sets using pyspark - by key
- Filtering data using pyspark
- Sorting and Ranking using pyspark - global
- Sorting and Ranking using pyspark - by key