July 26, 2017

Getting Started with Spark – Python

Spark is an in-memory distributed computing engine. As part of this lesson we will see how to get started with Spark.

  • Setting up development environment – we will set up the development environment on our PC
  • Using Big Data labs or virtual machine images – it is better to use the labs or virtual machine images to explore all the modules in Spark
  • Architecture of Spark – we will go through the architecture of Spark
  • Spark Modules – on top of core Spark we have other modules such as Data Frames
  • Execution modes – the different execution modes of Spark (see the sketch after this list)
  • Revision of Python – as the course builds applications using Python, we will revise the important Python concepts needed to get started with Spark
  • Revision of File Systems – Spark runs on top of different file systems
  • Overview of YARN – one of the execution frameworks for Spark; others include Mesos
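
To make the execution modes concrete, here is a minimal sketch of a first Spark program in Python, run in local mode on a single PC. It assumes pyspark is installed locally; the application name and the data are made up for illustration.

    # Minimal PySpark program running in local mode; local[*] uses all cores of the PC
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("getting-started").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Distribute a small in-memory collection as an RDD and process it in parallel
    rdd = sc.parallelize(range(1, 101))
    total = rdd.filter(lambda i: i % 2 == 0).sum()
    print(total)  # 2550, the sum of the even numbers from 1 to 100

    sc.stop()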

Difference between Spark and Hadoop MapReduce

  • Spark is also a distributed computing or processing engine, like Hadoop MapReduce
  • Unlike Hadoop MapReduce, which writes intermediate results to disk and is hence I/O bound, Spark keeps intermediate data in memory (the caching sketch after this list shows the idea)
  • Spark can process data sets ranging from small to very large much faster than Hadoop MapReduce, as long as the data being processed fits in memory
  • Even though Spark is gaining prominence at the cost of MapReduce, MapReduce-based tools can still be used to process very large data sets that do not fit in memory
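
The sketch below makes the in-memory point concrete: once an RDD is cached, later actions are served from memory instead of recomputing the lineage from the source. It reuses the SparkContext sc from the earlier sketch, and the file path is hypothetical – substitute any text file you have.

    # Cache the parsed records so later actions are served from memory
    orders = sc.textFile("/data/retail/orders.txt")   # hypothetical path
    parsed = orders.map(lambda line: line.split(","))
    parsed.cache()

    print(parsed.count())   # first action: reads the file and materializes the cache
    print(parsed.first())   # later actions: served from memory, no re-read from disk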

Prerequisites

  • 4 GB RAM (to explore in local mode or using labs.itversity.com)
  • With 4 GB RAM, it might be a challenge to explore modules such as Data Frames
  • 16 GB RAM for using the free virtual machine images from Cloudera, Hortonworks, etc. (highly recommended)
  • 64 bit operating system
  • Learn Python for Spark

Lesson Topics
  • Setting up environment - Spark
  • Using labs.itversity.com
  • Architecture of Spark - Python
  • Revision of Python
  • File Systems Overview
  • Overview of YARN
  • Launching pyspark (see the sanity-check sketch below)
  • Setup data sets

