
July 26, 2017

Data Frames and Spark SQL

As part of this lesson, let us understand Data Frames and Spark SQL in detail. We will cover:

  • Data Frames and Data Frame operations
  • Structured Query Language – SQL
  • Spark SQL – Native Context
  • Spark SQL – Hive Context
  • Spark SQL – JDBC
  • Different file formats

In case you need a well-integrated environment to practice these, please sign up for the lab.

Rationale

  • In many cases data will be structured
  • Rather than writing code, we can leverage simple interfaces such as SQL to process structured data
  • Additional optimizations – Spark SQL can analyze queries and generate efficient execution plans
  • Same execution engine irrespective of the programming language used

Typical transformations

  • Filtering the data
  • Data Cleansing
  • Data Standardization
  • Aggregation
  • Sorting and Ranking
  • Joining the data sets
  • Analytic or Windowing functions

Prepare for Data Frames

To explore Data Frames and perform the necessary operations, we will use the retail_db database (especially the orders and order_items tables).


Lesson Topics
  • Data Frames - Getting started
  • Data Frames and Operations - Example
  • Data Frames - Spark SQL - Native
  • Data Frames - Spark SQL - Hive Context
  • Different file formats


Article by Training Itversity / Big Data

Join the discussion at discuss.itversity.com
