
Data Processing using PySpark Data Frame APIs

Build data engineering applications at scale using Spark integrated with HDFS and YARN on a multi-node cluster.

Intermediate Level

14 hours

Flexible Schedule

What’s Included

  • Multi-Node Cluster Access
  • Reference Materials
  • Practice Tasks & Exercises
  • Expert Live Sessions
  • Study Groups
  • Guided Support


  • No technical prerequisites are required; however, programming knowledge is highly desirable
  • A computer with a decent configuration and a reliable internet connection (at least 4 GB RAM and a dual-core processor; 8 GB RAM and a quad-core processor are highly desirable)

About this course

PySpark Data Frame APIs are an alternative way of processing data that leverages the distributed computing capabilities of Spark. Data engineers from application development backgrounds might prefer Data Frame APIs over Spark SQL for building data engineering applications. Once you have gone through the Spark content in a Jupyter-based environment, we will also walk you through how Spark applications are typically developed using Python, deployed, and reviewed.

What you’ll learn in this Course

  • Data Processing Overview using Spark or PySpark Data Frame APIs.
  • Projecting or Selecting data from Spark Data Frames, renaming columns, providing aliases, dropping columns from Data Frames, etc. using PySpark Data Frame APIs.
  • Processing Column Data using Spark or PySpark Data Frame APIs. You will learn functions to manipulate strings, dates, null values, etc.
  • Basic Transformations on Spark Data Frames using PySpark Data Frame APIs, such as Filtering, Aggregations, and Sorting, using functions such as filter/where, groupBy with agg, and sort or orderBy.
  • Joining Data Sets on Spark Data Frames using PySpark Data Frame APIs such as join. You will learn inner joins, outer joins, etc. through suitable examples.
  • Windowing Functions on Spark Data Frames using PySpark Data Frame APIs to perform advanced Aggregations, Ranking, and Analytic Functions.
  • Spark Metastore Databases and Tables, and integration between Spark SQL and Data Frame APIs.
  • Development, Deployment, and Execution Life Cycle of Spark Applications.
  • Set up a Python Virtual Environment and Project for Spark Application Development using PyCharm.
  • Understand the complete Spark Application Development Life Cycle using PyCharm and Python.
  • Build a zip file for the Spark Application, copy it to the environment where it is supposed to run, and run it. Understand how to review the Spark Application Execution Life Cycle.
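The last item in the list, packaging an application as a zip file and running it, can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `build_app_zip` and `spark_submit_command`, the directory layout, and the file names are hypothetical and not taken from the course. The sketch relies on the standard `spark-submit` options `--master` and `--py-files`; the course may use a different packaging approach.

```python
import shlex
import zipfile
from pathlib import Path


def build_app_zip(src_dir, zip_path):
    """Bundle every .py file under src_dir into zip_path, keeping relative paths."""
    src = Path(src_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for py_file in sorted(src.rglob("*.py")):
            # Store paths relative to src_dir so packages import cleanly
            # once the zip is placed on the Python path.
            zf.write(py_file, py_file.relative_to(src).as_posix())
    return Path(zip_path)


def spark_submit_command(zip_path, entry_point, master="yarn"):
    """Return a spark-submit command line that ships the zip via --py-files."""
    args = [
        "spark-submit",
        "--master", master,
        "--py-files", str(zip_path),
        str(entry_point),
    ]
    return " ".join(shlex.quote(arg) for arg in args)
```

For example, `build_app_zip("src", "myapp.zip")` followed by `spark_submit_command("myapp.zip", "app.py")` would produce the zip and a command string to run on the machine where the application is deployed. Copying the files to that machine (for instance with `scp`) is left out of the sketch.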
Founder and Chief Instructor, ITVersity, Inc.

Durga Gadiraju

Technology Adviser and Evangelist with 20+ years of IT experience in executing complex projects using a vast array of technologies including Big Data and Cloud.
Patrick Jones - Course author