IT Versity

making IT resourceful


July 30, 2017

Sorting and Ranking – sortByKey and groupByKey


Sorting is typically done with sortByKey, while more complex sorting as well as ranking is typically done with groupByKey.

  • To sort the data by composite keys, we need to bring all the fields we want to sort on into the key
  • Data can be sorted in ascending or descending order based on all the fields in the key
  • If we have to sort the data in any way other than plain ascending or descending order on all the key fields, we need groupByKey
  • groupByKey generates the key along with an Iterable of all the values associated with that key
  • We can perform whatever transformations we need on that Iterable of values
    • Aggregations – even though aggregations can be done this way, reduceByKey and aggregateByKey should be preferred
    • Sorting – with custom logic
    • Ranking – sparse as well as dense ranking
  • The Iterable is a Scala collection, so you need some knowledge of Scala collections to apply collection-style transformations
  • After groupByKey we typically use map or flatMap and then apply collection transformations within each group
  • We can run these examples on a local Spark installation, on virtual machines, or on big data labs
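Ranking within a group is just a transformation applied to that group's values. As a sketch (plain Python standing in for the per-key logic, with made-up revenue figures), these are the kinds of functions that could be applied to each key's Iterable of values via flatMap after groupByKey:

```python
# Per-key ranking logic of the kind applied via flatMap after groupByKey.
# The revenue figures below are invented for illustration; in Spark the
# same functions would receive each key's Iterable of values.

def sparse_rank(values):
    """Rank values descending; ties share a rank and the next rank skips,
    e.g. 1, 2, 2, 4 ... (like SQL RANK())."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for i, v in enumerate(ordered):
        if i > 0 and v == ordered[i - 1]:
            ranks.append((ranks[-1][0], v))   # tie: reuse the previous rank
        else:
            ranks.append((i + 1, v))          # rank = position, leaving gaps after ties
    return ranks

def dense_rank(values):
    """Rank values descending; ties share a rank and the next rank does not
    skip, e.g. 1, 2, 2, 3 ... (like SQL DENSE_RANK())."""
    ordered = sorted(values, reverse=True)
    distinct = sorted(set(values), reverse=True)
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}
    return [(rank_of[v], v) for v in ordered]

revenues = [300.0, 200.0, 300.0, 100.0]       # one key's values, for example
print(sparse_rank(revenues))   # [(1, 300.0), (1, 300.0), (3, 200.0), (4, 100.0)]
print(dense_rank(revenues))    # [(1, 300.0), (1, 300.0), (2, 200.0), (3, 100.0)]
```

The difference shows up only on ties: sparse ranking leaves a gap after tied values, dense ranking does not.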

Let us see examples in action

  • Let us get the details of orders sorted by order_status
  • Let us get the details of orders sorted by order_date and order_status
  • Let us get top 5 products by revenue for each day
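Under the hood, the first two examples boil down to sorting on a composite key, and the third to grouping by date and taking the top records within each group. A minimal plain-Python sketch of that logic (the sample orders and revenue figures below are invented; in Spark the same steps would go through sortByKey, and groupByKey followed by flatMap):

```python
from itertools import groupby
from operator import itemgetter

# Invented sample rows in the layout (order_id, order_date, order_status).
orders = [
    (1, "2014-01-02", "CLOSED"),
    (2, "2014-01-01", "PENDING"),
    (3, "2014-01-01", "CLOSED"),
]

# Examples 1 and 2: build a composite key and sort on it. Tuples compare
# element by element, so (order_date, order_status) sorts by date first and
# breaks ties by status -- the same effect as mapping each order to
# ((order_date, order_status), order) and calling sortByKey in Spark.
by_date_and_status = sorted(orders, key=lambda o: (o[1], o[2]))

# Example 3: invented (order_date, (product_id, revenue)) pairs, as if
# produced by an earlier per-day, per-product revenue aggregation.
daily_product_revenue = [
    ("2014-01-01", (101, 500.0)),
    ("2014-01-01", (102, 300.0)),
    ("2014-01-01", (103, 800.0)),
    ("2014-01-02", (101, 250.0)),
    ("2014-01-02", (104, 900.0)),
]

def top_n_per_day(pairs, n=5):
    """Per-group logic of the kind a flatMap after groupByKey would apply:
    for each date, emit the n (product, revenue) pairs with the highest
    revenue, in descending order of revenue."""
    result = []
    ordered = sorted(pairs, key=itemgetter(0))    # groupby needs sorted input
    for date, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        for product, revenue in sorted(values, key=itemgetter(1), reverse=True)[:n]:
            result.append((date, product, revenue))
    return result

print(by_date_and_status[0])    # (3, '2014-01-01', 'CLOSED')
print(top_n_per_day(daily_product_revenue, n=2))
```

Note that taking the top 5 per day cannot be expressed with sortByKey alone, since the cutoff is applied within each date's group rather than across the whole data set; that is exactly where groupByKey earns its place.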



Article by Training Itversity

Join the discussion at discuss.itversity.com
