Sorting is typically done with sortByKey, while more complex sorting and ranking are typically done with groupByKey
- To sort the data by a composite key, we need to bring all the fields we want to sort on into the key (for example, as a tuple)
- sortByKey sorts the data in either ascending or descending order, and the chosen direction applies to all the elements of the key
- If we need any ordering other than ascending or descending on all the elements of the key (for example, sorting the values within each key), we typically use groupByKey
- groupByKey generates, for each key, an Iterable of all the values related to that key
- We can apply whatever transformation we need to that Iterable of values:
  - Aggregations: possible, but groupByKey should be given lower preference than reduceByKey and aggregateByKey, which combine values within each partition before the shuffle
  - Sorting: with custom logic
  - Ranking: sparse as well as dense ranking
- Iterable is a Scala collection, so you need to know the Scala collections API to transform it
- After groupByKey, we typically use map or flatMap and apply Scala collection transformations to the values inside it
- We can run these examples on a local Spark installation, virtual machines, or big data labs
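The collection transformations described above can be tried without Spark, on a plain Scala collection that stands in for the Iterable groupByKey returns for one key. The following is only a minimal sketch: the object name RankingSketch, the (product_id, revenue) record shape, and the tie-breaking rule are assumptions made for illustration and are not part of the course material.

```scala
object RankingSketch {
  // Hypothetical record: (product_id, revenue), standing in for the values
  // that groupByKey would return for a single key such as an order date.
  type ProductRevenue = (Int, Double)

  // Custom sort: revenue descending, ties broken by product_id ascending.
  def sortByRevenueDesc(values: Iterable[ProductRevenue]): List[ProductRevenue] =
    values.toList.sortBy { case (productId, revenue) => (-revenue, productId) }

  // Sparse ranking: rank = 1 + number of values with strictly higher revenue,
  // so tied values share a rank and the following rank is skipped.
  // The count makes this quadratic, which is acceptable for a small sketch.
  def sparseRank(values: Iterable[ProductRevenue]): List[(ProductRevenue, Int)] = {
    val sorted = sortByRevenueDesc(values)
    sorted.map { case pr @ (_, revenue) => (pr, sorted.count(_._2 > revenue) + 1) }
  }

  // Dense ranking: rank = position of the revenue among the distinct revenues,
  // so tied values share a rank and no rank is skipped.
  def denseRank(values: Iterable[ProductRevenue]): List[(ProductRevenue, Int)] = {
    val distinctRevenues = values.map(_._2).toList.distinct.sortWith(_ > _)
    sortByRevenueDesc(values).map { case pr @ (_, revenue) =>
      (pr, distinctRevenues.indexOf(revenue) + 1)
    }
  }
}
```

With Spark, functions like these would typically be called inside the map or flatMap that follows groupByKey, once per key, on the Iterable of values for that key.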
Let us see examples in action
- Let us get the details of orders sorted by order_status
- Let us get the details of orders sorted by order_date and order_status
- Let us get the top 5 products by revenue for each day
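One possible way to approach the three examples above is sketched below. It assumes Spark is on the classpath, that each order is a comma-separated line in the form order_id,order_date,order_customer_id,order_status, and that revenue has already been prepared as (order_date, (product_id, revenue)) pairs. The object name, the function names, and all sample rows are made up for illustration, and take(n) simply keeps the first n rows per day without any special handling of ties.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object OrderSortingSketch {
  // Assumed record layout: order_id,order_date,order_customer_id,order_status

  // Orders sorted by order_status: make the status the key, then sortByKey.
  def sortByStatus(orders: RDD[String]): RDD[String] =
    orders.map(o => (o.split(",")(3), o)).sortByKey().map(_._2)

  // Orders sorted by order_date and order_status: a tuple is the composite key.
  def sortByDateAndStatus(orders: RDD[String]): RDD[String] =
    orders
      .map { o => val f = o.split(","); ((f(1), f(3)), o) }
      .sortByKey()
      .map(_._2)

  // Top n products by revenue for each day.
  // Input: (order_date, (product_id, revenue)); output: (order_date, product_id, revenue).
  def topNProductsPerDay(items: RDD[(String, (Int, Double))], n: Int): RDD[(String, Int, Double)] =
    items
      .map { case (date, (productId, revenue)) => ((date, productId), revenue) }
      .reduceByKey(_ + _) // aggregate with reduceByKey before grouping
      .map { case ((date, productId), revenue) => (date, (productId, revenue)) }
      .groupByKey()
      .flatMap { case (date, products) =>
        products.toList
          .sortBy { case (_, revenue) => -revenue } // custom sort on the Iterable
          .take(n)
          .map { case (productId, revenue) => (date, productId, revenue) }
      }

  def main(args: Array[String]): Unit = {
    // Local mode for experimentation; reuses a running context if there is one.
    val conf = new SparkConf().setAppName("OrderSortingSketch").setMaster("local[*]")
    val sc = SparkContext.getOrCreate(conf)

    // Made-up sample rows, for illustration only.
    val orders = sc.parallelize(Seq(
      "1,2013-07-25,11599,CLOSED",
      "2,2013-07-25,256,PENDING_PAYMENT",
      "3,2013-07-26,12111,COMPLETE",
      "4,2013-07-26,8827,CLOSED"
    ))
    val dailyRevenue = sc.parallelize(Seq(
      ("2013-07-25", (1, 100.0)),
      ("2013-07-25", (2, 50.0)),
      ("2013-07-26", (2, 10.0))
    ))

    sortByStatus(orders).collect().foreach(println)
    sortByDateAndStatus(orders).collect().foreach(println)
    topNProductsPerDay(dailyRevenue, 5).collect().foreach(println)

    // Stop the context when running as a standalone application.
    sc.stop()
  }
}
```

Revenue is summed with reduceByKey before groupByKey is applied, so the grouping step only has to hold one value per product per day. Passing false to sortByKey would give descending order instead.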