Most of the time, the default number of tasks for later stages is not appropriate. Let us understand the criteria for customizing the number of tasks for transformations that start a new stage.
- The following transformations go through shuffling and accept an optional parameter called numTasks
  - groupByKey
  - reduceByKey
  - aggregateByKey
  - sortByKey
  - join
  - cogroup
- cartesian also combines two data sets, but in the RDD API it does not accept numTasks
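- If numTasks is not passed, these transformations inherit the number of tasks from the earlier stage (unless spark.default.parallelism is set, in which case that value is used)
- To customize, we can set the number of tasks by passing a value for numTasks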
- One needs to understand the following criteria to determine numTasks accurately
  - How much data does each task need to process?
  - How many unique keys are there?
  - Are we performing aggregations, joins, or sorting and ranking?
  - If we are aggregating, at what rate will the data volume be reduced?
  - Are we filtering any data before shuffling? If so, how much data will be discarded by the filter?
- These are only a few of the criteria; you need to understand the data as well as the behavior of the APIs to come up with an accurate number
- Let us look at two examples: word count and card count by suit
- For card count, the key is dense. There are only 4 suits across millions of records, so the number of tasks in stage 2 can be set to 1
- For word count, the key is sparse
  - There are millions of distinct words, but each GB of input generates only about 30 MB of output
  - In our example, we process 10 GB of data and the output is approximately 350 MB
  - Hence we can set numTasks to 4 for reduceByKey
- Rule of thumb: if you are using HDFS and the block size is 128 MB, try to assign one task for every 128 MB of data
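The arithmetic behind the word count and card count examples can be sketched in plain Python. This is only an illustration of the rule of thumb, not part of Spark: the function names `estimate_shuffle_output_mb` and `estimate_num_tasks` and their parameters are hypothetical, and the sketch assumes the only inputs are the data volume after aggregation and the number of unique keys.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS block size assumed by the rule of thumb


def estimate_shuffle_output_mb(input_gb, output_mb_per_input_gb, kept_fraction=1.0):
    """Rough size of the data left after filtering and aggregation.

    kept_fraction is the share of input that survives any filter
    applied before the shuffle (1.0 means nothing is filtered).
    """
    return input_gb * kept_fraction * output_mb_per_input_gb


def estimate_num_tasks(output_mb, unique_keys=None, mb_per_task=BLOCK_SIZE_MB):
    """One task per mb_per_task of data, never more tasks than unique keys."""
    tasks = max(1, math.ceil(output_mb / mb_per_task))
    if unique_keys is not None:
        tasks = min(tasks, max(1, unique_keys))
    return tasks


# Word count: 10 GB of input at roughly 30 MB of output per GB.
word_count_mb = estimate_shuffle_output_mb(input_gb=10, output_mb_per_input_gb=30)
print(word_count_mb)                      # 300.0
print(estimate_num_tasks(word_count_mb))  # 3
print(estimate_num_tasks(350))            # 3

# Card count: only 4 unique keys and a tiny aggregated output.
print(estimate_num_tasks(0.001, unique_keys=4))  # 1
```

A strict ceiling gives 3 tasks for 300 to 350 MB. Rounding up to 4, as in the notes above, is one way to leave some headroom; treat either value as a starting point and verify it against the actual output sizes.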
Tasks
- What is the maximum number of reducers that should be used to get aggregated reports by state in the US?
- Understand the impact of skewness in data
- How does mod(hash(key), numTasks) work for numbers and integers?
- What is key collision, and how does it impact the partitioning of data?
- How do you determine numTasks when the key is sparse (e.g., word count)?
- What can be done to address skewness?
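As a starting point for the partitioning questions above, here is a minimal sketch of mod(hash(key), numTasks) in plain Python. It is not Spark's partitioner: `stable_hash` is a hypothetical stand-in (integers hash to themselves, everything else to a CRC32 of its text), chosen only so that results are repeatable across runs.

```python
import zlib
from collections import Counter


def stable_hash(key):
    # Stand-in hash for illustration only, not the hash Spark uses.
    if isinstance(key, int):
        return key
    return zlib.crc32(str(key).encode("utf-8"))


def partition_for(key, num_tasks):
    # mod(hash(key), numTasks): the task that receives this key.
    return stable_hash(key) % num_tasks


def records_per_task(keys, num_tasks):
    # How many records each task receives for the given keys.
    return Counter(partition_for(k, num_tasks) for k in keys)


# Key collision: different keys that land on the same task.
print([partition_for(k, 4) for k in (1, 5, 9)])  # [1, 1, 1]

# Even integer keys with an even numTasks leave tasks 1 and 3 idle.
print(sorted(records_per_task([0, 2, 4, 6, 8, 10], 4).items()))  # [(0, 3), (2, 3)]

# Skew: one heavy key keeps one task busy no matter how large numTasks is.
skewed_keys = ["CA"] * 90 + ["WY"] * 10
print(max(records_per_task(skewed_keys, 50).values()) >= 90)  # True
```

The last case shows why raising numTasks alone does not help with skew: all records with the same key go to the same task.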
Start the discussion at discuss.itversity.com