June 4, 2017 By Training Itversity

Determine number of tasks for later stages


Most of the time, the default number of tasks for later stages is not appropriate. Let us understand the criteria for customizing the number of tasks in a newly generated stage.

  • The following transformations, which involve shuffling, take an optional parameter called numTasks (named numPartitions in the Scala API)
    • groupByKey
    • reduceByKey
    • aggregateByKey
    • sortByKey
    • join
    • cogroup
  • cartesian also shuffles data, but it does not take this parameter
  • If numTasks is not passed, the new stage inherits the number of tasks from the earlier stage
  • To customize it, we can set the number of tasks by passing a value to numTasks
  • One needs to understand the criteria to determine numTasks accurately
    • How much data does each task need to process?
    • How many unique keys are there?
    • Are we performing aggregations, joins, or sorting and ranking?
    • If aggregations, at what rate will the data volume be reduced?
    • Are we filtering any data before shuffling? If so, how much data will be discarded by the filter?
    • These are only a few of the criteria; you need to understand the data as well as the behavior of the APIs to come up with an accurate number
  • Let us look at examples of both word count and card count by suit
  • For card count, the key is dense: there are only 4 suits across millions of records, so the number of tasks in stage 2 can be set to 1
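The dense-key case can be sketched in plain Scala, without Spark (the records here are hypothetical): however many rows go in, the aggregation leaves at most one entry per suit, so a single reduce task can comfortably hold the result.

```scala
// Hypothetical card records: one (suit, 1) pair per card -- a million
// rows, but only 4 distinct keys.
val suits = Vector("SPADES", "HEARTS", "DIAMONDS", "CLUBS")
val records = (0 until 1000000).map(i => (suits(i % 4), 1))

// Local equivalent of rdd.reduceByKey(_ + _, 1): collapse to one count
// per key. The result has only 4 entries regardless of input size.
val counts: Map[String, Int] =
  records.groupBy(_._1).map { case (suit, vs) => (suit, vs.map(_._2).sum) }

println(counts.size)  // 4
```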

  • For word count, the key is sparse.
  • There are millions of distinct words, but each GB of input generates only about 30 MB of shuffle output.
  • In the example discussed here we process 10 GB of data, and the output is approximately 350 MB.
  • Hence we can set numTasks to 4 for reduceByKey

  • Thumb rule: if you are using HDFS with a block size of 128 MB, try to assign one task per 128 MB of data to be processed
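Under that rule, picking numTasks is simple arithmetic. A hypothetical helper (the name and the sample sizes are illustrative, using the numbers from this lesson):

```scala
// Thumb rule as code: roughly one task per HDFS block (128 MB) of data
// the stage has to process.
val hdfsBlockMB = 128

def suggestedTasks(stageInputMB: Int): Int =
  math.max(1, math.ceil(stageInputMB.toDouble / hdfsBlockMB).toInt)

println(suggestedTasks(350))    // ~350 MB word-count shuffle output -> 3
println(suggestedTasks(30))     // tiny card-count output -> 1
println(suggestedTasks(10240))  // 10 GB with no reduction -> 80
```

By this strict ratio, 350 MB calls for 3 tasks; the lesson rounds up to 4, which leaves each task a little headroom below one full block.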

Tasks

  • How many reducers, at most, should be used to produce aggregated reports by state in the US?
  • Understand the impact of skewness in the data
  • How does mod(hash(key), numTasks) work for different key types?
  • What is a key collision, and how does it impact the partitioning of the data?
  • How do you determine numTasks when the key is sparse (e.g., word count)?
  • What can be done to address skewness?
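As a starting point for the hash-partitioning questions above, here is a simplified plain-Scala version of the logic Spark's HashPartitioner uses (the real one also routes null keys to partition 0):

```scala
// A key lands in partition nonNegativeMod(key.hashCode, numTasks).
def partitionFor(key: Any, numTasks: Int): Int = {
  val raw = key.hashCode % numTasks
  if (raw < 0) raw + numTasks else raw
}

// For non-negative Ints, hashCode is the value itself, so sequential
// numeric keys spread evenly across tasks.
println((0 to 7).map(partitionFor(_, 4)))  // Vector(0, 1, 2, 3, 0, 1, 2, 3)

// Different keys can land in the same partition (a collision). If a few
// heavy keys collide, one task does far more work than the rest -- skew.
println(partitionFor(5, 4) == partitionFor(9, 4))  // true
```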
