
May 29, 2017

Control arguments using spark-submit


Let us see the control arguments that can be used with spark-submit.

  • We will use the lab to demonstrate the usage of control arguments effectively

  • Let us review the capacity of the cluster
    • Worker nodes – 5
    • Memory – 32 GB on each node (total 160 GB)
    • YARN NodeManager memory – 24 GB on each node (total 120 GB)
    • Total cores – 8 on each node (total 40)
    • YARN cores – 6 on each node (total 30)
    • Container memory – 4 GB
  • Most of these arguments are applicable to spark-shell as well
  • Control arguments allow us to change the typical run-time or execution behavior of our jobs on the cluster
  • They can be used to increase or decrease the capacity allocated to process the data
    • --num-executors (YARN only) – default 2
    • --executor-cores for YARN or --total-executor-cores for Mesos – default 1
    • --executor-memory – default 1 GB
    • --num-executors is applicable only to YARN because executors are allocated for each spark-submit command under YARN, whereas with Mesos the cluster capacity is determined at setup time
    • With the defaults, each job will only run 2 tasks in parallel (2 executors with 1 core each), using 1 GB per executor, as shown in the example below
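
For example, a minimal sketch of submitting the wordcount application built earlier with the defaults – the class name, jar and paths below are placeholders, to be replaced with your own:

    # Submitting with the defaults: 2 executors, 1 core and 1 GB of memory each
    spark-submit \
      --master yarn \
      --class WordCount \
      wordcount_2.11-1.0.jar <input_path> <output_path>

With this submission, only 2 tasks can run at any point in time, irrespective of how many tasks each stage of the job has.
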
  • Let us try increasing the capacity (see the example command after this list)
    • --num-executors – 10
    • --executor-cores – 4
    • --executor-memory – 3.5 GB
    • As container memory is capped at 4 GB, executor-memory + spark.yarn.executor.memoryOverhead cannot be greater than 4 GB
    • You can review spark.yarn.executor.memoryOverhead in spark-defaults.conf using Ambari or the command line
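
Putting these numbers together, the submission could look like the sketch below (class, jar and paths are again placeholders). 3.5 GB is passed as 3584M; with the default memory overhead (roughly 10% of executor memory, with a 384 MB floor) the total request stays within the 4 GB container cap:

    spark-submit \
      --master yarn \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 3584M \
      --class WordCount \
      wordcount_2.11-1.0.jar <input_path> <output_path>

With these settings a single application can run up to 10 x 4 = 40 tasks in parallel, subject to what YARN can actually allocate out of the cluster capacity reviewed above.
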
  • We can use --conf to define a property value at run time (e.g. --conf spark.ui.port=54123)
  • All these control arguments are internally represented as Spark configuration properties. In the Spark UI, we can review them under the Environment tab, or list them from the REPL as shown below
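
As a quick sketch of both points, using spark-shell and illustrative values:

    # Pass a property at run time with --conf
    spark-shell \
      --master yarn \
      --conf spark.ui.port=54123 \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 3584M

    // Once the Scala REPL comes up, the control arguments show up as regular
    // Spark properties, e.g. spark.executor.memory and spark.ui.port
    scala> sc.getConf.getAll.foreach(println)

The same key/value pairs are what the Environment tab in the Spark UI displays for the running application.
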

