Let us go through the control arguments that can be used with spark-submit
- We will use the lab to demonstrate how control arguments are used effectively
- Let us review the capacity of the cluster
- Worker nodes – 5
- Memory – 32 GB on each node (total 160 GB)
- YARN NodeManager memory – 24 GB on each node (total 120 GB)
- Total cores – 8 on each node (total 40)
- YARN cores – 6 on each node (total 30)
- Container memory – 4 GB
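If you want to confirm these numbers from a gateway node, the YARN CLI can list the NodeManagers and show per-node capacity. A minimal sketch, assuming the yarn client is on the PATH; the node id below is a placeholder to be copied from the list output:

```bash
# List all NodeManagers (we expect 5 worker nodes)
yarn node -list

# Show memory and vcore capacity of a single node
# (replace the node id with one from the output above)
yarn node -status wn01.itversity.com:45454
```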
- Most of these arguments are applicable to spark-shell as well
- Control arguments allow us to change the typical run-time or execution-time behavior of our Spark jobs on the cluster
- They can be used to increase or decrease the capacity allocated to a job for processing the data
- num-executors (YARN only) – default 2
- executor-cores for YARN or total-executor-cores for Mesos – default 1
- executor-memory – default 1 GB
- num-executors is applicable only to YARN because a Spark cluster (a set of executors) is created for each spark-submit command under YARN, whereas with Mesos the cluster capacity is determined at setup time
- With the defaults, each job will only run 2 tasks in parallel (2 executors × 1 core each), using 1 GB of memory per executor
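To make the defaults concrete, the command below spells them out explicitly. The Spark Pi class ships with Spark; the jar path shown is the typical HDP location and may differ on your cluster:

```bash
# Equivalent to the defaults on YARN: 2 executors x 1 core x 1 GB,
# so at most 2 tasks run in parallel for this job
spark-submit \
  --master yarn \
  --num-executors 2 \
  --executor-cores 1 \
  --executor-memory 1g \
  --class org.apache.spark.examples.SparkPi \
  /usr/hdp/current/spark-client/lib/spark-examples.jar 100
```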
- Let us try increasing the capacity
- num-executors – 10
- executor-cores – 4
- executor-memory – 3.5 GB
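Submitted with these settings, the command looks like this. Note that 3.5 GB has to be written as a whole number with a unit suffix (e.g. 3584M), and the jar path is the same assumed location as above:

```bash
# 10 executors x 4 cores => up to 40 tasks in parallel
# Note: 40 requested cores exceed the 30 YARN vcores listed above, so
# YARN may keep some containers pending and grant fewer executors
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 3584M \
  --class org.apache.spark.examples.SparkPi \
  /usr/hdp/current/spark-client/lib/spark-examples.jar 100
```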
- As container memory is capped at 4 GB, executor-memory + spark.yarn.executor.memoryOverhead cannot be greater than 4 GB
- You can review spark.yarn.executor.memoryOverhead in spark-defaults.conf using Ambari or the command line
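From the command line that check is a one-liner; the config path below is the conventional location and may differ on your setup. If the property is not set, the documented default is max(384 MB, 10% of executor memory), which is why 3.5 GB still fits under the 4 GB cap:

```bash
# Look for an explicit overhead setting (path may vary by distribution)
grep memoryOverhead /etc/spark/conf/spark-defaults.conf

# If unset, overhead = max(384 MB, 10% of executor memory):
#   10% of 3584 MB is ~358 MB, so the 384 MB floor applies
#   3584 MB + 384 MB = 3968 MB <= 4096 MB container cap
```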
- We can use --conf to define a property value at run time (e.g. --conf spark.ui.port=54123)
- All these control arguments are internally represented as Spark properties. In the Spark UI, we can review them under the Environment tab
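Putting the last two points together, the run below overrides the UI port at submit time; while it runs, the Environment tab on the driver UI should show the overridden value alongside the other properties. The host name and jar path are placeholders:

```bash
# Override the Spark UI port for this run only
spark-submit \
  --master yarn \
  --conf spark.ui.port=54123 \
  --class org.apache.spark.examples.SparkPi \
  /usr/hdp/current/spark-client/lib/spark-examples.jar 100

# While the job runs, open http://<driver-host>:54123 and check the
# Environment tab: spark.ui.port, spark.executor.memory, etc. appear
# there as Spark properties
```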
Join the discussion at discuss.itversity.com