Getting Started with PySpark
In this topic we will see how to get started with Spark using Python (PySpark) and how to submit a sample application.
Here is the video that covers getting started with Spark using Python. Note that using Spark with JDBC is not relevant for CCA certification.
Here is the video that covers submitting PySpark applications.
Here are the steps to submit a Spark application written with PySpark:
- Open a new file and name it saveFile.py
- The program simply reads data from HDFS, prints it, and writes it back to HDFS
- Enter the code below in saveFile.py
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("pyspark")
sc = SparkContext(conf=conf)
dataRDD = sc.textFile("/user/cloudera/sqoop_import/departments")
for line in dataRDD.collect():
    print(line)
dataRDD.saveAsTextFile("/user/cloudera/pyspark/departmentsTesting")
- Save saveFile.py
- Run it in local mode (Spark runs on the local machine, without a cluster manager)
spark-submit --master local saveFile.py
- Run it on YARN
spark-submit --master yarn saveFile.py
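The two spark-submit commands above differ only in the value passed to --master. As a minimal sketch, the snippet below builds both command lines from one helper so the difference is easy to see. The helper names build_submit_command and as_shell are hypothetical and exist only for this illustration; they are not part of Spark. The snippet only constructs and prints the commands and does not launch Spark.

```python
import shlex


def build_submit_command(script, master="local", app_args=()):
    """Return the spark-submit argument list for a script and master.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    # --master is the same spark-submit option used in the steps above.
    return ["spark-submit", "--master", master, script, *app_args]


def as_shell(command):
    """Render an argument list as a single shell-safe command line."""
    return " ".join(shlex.quote(part) for part in command)


# The two runs from the steps above: local mode and YARN.
local_cmd = build_submit_command("saveFile.py", master="local")
yarn_cmd = build_submit_command("saveFile.py", master="yarn")

print(as_shell(local_cmd))
print(as_shell(yarn_cmd))
```

On a machine where spark-submit is on the PATH, either list could be handed to subprocess.run(local_cmd, check=True) to actually launch the job. Note that saveAsTextFile fails if the output directory already exists, so the output path from a previous run has to be removed (or changed) before submitting again.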