Let us see how we can create RDDs by reading files
- We can read data from the local file system in local execution mode
- We can read data from HDFS, AWS S3, Azure Blob, etc. in any of the 4 modes (local, standalone, YARN and Mesos)
On your computer
- If you run spark-shell, the Spark shell will be launched pointing to the local file system
- You can access files on your PC as part of your programs
- To access data on other file systems, you have to use the file system protocol, such as s3:// for AWS S3
- You will be able to access local files in both local mode as well as standalone mode
- To demonstrate, make sure you set up the data sets as explained here
- Sample code
sc.textFile("/Users/itversity/Research/data/retail_db/orders")
- The path passed to sc.textFile has to be a valid path
- To preview the data and validate the path, run
sc.textFile("/Users/itversity/Research/data/retail_db/orders").first
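As a sketch, the steps above can be combined in a spark-shell session like this (the path is the one from the example above and assumes the retail_db data set has been set up at that location; adjust it to your machine):

```scala
// spark-shell session sketch: create an RDD from a local file and preview it
// (assumes the retail_db orders data set exists at this path on your PC)
val orders = sc.textFile("/Users/itversity/Research/data/retail_db/orders")

// Preview the first record to validate the path; an invalid path will fail here
// with "Input path does not exist" because reads are lazy until an action runs
orders.first

// Preview a few records and count the total number of lines
orders.take(5).foreach(println)
orders.count
```

Note that sc.textFile itself does not touch the file system; only actions such as first, take, or count trigger the read, which is why previewing with first is the quickest way to validate a path.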
On lab or virtual machines
- You can sign up for the lab and run the programs directly on it. Data sets are already set up under the local file system at /data as well as in HDFS at /public
- Enroll for the lab
- Login to gw01.itversity.com
- Sample code:
sc.textFile("/public/retail_db/orders").first
- If you want to follow our examples on virtual machines, you can set up the data sets as explained here
- HDFS and Spark are integrated
- Hence, when you invoke sc.textFile or any function to read data from files, by default it will look for the data in HDFS
- To read data from the local file system, you need to prefix the path with file://
- To access data from the local file system, you need to launch spark-shell or spark-submit in local mode only
- Sample code to read from the local file system
sc.textFile("file:///data/retail_db/orders").first
To read files from AWS S3 or Azure Blob, please refer to their official documentation. The code will look almost the same, but we need to set a couple of parameters for credentials and then use the respective protocol (such as s3:// for AWS S3)
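For AWS S3 specifically, a minimal spark-shell sketch might look like the following. The fs.s3a.* configuration keys and the s3a:// protocol come from the Hadoop S3A connector (which requires the hadoop-aws jar on the classpath); the bucket name and credential values are placeholders, not values from this article:

```scala
// Sketch: set AWS credentials for the Hadoop S3A connector in spark-shell
// (placeholder values -- substitute your own keys and bucket name)
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_AWS_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_AWS_SECRET_KEY")

// Then read using the s3a:// protocol instead of a local or HDFS path
sc.textFile("s3a://your-bucket/retail_db/orders").first
```

In practice, prefer passing credentials through environment variables or instance roles rather than hard-coding them in programs; see the Hadoop S3A documentation for the supported options.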