Introduction to RDD
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel across a cluster.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source for which a Spark connector is available.
list, dict, and set are collection types that come built into Python. An RDD is, in a sense, a distributed extension of these: the major difference is that an RDD's elements are partitioned across the nodes of a cluster rather than held in a single process (see the sketch below).
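To make the "distributed collection" point concrete, here is a minimal PySpark sketch. It assumes an interactive session where a SparkContext is already available as sc (as in the pyspark shell), and uses glom() purely to peek at how the elements land in partitions.

```python
# Minimal sketch: a local Python list vs. its distributed RDD counterpart.
# Assumes `sc` is an existing SparkContext (e.g. from the pyspark shell).
data = [1, 2, 3, 4, 5]                   # an ordinary in-memory Python list

rdd = sc.parallelize(data, numSlices=4)  # distribute it across 4 partitions

print(rdd.getNumPartitions())            # 4
print(rdd.glom().collect())              # one sub-list per partition,
                                         # e.g. [[1], [2], [3], [4, 5]]
```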
RDDs can be created
- using parallelized collections – sc.parallelize
- from external datasets (local file system, HDFS, S3, etc.) – e.g., sc.textFile
- Invoking these APIs will not create the RDD immediately by loading data into memory
- It only builds the DAG; a job is submitted only when an action is performed (will be covered later), as the sketch after this list shows
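Here is a small sketch of both creation paths and of lazy evaluation, again assuming sc is a live SparkContext; the file path is a hypothetical placeholder.

```python
# Two ways to create an RDD, and lazy evaluation in action.
# Assumes `sc` is a SparkContext; /data/sample.txt is a placeholder path.
nums = sc.parallelize([1, 2, 3, 4, 5])   # 1. from a driver-side collection
lines = sc.textFile("/data/sample.txt")  # 2. from an external dataset

doubled = nums.map(lambda x: x * 2)      # transformation: only extends the DAG,
                                         # nothing is loaded or computed yet

result = doubled.collect()               # action: a job is submitted and runs now
print(result)                            # [2, 4, 6, 8, 10]
```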
There are APIs on RDD to persist data at different storage levels – persist and cache
- Data will be persisted as deserialized objects in the JVM, across all the nodes on which executor tasks read the data
- MEMORY_ONLY (default)
- and more (e.g., MEMORY_AND_DISK, DISK_ONLY)
- e.g., persist(StorageLevel.MEMORY_ONLY)
- Package: org.apache.spark.storage._ (Scala); in PySpark, StorageLevel comes from the pyspark package
- Let us see the demo in the lab; a small sketch follows below
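As a preview of the lab demo, here is a minimal persist/cache sketch in PySpark, assuming sc as before.

```python
# Minimal sketch of persisting an RDD; in PySpark the StorageLevel
# constants live in the `pyspark` package.
from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))

rdd.persist(StorageLevel.MEMORY_ONLY)  # same level as rdd.cache()
rdd.count()                            # first action computes and caches the data
rdd.count()                            # later actions reuse the cached partitions

rdd.unpersist()                        # release the cached data when done
```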
There are also different types of RDDs
- Even though there are many types of RDDs, we will primarily work with
- the base RDD (including the paired RDD, where each element is a tuple of a key-value pair), sketched below
- You should keep in mind that many types of RDDs will be created implicitly while processing the data
- We will see them in action later
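A paired RDD is simply an RDD whose elements are (key, value) tuples, which is what unlocks key-based operations such as reduceByKey. A minimal word-count-style sketch, assuming sc as before:

```python
# Paired RDD sketch: each element is a (key, value) tuple.
words = sc.parallelize(["spark", "rdd", "spark", "hdfs", "rdd", "spark"])

pairs = words.map(lambda w: (w, 1))             # paired RDD of (word, 1) tuples
counts = pairs.reduceByKey(lambda a, b: a + b)  # combine values per key

print(counts.collect())  # e.g. [('spark', 3), ('rdd', 2), ('hdfs', 1)]
```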