
May 16, 2017 By Training Itversity

Resilient Distributed Datasets – from files


Let us see how we can create RDDs by reading data from files.

  • We can read data from the local file system in local execution mode
  • We can read data from HDFS, AWS S3, Azure Blob Storage, etc. in any of the 4 modes (local, standalone, YARN and Mesos)

On your computer

  • If you run spark-shell on your computer, the Spark shell will be launched pointing to the local file system
  • You can access files on your PC as part of your programs
  • To access data on other file systems, you have to use the file system protocol, such as s3:// for AWS S3
  • You will be able to access the local file system in both local mode as well as standalone mode
  • To demonstrate, make sure you set up the data sets as explained here
  • Sample code: sc.textFile("/Users/itversity/Research/data/retail_db/orders")
  • The path passed to sc.textFile has to be a valid path
  • Run sc.textFile("/Users/itversity/Research/data/retail_db/orders").first to preview the data and validate the path (see the sample session after this list)
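Here is a minimal spark-shell session for this case, assuming you downloaded the retail_db data set under /Users/itversity/Research/data as explained in the setup topic (adjust the path to wherever your copy lives):

  // launch spark-shell from the terminal, then:
  val orders = sc.textFile("/Users/itversity/Research/data/retail_db/orders")
  orders.first      // preview the first record and validate the path
  orders.take(10)   // preview the first 10 records
  orders.count      // total number of records in the data set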

On lab or virtual machines

  • You can sign up for the lab and run the programs directly on the lab. Data sets are already set up under /data on the local file system as well as /public on HDFS
    • Enroll for the lab
    • Log in to gw01.itversity.com
    • Sample code: sc.textFile("/public/retail_db/orders").first

  • If you want to follow our examples on virtual machines, you can set up the data sets as explained here
  • HDFS and Spark are integrated on both the lab and the virtual machines
  • Hence, when you invoke sc.textFile or any other function that reads data from files, by default it will look for the data in HDFS
  • To read data from the local file system, you need to prefix the path with file://
  • To access data from the local file system, you need to launch spark-shell or spark-submit in local mode only
  • Sample code to read from the local file system: sc.textFile("file:///data/retail_db/orders").first (see the sketch after this list)
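The sketch below shows both cases side by side on the lab or a virtual machine, assuming the data sets are available under /public in HDFS and /data on the local file system as described above:

  // launch spark-shell on the gateway node (for example gw01.itversity.com), then:

  // HDFS is the default file system, so a plain path resolves to HDFS
  val ordersHdfs = sc.textFile("/public/retail_db/orders")
  ordersHdfs.first    // preview the first record from HDFS

  // prefix the path with file:// to read from the local file system (local mode)
  val ordersLocal = sc.textFile("file:///data/retail_db/orders")
  ordersLocal.first   // preview the first record from the local file system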

To read files from AWS S3 or Azure Blob Storage, please refer to their official documentation. The code will look almost the same, but we need to set a couple of parameters for credentials and then use the respective protocol (such as s3:// for AWS S3).
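As an illustration only, here is a rough sketch for AWS S3 using the Hadoop s3a connector; the bucket name and credentials are hypothetical, and the exact configuration properties can vary with your Hadoop and connector versions, so verify them against the official documentation:

  // hypothetical bucket and credentials; property names are from the Hadoop S3A connector
  sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
  sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

  val orders = sc.textFile("s3a://your-bucket/retail_db/orders")
  orders.first    // preview the first record to validate access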
