
May 16, 2017

Develop wordcount using REPL


Let us come up with the code for getting the word count.

  • Make sure to set up the environment (see Setting up environment – Spark) and have the project sparkdemo

  • We can launch the REPL using spark-shell or sbt console (preferred)
  • With spark-shell we get an implicit SparkContext object, whereas with sbt console we have to create one ourselves
  • Here is a simple code example for creating SparkContext programmatically (for sbt console)
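A minimal sketch, assuming Spark's core API is on the classpath; the app name and master URL below are placeholder values, so adjust them for your setup:

import org.apache.spark.{SparkConf, SparkContext}

// Build the configuration; "Word Count" and "local" are placeholders
val conf = new SparkConf().
  setAppName("Word Count").
  setMaster("local")

// Create SparkContext programmatically from the configuration
val sc = new SparkContext(conf)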

  • Input file format – text file (data in plain text or alphanumeric format)
  • As we are reading data from a text file, we can use sc.textFile; it will create an RDD (distributed collection) of the lines in the file
  • Word count logic
    • Split each element in the RDD using space as the delimiter
    • Generate a key-value pair for each of the words
    • For each key, compute the count by aggregating its values
  • Finally we will save the output to a text file
  • There are multiple ways to do it, as shown in the sketch after this list
    • sc.textFile -> flatMap -> map -> reduceByKey -> saveAsTextFile
    • sc.textFile -> flatMap -> reduceByKey (the map used in the flatMap logic is from Scala Array) -> saveAsTextFile
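Here is a sketch of both approaches; the input and output paths are placeholders, so replace them with locations valid in your environment:

// Read the text file into an RDD of lines (placeholder path)
val lines = sc.textFile("/path/to/input.txt")

// Approach 1: flatMap -> map -> reduceByKey -> saveAsTextFile
val wordCount = lines.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey((total, element) => total + element)
wordCount.saveAsTextFile("/path/to/output")

// Approach 2: pairing done inside flatMap using Scala Array's map
val wordCountAlt = lines.
  flatMap(line => line.split(" ").map(word => (word, 1))).
  reduceByKey((total, element) => total + element)
wordCountAlt.saveAsTextFile("/path/to/output_alt")

Both approaches produce the same (word, count) pairs; the second simply folds the pairing step into flatMap using the map from Scala's Array.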

Article by Training Itversity

Join the discussion at discuss.itversity.com
