Let us write the code to get the word count for a file
- Make sure to set up environment and have project sparkdemo
- We can launch a REPL using spark-shell or sbt console (preferred)
- With spark-shell we get an implicit SparkContext object, whereas with sbt console we have to create one ourselves
- Here is a simple code example for creating a SparkContext programmatically (for sbt console)
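A minimal sketch of creating the SparkContext from sbt console; it assumes spark-core is already on the sbt classpath, and the master URL ("local") and app name ("sparkdemo") are placeholders to adjust for your environment:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Configure the application; "local" runs Spark in-process,
// "sparkdemo" is the app name (both are placeholder values)
val conf = new SparkConf().setMaster("local").setAppName("sparkdemo")

// Create the SparkContext manually, since sbt console
// does not provide an implicit sc the way spark-shell does
val sc = new SparkContext(conf)
```

With spark-shell this step is unnecessary, as sc is already available when the shell starts.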
- Input file format – text file (data in plain text or alphanumeric format)
- As we are reading data from a text file, we can use sc.textFile, which creates an RDD (a distributed collection) of the lines in the file
- Word count logic
- Split each element in the RDD using space as the delimiter
- Generate a key-value pair for each word
- Group by key and get the count for each word
- Finally, save the output to a text file
- There are multiple ways to do it
- sc.textFile -> flatMap -> map -> reduceByKey -> saveAsTextFile
- sc.textFile -> flatMap -> reduceByKey (here the map used inside the flatMap logic is Scala's Array map, not Spark's) -> saveAsTextFile
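Both pipelines above can be sketched as follows; this assumes sc is an existing SparkContext, and the input and output paths are placeholder values:

```scala
// Variant 1: sc.textFile -> flatMap -> map -> reduceByKey -> saveAsTextFile
val lines = sc.textFile("wordcount_input.txt")  // RDD of lines (placeholder path)
val counts = lines.
  flatMap(line => line.split(" ")).             // split each line into words
  map(word => (word, 1)).                       // key-value pair per word
  reduceByKey(_ + _)                            // sum the counts for each word
counts.saveAsTextFile("wordcount_output")       // writes one or more part files

// Variant 2: sc.textFile -> flatMap -> reduceByKey -> saveAsTextFile
// Here the pairing is done inside flatMap using Scala's Array map
val counts2 = lines.
  flatMap(line => line.split(" ").map(word => (word, 1))).
  reduceByKey(_ + _)
counts2.saveAsTextFile("wordcount_output2")
```

Note that saveAsTextFile fails if the output directory already exists, so use a fresh path on each run.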
Join the discussion at discuss.itversity.com