Word count using PySpark
In this topic we will understand how to write a word count program in PySpark.
We will compute the word count using the flatMap, map and reduceByKey functions.
Here is the code snippet along with details about the logic. One can run the commands one line at a time in the pyspark shell and follow what is going on at each step.
- Reading data from an HDFS location
data = sc.textFile("/user/cloudera/wordcount.txt")
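Here sc is the SparkContext that the pyspark shell creates automatically; textFile returns an RDD in which each element is one line of the file. To peek at the first couple of lines, one can run, for example:
data.take(2)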
- Flatten each line into individual words using " " (a single space) as the delimiter.
dataFlatMap = data.flatMap(lambda x: x.split(" "))
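flatMap applies the given function to every line and then flattens the resulting lists, so the output RDD contains individual words rather than lists of words. A quick sanity check (the exact output depends on the input file):
dataFlatMap.take(5)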
- Use the map function to associate the value 1 with each input word.
dataMap = dataFlatMap.map(lambda x: (x, 1))
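Each element is now a (word, 1) tuple, which is the key-value shape that reduceByKey expects. Again, one can inspect a few elements:
dataMap.take(5)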
- Aggregating by key (the keys here are the unique words)
dataReduceByKey = dataMap.reduceByKey(lambda x, y: x + y)
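reduceByKey groups the tuples by key (the word) and repeatedly applies the lambda to pairs of values. For a word that appears three times, the values [1, 1, 1] are combined as (1 + 1) + 1 = 3. The function must be commutative and associative, since Spark may combine partial counts across partitions in any order.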
- Saving the result to HDFS using saveAsTextFile, as shown below (the output path is illustrative; note that the output directory must not already exist)
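dataReduceByKey.saveAsTextFile("/user/cloudera/wordcount_output")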
- Validating the result by printing each (word, count) pair
for i in dataReduceByKey.collect(): print(i)
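Note that collect brings the entire result back to the driver, so it is only appropriate for small outputs like this sample. For larger data sets, prefer take(n) or inspect the files written by saveAsTextFile. Each printed element is a (word, count) tuple, for example ('spark', 3) if the word spark occurred three times.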
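To put everything together, here is a minimal standalone sketch of the same program, assuming it is submitted with spark-submit rather than typed into the pyspark shell (the input and output paths are illustrative):
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Read lines, split into words, pair each word with 1, then sum per word
data = sc.textFile("/user/cloudera/wordcount.txt")
counts = data.flatMap(lambda x: x.split(" ")) \
             .map(lambda x: (x, 1)) \
             .reduceByKey(lambda x, y: x + y)

counts.saveAsTextFile("/user/cloudera/wordcount_output")  # output directory must not already exist
sc.stop()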