Let us get started with Spark Streaming.
- Spark Streaming is a module that provides APIs to process streaming data
- It can be integrated with streaming ingestion technologies such as Flume, Kafka, etc.
Spark Streaming is used to process data in a streaming fashion.
- It requires an entry point called StreamingContext
- Unlike SparkContext, a StreamingContext runs perpetually, processing data at regular intervals
- We cannot have multiple contexts running at the same time, so if there is a running SparkContext we need to stop it before we launch a StreamingContext
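As a minimal sketch of the point above, the snippet below stops an already running SparkContext (as in spark-shell, where one is created as `sc`) and then creates a StreamingContext; the app name and the 10-second batch interval are illustrative assumptions:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assume sc is an already running SparkContext (e.g. in spark-shell);
// it must be stopped before a StreamingContext is launched
sc.stop()

// Create a StreamingContext that processes data every 10 seconds
val ssc = new StreamingContext("local[2]", "StreamingDemo", Seconds(10))
```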
Setting up Netcat
Netcat, a simple networking utility, can be used as a data source to get started.
- Start a netcat server using a host name and port number, for example `nc -lk 9999`
- We can then type messages into the terminal to publish them to this socket
Streaming Word Count – Development
Let us develop the program to perform word count.
- Add dependencies for spark streaming to build.sbt
- import org.apache.spark.streaming._ to import all the APIs
- Create Spark Configuration object with master and app name
- Pass the Spark Configuration object and a batch interval in seconds to the Streaming Context object. The Streaming Context queues up the data received during each interval and then applies the processing logic to that batch.
- Develop necessary logic to perform streaming word count
- At the end, invoke start and awaitTermination so that the Spark Streaming Context runs perpetually
- Finally, build the jar file using sbt package command
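The steps above can be sketched as one complete program. This assumes build.sbt already declares the dependency (something like `libraryDependencies += "org.apache.spark" %% "spark-streaming" % "<your Spark version>"`); the object name StreamingWordCount, the localhost:9999 netcat address, and the 10-second batch interval are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Streaming word count against a socket source such as netcat
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Spark Configuration object with master and app name
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    // Batch interval of 10 seconds: data is queued per interval and
    // processed as one batch
    val ssc = new StreamingContext(conf, Seconds(10))

    // Read lines from the netcat server started with: nc -lk 9999
    val lines = ssc.socketTextStream("localhost", 9999)

    // Word count logic applied to each batch
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()            // start the streaming computation
    ssc.awaitTermination() // run perpetually until stopped
  }
}
```

After building with `sbt package`, the resulting jar can be submitted to a cluster, which is covered in the deploy and run section.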
Streaming Word Count – Deploy and Run
Overview of DStreams and APIs