Kafka is becoming popular for ingesting data in a streaming fashion from a variety of sources into a variety of targets, for several reasons:
- Highly reliable
- Highly scalable
- Robust APIs
- Easy to integrate
High-level architecture
A Kafka deployment consists of several components:
- ZooKeeper – coordinates the brokers and keeps track of cluster state
- Kafka Broker – manages topics and stores the messages
- Kafka Cluster – a group of brokers working together
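As a rough illustration of how these pieces fit together, a broker's server.properties file identifies the broker within the cluster and points it at ZooKeeper. This is a minimal sketch; all values below are examples:

```properties
# Unique id of this broker within the cluster
broker.id=0
# Where the broker accepts client connections
listeners=PLAINTEXT://localhost:9092
# Directory where topic partitions (commit logs) are stored
log.dirs=/tmp/kafka-logs
# ZooKeeper ensemble the broker registers with
zookeeper.connect=localhost:2181
```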
Kafka provides APIs for:
- Producers – to publish messages to Kafka topics
- Consumers – to consume messages from Kafka topics
- Stream processors – to transform data between topics (the Kafka Streams API)
- Connectors – to integrate with external systems such as databases (the Kafka Connect API)
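Here is a minimal sketch of the producer and consumer APIs using the Java client. The broker address, topic name demo, and group id are illustrative, and poll(Duration) assumes client version 2.0 or later:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaApiDemo {
    public static void main(String[] args) {
        // Producer: publish one message to the (illustrative) topic "demo"
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("demo", "key-1", "hello kafka"));
        }

        // Consumer: read the messages back from the same topic
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "demo-group");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("demo"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```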
Sample commands – to validate the setup
Kafka provides command-line utilities to:
- manage topics – create, list, and delete
- publish messages to a topic
- consume messages from a topic
Make sure the required environment variables are set (for example, Kafka's bin directory added to PATH) before using these commands.
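For example, assuming a broker at localhost:9092 and ZooKeeper at localhost:2181 (the topic name demo is illustrative; on newer Kafka versions, topic management uses --bootstrap-server instead of --zookeeper):

```bash
# Create a topic with 3 partitions and replication factor 1
kafka-topics.sh --create --zookeeper localhost:2181 \
  --partitions 3 --replication-factor 1 --topic demo

# List and delete topics
kafka-topics.sh --list --zookeeper localhost:2181
kafka-topics.sh --delete --zookeeper localhost:2181 --topic demo

# Publish messages (type one message per line; Ctrl+C to exit)
kafka-console-producer.sh --broker-list localhost:9092 --topic demo

# Consume messages from the beginning of the topic
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic demo --from-beginning
```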
Anatomy of a topic
Kafka topics have important properties such as:
- partitions – for scalability (a topic is split into partitions that can be written to and read in parallel)
- replication – for reliability (each partition is copied to multiple brokers)
We can define the number of partitions and the replication factor while creating a topic. Consumers maintain offsets to keep track of the position up to which messages have been read in each partition.
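To inspect this layout, the describe option of the topics utility shows, for each partition of a topic, the leader broker, the replica set, and the in-sync replicas (ISR). The topic name demo and ZooKeeper address are illustrative; newer versions use --bootstrap-server:

```bash
# Per partition: leader broker, replicas, and in-sync replicas (ISR)
kafka-topics.sh --describe --zookeeper localhost:2181 --topic demo
```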
Role of Kafka and Flume
Even though Kafka is increasingly used in new applications, there is resistance to modifying legacy applications. For legacy applications, Flume and Kafka complement each other well:
- Read data from legacy applications using Flume
- Use the Flume Kafka sink to push messages to a Kafka topic
- Consume messages from Kafka and process them using technologies like Spark Streaming
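As a minimal sketch of this pattern, a Flume agent can tail a legacy application's log file and push each line to a Kafka topic using the built-in Kafka sink. The agent name, log path, and topic name below are illustrative, and the kafka.* sink properties assume Flume 1.7 or later:

```properties
# Agent "a1": exec source -> memory channel -> Kafka sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Tail the legacy application's log file (illustrative path)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/legacy-app/app.log
a1.sources.r1.channels = c1

# Buffer events in memory between source and sink
a1.channels.c1.type = memory

# Push each event to the Kafka topic "legacy-events"
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = legacy-events
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092
a1.sinks.k1.channel = c1
```

Downstream, a consumer such as a Spark Streaming job can subscribe to the same topic and process the events as they arrive.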