Apache Flume is an open source tool that can capture data, such as web server logs, in real time as it is generated and move it into the Hadoop ecosystem.
- Distributed and reliable
- It can collect, aggregate and move large amounts of log data
- It is robust, fault tolerant and tunable
- It uses a simple, extensible data model that allows for online analytic applications
- The Flume User Guide will be available while taking the certification exam.
Topics covered as part of this lesson
The skills to transfer data between external systems and your cluster. This includes the following:
- Import data from a MySQL database into HDFS using Sqoop
- Export data to a MySQL database from HDFS using Sqoop
- Change the delimiter and file format of data during import using Sqoop
- Ingest real-time and near-real-time (NRT) streaming data into HDFS using Flume
- Load data into and out of HDFS using the Hadoop File System (FS) commands
Different Flume agent configurations
- Flume supports the following mechanisms to read data from popular log stream types: Avro, Thrift, Syslog and Netcat.
- An agent needs to be installed to move data from a source to a target, and Flume supports several types of agent setups:
- Multi-agent flow
- To flow data across multiple agents or hops, the sink of the previous agent and the source of the current hop both need to be of the avro type, with the sink pointing to the hostname (or IP address) and port of the source.
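A two-hop flow like the one described above could be configured roughly as follows. This is a minimal sketch, not a definitive setup: the agent names (`agent1`, `agent2`), the component names (`r1`, `c1`, `k1`), the host name and the ports are all hypothetical placeholders, and the netcat source and logger sink are only stand-ins for a real source and destination.

```properties
# Sketch only: agent names, component names, host name and ports are
# hypothetical placeholders chosen for illustration.

# agent1 (first hop): its avro sink forwards events to agent2.
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

agent1.sources.r1.type = netcat
agent1.sources.r1.bind = localhost
agent1.sources.r1.port = 44444
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

# The sink points at the host and port where agent2's avro source listens.
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = agent2-host.example.com
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1

# agent2 (second hop): its avro source receives what agent1's sink sends.
agent2.sources = r1
agent2.channels = c1
agent2.sinks = k1

agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
agent2.sources.r1.channels = c1

agent2.channels.c1.type = memory

agent2.sinks.k1.type = logger
agent2.sinks.k1.channel = c1
```

The only coupling between the two agents is that `agent1.sinks.k1.hostname` and `agent1.sinks.k1.port` match the address on which `agent2.sources.r1` is bound.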
- Consolidation
- A very common scenario in log collection is a large number of log-producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers are sent to a dozen agents that write to an HDFS cluster.
- This can be achieved in Flume by configuring a number of first-tier agents with an avro sink, all pointing to the avro source of a single agent (again, you could use the thrift sources, sinks and clients in such a scenario). This source on the second-tier agent consolidates the received events into a single channel, which is consumed by a sink that delivers them to their final destination.
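The two-tier setup described above might look like the following sketch. The agent names (`web1`, `web2`, `collector`), the host names, the port, the log file path and the HDFS path are hypothetical placeholders. Only two first-tier agents are shown, and the exec source with `tail -F` is just one assumed way of reading a web server log.

```properties
# Sketch only: agent names, host names, port and paths are hypothetical.

# First-tier agent on web server 1: tail a log file, forward over avro.
web1.sources = r1
web1.channels = c1
web1.sinks = k1
web1.sources.r1.type = exec
web1.sources.r1.command = tail -F /var/log/httpd/access_log
web1.sources.r1.channels = c1
web1.channels.c1.type = memory
web1.sinks.k1.type = avro
web1.sinks.k1.hostname = collector.example.com
web1.sinks.k1.port = 4545
web1.sinks.k1.channel = c1

# First-tier agent on web server 2: same shape, same destination.
web2.sources = r1
web2.channels = c1
web2.sinks = k1
web2.sources.r1.type = exec
web2.sources.r1.command = tail -F /var/log/httpd/access_log
web2.sources.r1.channels = c1
web2.channels.c1.type = memory
web2.sinks.k1.type = avro
web2.sinks.k1.hostname = collector.example.com
web2.sinks.k1.port = 4545
web2.sinks.k1.channel = c1

# Second-tier agent: one avro source consolidates events from all
# first-tier sinks into a single channel, drained by an HDFS sink.
collector.sources = r1
collector.channels = c1
collector.sinks = k1
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1
collector.channels.c1.type = file
collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://namenode.example.com/flume/weblogs
collector.sinks.k1.hdfs.fileType = DataStream
collector.sinks.k1.channel = c1
```

Adding another web server then only means adding another first-tier agent with the same avro sink settings; the second-tier agent's configuration does not change.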
- Multiplexing the flow
- Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.
- Consider, for example, a source in agent “foo” fanning out the flow to three different channels. This fan-out can be replicating or multiplexing. In a replicating flow, each event is sent to all three channels. In a multiplexing flow, an event is delivered to a subset of the available channels when one of its attributes matches a preconfigured value. For example, if an event attribute called “txnType” is set to “customer”, the event should go to channel1 and channel3; if it is “vendor”, it should go to channel2; otherwise it should go to channel3. The mapping can be set in the agent’s configuration file.
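The “txnType” routing described above could be expressed with a multiplexing channel selector, roughly as in this sketch. The source name (`src1`), the sink names, the port, and the avro source and logger sinks are hypothetical choices made for illustration; only the agent name, the channel names and the mapping come from the description.

```properties
# Sketch only: src1, sink1..sink3, the port and the component types are
# hypothetical; the txnType mapping follows the description in the text.
foo.sources = src1
foo.channels = channel1 channel2 channel3
foo.sinks = sink1 sink2 sink3

foo.sources.src1.type = avro
foo.sources.src1.bind = 0.0.0.0
foo.sources.src1.port = 4141
foo.sources.src1.channels = channel1 channel2 channel3

# Route on the value of the "txnType" event header.
foo.sources.src1.selector.type = multiplexing
foo.sources.src1.selector.header = txnType
foo.sources.src1.selector.mapping.customer = channel1 channel3
foo.sources.src1.selector.mapping.vendor = channel2
foo.sources.src1.selector.default = channel3

foo.channels.channel1.type = memory
foo.channels.channel2.type = memory
foo.channels.channel3.type = memory

# One sink per channel; logger sinks stand in for real destinations.
foo.sinks.sink1.type = logger
foo.sinks.sink1.channel = channel1
foo.sinks.sink2.type = logger
foo.sinks.sink2.channel = channel2
foo.sinks.sink3.type = logger
foo.sinks.sink3.channel = channel3
```

For the replicating case, setting `selector.type = replicating` (which is also the default when no selector is configured) and dropping the `header`, `mapping` and `default` lines would send every event to all three channels.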