Let us try to ingest data in real time using Flume.
This question focuses on validating your Flume skills. You can either learn Flume by following the video accompanying this post, or learn Flume elsewhere and then solve this problem while using the video as a reference. The video serves both as a tutorial and as a walkthrough of how to leverage Flume for data ingestion.
Note: While this post only provides the specifics needed to solve the problem, the video provides an introduction, an explanation and, more importantly, the application of Flume knowledge.
- This step comprises three substeps. Please perform the tasks under each substep completely.
- Using Sqoop, pull data from the MySQL orders table into /user/cloudera/problem7/prework as an Avro data file, using only one mapper.
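A minimal sketch of such an import is shown below. The connection string, username and password are assumptions based on the retail_db database that ships with the CDH quickstart VM; adjust them to your environment.

```shell
# Sqoop import of the orders table as Avro with a single mapper.
# Connection details (retail_db, retail_dba, cloudera) are assumptions
# for the CDH quickstart VM -- replace with your own.
sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username retail_dba \
  --password cloudera \
  --table orders \
  --target-dir /user/cloudera/problem7/prework \
  --as-avrodatafile \
  -m 1
```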
- Pull the file from /user/cloudera/problem7/prework into a local folder named flume-avro.
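One way to do this, sketched below, uses `hdfs dfs -get` with a wildcard, since the exact part-file name produced by Sqoop can vary:

```shell
# Create the local folder and copy the Avro part file(s) from HDFS into it.
mkdir -p ~/flume-avro
hdfs dfs -get /user/cloudera/problem7/prework/part-m-* ~/flume-avro/
```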
- Create a Flume agent configuration with an Avro source listening on localhost port 11112, a JDBC channel, and an HDFS file sink writing to /user/cloudera/problem7/sink.
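A configuration along these lines could look like the sketch below. The agent name `a1` and the component names are assumptions; any names work as long as they are used consistently and match the `--name` flag when the agent is started.

```properties
# flume-avro.conf -- Avro source -> JDBC channel -> HDFS sink
a1.sources = avroSrc
a1.channels = jdbcCh
a1.sinks = hdfsSink

# Avro source listening on localhost:11112
a1.sources.avroSrc.type = avro
a1.sources.avroSrc.bind = localhost
a1.sources.avroSrc.port = 11112
a1.sources.avroSrc.channels = jdbcCh

# JDBC channel (durable, backed by an embedded Derby database)
a1.channels.jdbcCh.type = jdbc

# HDFS sink; DataStream avoids the default SequenceFile wrapping
a1.sinks.hdfsSink.type = hdfs
a1.sinks.hdfsSink.hdfs.path = /user/cloudera/problem7/sink
a1.sinks.hdfsSink.hdfs.fileType = DataStream
a1.sinks.hdfsSink.channel = jdbcCh
```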
- Use the following command to run an Avro client: `flume-ng avro-client -H localhost -p 11112 -F <<Provide your avro file path here>>`
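Note that the agent must already be running before the Avro client can connect. A sketch of the start command, assuming the configuration was saved as `~/flume-avro.conf` with agent name `a1` (both names are assumptions):

```shell
# Start the Flume agent in the foreground with console logging.
flume-ng agent --conf /etc/flume-ng/conf \
  --conf-file ~/flume-avro.conf \
  --name a1 -Dflume.root.logger=INFO,console
```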
- CDH comes prepackaged with a log-generating job controlled by the start_logs, stop_logs and tail_logs scripts. Using these as an aid, solve the problem below. The generated logs can be found at /opt/gen_logs/logs/access.log.
- run start_logs
- Write a Flume configuration such that the logs generated by start_logs are dumped into HDFS at /user/cloudera/problem7/step2. The channel should be non-durable (and hence the fastest type). It should hold a maximum of 1000 messages and commit after every 200 messages.
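A sketch of such a configuration is below. The non-durable, fastest channel is the memory channel; `capacity` bounds the events it can hold and `transactionCapacity` bounds each commit. The agent name `a2` and component names are assumptions.

```properties
# flume-logs.conf -- exec source -> memory channel -> HDFS sink
a2.sources = execSrc
a2.channels = memCh
a2.sinks = hdfsSink

# Exec source tailing the generated access log
a2.sources.execSrc.type = exec
a2.sources.execSrc.command = tail -F /opt/gen_logs/logs/access.log
a2.sources.execSrc.channels = memCh

# Memory channel: non-durable and fastest;
# holds at most 1000 events, commits up to 200 per transaction
a2.channels.memCh.type = memory
a2.channels.memCh.capacity = 1000
a2.channels.memCh.transactionCapacity = 200

# HDFS sink writing plain text into the step2 directory
a2.sinks.hdfsSink.type = hdfs
a2.sinks.hdfsSink.hdfs.path = /user/cloudera/problem7/step2
a2.sinks.hdfsSink.hdfs.fileType = DataStream
a2.sinks.hdfsSink.channel = memCh
```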
- Run the agent.
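Assuming the configuration from the previous step was saved as `~/flume-logs.conf` with agent name `a2` (both names are assumptions), the agent can be started like this:

```shell
flume-ng agent --conf /etc/flume-ng/conf \
  --conf-file ~/flume-logs.conf \
  --name a2 -Dflume.root.logger=INFO,console
```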
- Confirm that logs are getting dumped to HDFS.
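A quick way to check, sketched below; by default the HDFS sink names its files with the `FlumeData` prefix:

```shell
# List the sink directory, then peek at the start of one of the files.
hdfs dfs -ls /user/cloudera/problem7/step2
hdfs dfs -cat /user/cloudera/problem7/step2/FlumeData.* | head
```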
- Run stop_logs.