We will process the data using SQL queries.
- Import all tables from the MySQL database into HDFS as Avro data files. Use compression, with Snappy as the compression codec. The data warehouse directory should be retail_stage.db.
- Create a Hive metastore table that points to the orders data imported by the Sqoop job above. Name the table orders_sqoop.
- Write a query in Hive that shows all orders belonging to the day on which the most orders were placed. Select the data from orders_sqoop.
- Query the table in Impala to show all orders belonging to the day on which the most orders were placed. Select the data from orders_sqoop.
- Now create a table named retail.orders_avro in Hive, stored as Avro, with the same table definition as orders_sqoop. Additionally, this new table should be partitioned by the order month, i.e. year-order_month (example: 2014-01).
- Load data into the orders_avro table from the orders_sqoop table.
- Write a query in Hive that shows all orders belonging to the day on which the most orders were placed. Select the data from orders_avro.
- Evolve the Avro schema of the orders_sqoop table by adding two more fields: order_style (string) and order_zone (int).
- Insert two more records into the orders_sqoop table.
Try your best to solve the scenario above without going through the solution below. If you succeed, use the solution to compare your results. If you could not solve it, I strongly recommend that you go through the concepts again, this time in more depth. Each step below provides a solution to the corresponding points in the problem scenario.
Step 1 – Import all tables
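A minimal sketch of the import, assuming a classic Cloudera quickstart setup; the JDBC URL, the retail_db database name, and the retail_dba/cloudera credentials are assumptions, so substitute the values for your environment:

```bash
# Import every table from MySQL into HDFS as Snappy-compressed Avro files.
# Connection details below are assumptions for a quickstart-style VM.
sqoop import-all-tables \
  --connect jdbc:mysql://quickstart.cloudera:3306/retail_db \
  --username retail_dba \
  --password cloudera \
  --as-avrodatafile \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --warehouse-dir /user/hive/warehouse/retail_stage.db \
  -m 1
```

With --warehouse-dir, Sqoop writes each table into its own subdirectory under retail_stage.db. Sqoop typically also leaves the generated .avsc schema files in the local working directory, which step 2 can reuse.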
Step 2 – Review the generated schema and create a Hive external table
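One common approach, sketched here: grab one of the Avro part files Sqoop produced, extract its schema with avro-tools, stage the schema in HDFS, and point an external Hive table at the imported data. The part-file name and the /user/hive/schemas path are assumptions; list the directory to find the actual file names.

```bash
# Extract the Avro schema from an imported data file and stage it in HDFS.
hdfs dfs -get /user/hive/warehouse/retail_stage.db/orders/part-m-00000.avro
avro-tools getschema part-m-00000.avro > orders.avsc
hdfs dfs -mkdir -p /user/hive/schemas/orders
hdfs dfs -put orders.avsc /user/hive/schemas/orders/
```

If the avro-tools launcher is not on your PATH, the equivalent is java -jar avro-tools-<version>.jar getschema. Then, in the Hive shell:

```sql
-- External table over the Sqoop output; STORED AS AVRO needs Hive 0.14+,
-- older versions must spell out the AvroSerDe row format instead.
CREATE EXTERNAL TABLE orders_sqoop
STORED AS AVRO
LOCATION '/user/hive/warehouse/retail_stage.db/orders'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/orders/orders.avsc');
```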
Step 3 – Run the Hive query
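A sketch of one way to do this without windowing functions, assuming the orders table carries an order_date column as in the standard retail_db dataset: find the busiest day in a derived table, then join back to pull every order from that day.

```sql
-- All orders from the single day with the highest order count.
SELECT o.*
FROM orders_sqoop o
JOIN (
  SELECT order_date, COUNT(1) AS cnt
  FROM orders_sqoop
  GROUP BY order_date
  ORDER BY cnt DESC
  LIMIT 1
) top_day
ON o.order_date = top_day.order_date;
```

Because this formulation avoids analytic functions, the same statement also runs unchanged in Impala (step 4).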
Step 4 – Using Impala
- Launch the Impala shell with the impala-shell command.
- Run invalidate metadata; it syncs Impala's cached metadata with the Hive metastore.
- The tables will now be visible in Impala as well; confirm by running show tables.
- Finally, run the queries above in Impala, as in the sketch below. Some queries that use windowing functions might not run in Impala, but the join-based query from step 3 avoids them.
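A sketch of the Impala session, reusing the join-based query from step 3 (the same schema assumptions apply):

```sql
-- Inside impala-shell:
INVALIDATE METADATA;   -- reload table metadata from the Hive metastore
SHOW TABLES;           -- orders_sqoop should now be listed

SELECT o.*
FROM orders_sqoop o
JOIN (
  SELECT order_date, COUNT(1) AS cnt
  FROM orders_sqoop
  GROUP BY order_date
  ORDER BY cnt DESC
  LIMIT 1
) top_day
ON o.order_date = top_day.order_date;
```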