Let us see the first problem. In this we will import the data from mysql table, do little transformation and then export it back to mysql table
- Using sqoop, import orders table into hdfs to folders /user/cloudera/problem1/orders. File should be loaded as Avro File and use snappy compression
- Using sqoop, import order_items table into hdfs to folders /user/cloudera/problem1/order-items. Files should be loaded as avro file and use snappy compression
- Using Spark Scala load data at /user/cloudera/problem1/orders and /user/cloudera/problem1/orders-items items as dataframes.
- Expected Intermediate Result: Order_Date , Order_status, total_orders, total_amount. In plain english, please find total orders and total amount per status per day. The result should be sorted by order date in descending, order status in ascending and total amount in descending and total orders in ascending. Aggregation should be done using below methods. However, sorting can be done using a dataframe or RDD. Perform aggregation in each of the following ways
- a). Just by using Data Frames API – here order_date should be YYYY-MM-DD format
- b). Using Spark SQL – here order_date should be YYYY-MM-DD format
- c). By using combineByKey function on RDDS — No need of formatting order_date or total_amount
- Store the result as parquet file into hdfs using gzip compression under folder
- Store the result as parquet file into hdfs using snappy compression under folder
- Store the result as CSV file into hdfs using No compression under folder
- create a mysql table named result and load data from /user/cloudera/problem1/result4a-csv to mysql table named result
Try your best to solve the above scenario without going through the solution below. If you could then use the solution to compare your result. If you could not then I strongly recommend that you go through the concepts again (this time in more depth). Each step below provides a solution to the points mentioned in the Problem Scenario.