Let us see how we can use Spark SQL with both the native SQLContext and the HiveContext to apply a variety of transformations.
This problem helps you strengthen and validate skills related to the data analysis objective of the certification exam.
The data model in MySQL on the Cloudera VM looks like this.
Note: Only primary and foreign keys are included in the relational schema diagram shown below.
Provide two solutions for steps 2 to 7:
- Using HiveQL over a HiveContext
- Using Spark SQL over a Spark SQLContext, or by using RDDs
- Create a Hive metastore database named problem6 and import all tables from the MySQL retail_db database into the Hive metastore.
- On the Spark shell, use the data available in the metastore as the source and perform steps 3, 4, 5, and 6. [This proves your ability to use the metastore as a source.]
- Rank products within each department by price, ordering the result by department ascending and rank descending. [This proves you can produce ranked and sorted data on joined data sets.]
- Find the top 10 customers with the most unique product purchases. If more than one customer has the same number of unique product purchases, the customer with the lowest customer_id takes precedence. [This proves you can produce aggregate statistics on joined data sets.]
- On the dataset from step 3, apply a filter so that only products priced below 100 are extracted. [This proves you can use subqueries and also filter data.]
- On the dataset from step 4, extract the details of products purchased by the top 10 customers that are priced at less than 100 USD per unit. [This proves you can use subqueries and also filter data.]
- Store the results of steps 5 and 6 in new metastore tables within Hive. [This proves your ability to use the metastore as a sink.]
Try your best to solve the above scenario without going through the solution below. If you succeed, use the solution to compare your results. If you could not, I strongly recommend that you go through the concepts again (this time in more depth). Each step below provides a solution to the corresponding point in the problem scenario, and illustrative sketches for each step follow the note below. Please go through the video for an in-depth explanation of the solution.
NOTE: The same solution can be implemented using the Spark SQLContext. Just replace the HiveContext object below with a SQLContext object; the rest of the solution remains the same, i.e., the same concepts of querying, using temporary tables, and storing the results back to Hive.
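Step 1 happens outside the Spark shell. A minimal Sqoop sketch, assuming the Cloudera QuickStart VM defaults (MySQL reachable at quickstart:3306 with user retail_dba and password cloudera); adjust the connection details to your environment:

```sh
# Assumption: Cloudera QuickStart VM defaults; change --connect,
# --username, and --password to match your environment.
hive -e "CREATE DATABASE IF NOT EXISTS problem6"

sqoop import-all-tables \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username retail_dba \
  --password cloudera \
  --hive-import \
  --hive-database problem6 \
  --create-hive-table \
  --autoreset-to-one-mapper   # fall back to one mapper for any table without a primary key
```

import-all-tables keeps the MySQL table names, so the Hive tables land in problem6 under the same names used in the sketches below.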
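Step 2 only needs a HiveContext on the Spark shell, built on the shell's default SparkContext. A minimal sketch, assuming the Spark 1.x shell shipped with the CDH VM:

```scala
import org.apache.spark.sql.hive.HiveContext

// `sc` is the SparkContext provided by spark-shell
val hiveContext = new HiveContext(sc)

// point at the database created in step 1 and verify the import
hiveContext.sql("USE problem6")
hiveContext.sql("SHOW TABLES").show()
```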
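Step 3 is a window-function query over products joined up to departments. The sketch below assumes the standard retail_db schema (products.product_category_id → categories.category_id, categories.category_department_id → departments.department_id); rankedProducts is a name of my choosing:

```scala
// Step 3: rank products within each department by price, then
// order by department ascending and rank descending
val rankedProducts = hiveContext.sql("""
  SELECT d.department_id, d.department_name,
         p.product_id, p.product_name, p.product_price,
         RANK() OVER (PARTITION BY d.department_id
                      ORDER BY p.product_price) AS price_rank
  FROM products p
  JOIN categories c ON p.product_category_id = c.category_id
  JOIN departments d ON c.category_department_id = d.department_id
  ORDER BY d.department_id ASC, price_rank DESC
""")

// expose the result to later steps as a temporary table
rankedProducts.registerTempTable("ranked_products")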
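Step 4 aggregates over the customers–orders–order_items join, counting distinct products per customer; the secondary sort on customer_id implements the tie-break rule:

```scala
// Step 4: top 10 customers by number of distinct products purchased,
// ties broken by the lowest customer_id
val top10Customers = hiveContext.sql("""
  SELECT c.customer_id, c.customer_fname, c.customer_lname,
         COUNT(DISTINCT oi.order_item_product_id) AS unique_products
  FROM customers c
  JOIN orders o       ON c.customer_id = o.order_customer_id
  JOIN order_items oi ON o.order_id = oi.order_item_order_id
  GROUP BY c.customer_id, c.customer_fname, c.customer_lname
  ORDER BY unique_products DESC, c.customer_id ASC
  LIMIT 10
""")

top10Customers.registerTempTable("top10_customers")
```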
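Step 5 reuses the temporary table registered in step 3, which plays the role of the subquery; product_price < 100 is the reading of "products priced below 100" used here:

```scala
// Step 5: keep only products priced under 100 from the ranked dataset
val cheapRankedProducts = hiveContext.sql(
  "SELECT * FROM ranked_products WHERE product_price < 100")
```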
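Step 6 combines the step 4 result with the price filter. This sketch joins against the step 4 temporary table instead of writing an IN subquery, since WHERE-clause subqueries were not supported in older Spark SQL releases; the join is equivalent here:

```scala
// Step 6: details of products priced under 100 USD that were
// purchased by the top 10 customers; the join on top10_customers
// stands in for an IN subquery
val top10CustomerProducts = hiveContext.sql("""
  SELECT DISTINCT p.*
  FROM products p
  JOIN order_items oi    ON p.product_id = oi.order_item_product_id
  JOIN orders o          ON oi.order_item_order_id = o.order_id
  JOIN top10_customers t ON o.order_customer_id = t.customer_id
  WHERE p.product_price < 100
""")
```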
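Step 7 writes both results back through the metastore. The table names below are illustrative, and problem6 is assumed to still be the current database from the USE statement in step 2:

```scala
// Step 7: persist the step 5 and step 6 results as new metastore tables;
// with a HiveContext, saveAsTable creates a managed table (Parquet by default)
cheapRankedProducts.write.saveAsTable("products_under_100_ranked")
top10CustomerProducts.write.saveAsTable("top10_customer_products_under_100")
```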