Let us revise how to read data from files and process the data using map reduce APIs. If you are new to Scala and want to learn it in detail, please go through this course.
Setting up Data Sets
Before we apply logic to process the data, let us set up data sets so that we can work with somewhat realistic use cases while learning these important concepts.
Performing I/O Operations
Scala provides a few APIs to read data from files. If you want to understand file I/O in detail, you can explore the Java APIs and use them as part of Scala programming. Here are the steps involved in reading data into a collection.
- import scala.io.Source
- Use Source.fromFile to read data into memory as character array/buffered source
- Use getLines to split the data on the newline character and create a collection of lines
- We read data from order_items for demonstration purposes into a variable called orderItems
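The steps above can be sketched as follows. Since the actual order_items file is not available here, this sketch first writes a small temporary file with made-up records in the same comma-delimited layout; the path and records are assumptions for illustration only.

```scala
import java.nio.file.Files
import scala.io.Source

// Create a small temporary file standing in for the order_items data set
// (the records below are invented sample data for illustration)
val path = Files.createTempFile("order_items", ".txt")
Files.write(path, "1,1,957,1,299.98,299.98\n2,2,1073,1,199.99,199.99\n".getBytes)

// Step 1 and 2: import scala.io.Source and read the file as a buffered source
val source = Source.fromFile(path.toFile)

// Step 3: getLines uses the newline character as the delimiter;
// toList materializes the lines into a collection
val orderItems = source.getLines().toList
source.close()

println(orderItems.size)
```

Closing the source after reading releases the underlying file handle; getLines is lazy, so convert it to a List before closing.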
Using map reduce APIs
- Get all order items for order id 2 – use filter API
- Extract the order item subtotal for each of the items belonging to order id 2 – use map API
- Add order item subtotal to get revenue for order id 2 – use reduce API
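The three steps above can be chained on a collection. This sketch uses a hard-coded list of made-up order_items records (comma-delimited, with order_id as the second field and subtotal as the fifth) instead of the data read from the file, so the field positions and values here are assumptions for illustration.

```scala
// Invented sample records: order_item_id,order_id,product_id,quantity,subtotal,product_price
val orderItems = List(
  "1,1,957,1,299.98,299.98",
  "2,2,1073,1,199.99,199.99",
  "3,2,502,5,250.0,50.0",
  "4,2,403,1,129.99,129.99"
)

// filter: keep only the items belonging to order id 2
val orderItemsFor2 = orderItems.filter(item => item.split(",")(1).toInt == 2)

// map: extract the subtotal (fifth field) from each remaining item
val subtotals = orderItemsFor2.map(item => item.split(",")(4).toDouble)

// reduce: add the subtotals to get the revenue for order id 2
val revenue = subtotals.reduce((total, subtotal) => total + subtotal)

println(revenue)
```

Each API takes a function as its argument: filter keeps elements for which the predicate returns true, map transforms each element, and reduce combines elements pairwise into a single value.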
The tuple is another powerful structure used in many modern programming languages.
- It is a generic object type whose elements are accessed by position using the underscore notation (_1, _2, and so on)
- A tuple represents a record with elements of heterogeneous types
- It eliminates the need to define a class and then instantiate objects just to group related values
- Knowledge of tuples is very important for getting into Spark later
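The points above can be illustrated with a small sketch; the field layout and values below are invented for demonstration.

```scala
// A tuple groups heterogeneous values without defining a class.
// Here the positions stand for (order_item_id, order_id, product_id, subtotal).
val orderItem = (1, 2, 957, 299.98)

// Elements are accessed positionally with the underscore notation (1-based)
val orderId = orderItem._2
val subtotal = orderItem._4

// Tuples pair naturally with map reduce APIs, e.g. (order_id, subtotal) pairs
val pairs = List((2, 199.99), (2, 250.0), (2, 129.99))
val revenue = pairs.map(pair => pair._2).sum

println(s"order $orderId subtotal $subtotal, order 2 revenue $revenue")
```

Because each position can hold a different type, the compiler still tracks them precisely: orderItem above is a Tuple4[Int, Int, Int, Double]. This pattern of key–value tuples is exactly what Spark's pair RDD operations build on.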