You need the right data sets for practice. Functionally correct data matters more than volume, because it lets you come up with realistic use cases.
The following data sets are available in our GitHub account:
- and more
If you are using your own environment, we recommend setting up the data sets and copying them to that environment. In the lab, the data is available both on the local file system and in HDFS:
- Local Path: /data
- HDFS: /public
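
As a rough illustration of how the two lab locations above might be checked, or how a local copy might be pushed into HDFS in your own environment, here is a minimal Python sketch. It assumes the standard `hdfs dfs -ls` and `hdfs dfs -put` client commands are installed. The helper names and the `/user/your_user/data` target are hypothetical and are not part of the lab setup.

```python
import shutil
import subprocess
from pathlib import Path

LOCAL_DATA = "/data"    # local path used in the lab
HDFS_DATA = "/public"   # HDFS path used in the lab


def hdfs_ls_command(hdfs_path=HDFS_DATA):
    """Build the command that lists an HDFS directory."""
    return ["hdfs", "dfs", "-ls", hdfs_path]


def hdfs_put_command(local_path, hdfs_target):
    """Build the command that copies a local path into HDFS."""
    return ["hdfs", "dfs", "-put", str(local_path), hdfs_target]


def list_local_datasets(base=LOCAL_DATA):
    """Return the sorted entry names under the local data directory.

    Returns an empty list when the directory does not exist.
    """
    base = Path(base)
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.iterdir())


def run_if_available(command):
    """Run the command only when its executable is on the PATH."""
    if shutil.which(command[0]) is None:
        print("skipping (client not found):", " ".join(command))
        return None
    return subprocess.run(command, check=False)


if __name__ == "__main__":
    # Show what is available locally, then list the HDFS copy.
    print(list_local_datasets())
    run_if_available(hdfs_ls_command())
    # Hypothetical target directory for your own environment:
    # run_if_available(hdfs_put_command(LOCAL_DATA, "/user/your_user/data"))
```

The commands are built as lists and only executed when an `hdfs` client is found, so the sketch can be read or run safely on a machine without Hadoop installed.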