Apache Hive is an open source tool that provides an interface to create logical databases for structured data, to load/insert data, and to process that data using Hive Query Language (HiveQL, similar to SQL). The data itself is stored in HDFS. If the data is structured, tables can be created on top of it. The metadata generated by table creation commands is stored in an RDBMS such as MySQL.
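As a minimal sketch of what this looks like in practice (the database, table, and column names here are hypothetical), creating a database and a table in HiveQL could be:

```sql
-- Create a logical database; its files live under Hive's warehouse directory in HDFS
CREATE DATABASE IF NOT EXISTS retail_db;
USE retail_db;

-- The table definition's metadata goes to the metastore RDBMS (e.g. MySQL),
-- while the table's data is stored as plain files in HDFS
CREATE TABLE orders (
  order_id INT,
  order_date STRING,
  order_status STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
```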
As depicted in the picture, Hive requires:
- Hive binaries (a set of jar files containing compiled Java code)
- A Hadoop cluster (HDFS to store the data, and YARN + MapReduce to process it)
- An RDBMS database (to store the metadata)
- Most Hive queries are compiled into MapReduce programs and submitted as MapReduce jobs.
- As the technology stack evolves, Hive also supports other execution frameworks such as Tez and Spark.
- From a certification point of view, only Hive DDL and data load commands are important.
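Tying the points above together, the data load commands (with hypothetical paths and table names) and an engine switch could look like:

```sql
-- Load a file from the local filesystem into a Hive table;
-- Hive copies the file into the table's HDFS directory
LOAD DATA LOCAL INPATH '/data/retail/orders.csv' INTO TABLE orders;

-- Load a file that is already in HDFS (the file is moved into the table directory)
LOAD DATA INPATH '/user/hive/staging/orders.csv' INTO TABLE orders;

-- A simple aggregation; this typically compiles into a MapReduce job
SELECT order_status, count(*) FROM orders GROUP BY order_status;

-- Where Tez (or Spark) is available, the execution engine can be switched
SET hive.execution.engine=tez;
```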
Planning Data Analysis training
Here is the introduction video about the plan for preparing for the “Data Analysis” category of the certification.
Here is the video that explains the overall Hive architecture in detail.