Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
A typical Hadoop cluster can have a few hundred to a few thousand nodes (physical servers). Setting up plain-vanilla Apache Hadoop and monitoring it can be a tedious task, and hence there are several distributions which provide tools for setting up and managing clusters.
Major Hadoop Distributions
There are several distributions of Hadoop, each supported by its respective vendor.
- Cloudera
- Hortonworks
- MapR
- Amazon EMR
These vendors provide training, support, and services for their clients. As part of their distributions, they also provide tools that simplify cluster setup as well as ongoing operations.
To do: a video will be provided to explain the differences in detail
Hadoop ecosystem
The Hadoop ecosystem can be divided into core components and other tools. HDFS (Hadoop Distributed File System) is the foundation of the ecosystem, and MapReduce is the distributed computing framework developed in tandem with HDFS. The other tools can be categorized as MapReduce-based or non-MapReduce-based.
- Hadoop core components
  - HDFS
  - MapReduce
- MapReduce-based tools
  - Hive – A logical database on top of HDFS with a SQL-based interface that uses MapReduce to process the data
  - Pig – A data flow language interface on top of MapReduce to process data in HDFS
  - Sqoop – A generic data movement tool that copies data between relational databases and HDFS using MapReduce, leveraging its distributed processing capabilities
  - Mahout – A machine learning library that uses the MapReduce framework to process data
  - Oozie – A MapReduce-based workflow tool
- Non-MapReduce-based tools
  - Flume – A data integration tool whose agents collect streaming data from sources such as web logs and load it into targets such as HDFS
  - Spark – An in-memory data processing engine that can significantly accelerate data processing
  - Impala – An alternative to Hive for processing lower volumes of data in a quicker, more interactive fashion
  - HBase – A NoSQL database for building operational applications at scale
Vendors such as Cloudera and Hortonworks package all these tools as part of their distributions and provide wizards and tools to set up and maintain larger clusters.
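To get a feel for the MapReduce paradigm that many of these tools build on, here is a minimal sketch in plain Python. This is not the Hadoop Java API, just an illustration of the three phases (map, shuffle, reduce) using word count, the canonical MapReduce example; in a real cluster the framework runs each phase in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key -- here, sum the counts."""
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["hadoop stores data", "spark and hadoop process data"]
counts = reduce_phase(shuffle_phase(map_phase(documents)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'spark': 1, 'and': 1, 'process': 1}
```

Hive and Pig generate jobs of exactly this shape from higher-level SQL or data flow scripts, which is why they inherit MapReduce's batch-oriented behavior.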
Job roles in Hadoop
There are several specialized roles in Hadoop.
Job Role | Experience required | Desired Skills |
---|---|---|
Hadoop Developer | 0-7 years | Hadoop, programming using Java, Spark, Hive, Pig, Sqoop etc |
Hadoop Administrator | 0-10 years | Linux, Hadoop administration using distributions |
Big Data Engineer | 3-15 years | Data warehousing, ETL, Hadoop, Hive, Pig, Sqoop, Spark etc |
Big Data Solutions Architect | 12-15 years | Deep understanding of the Big Data ecosystem, such as Hadoop, NoSQL etc |
Infrastructure Architect | 12-15 years | Deep understanding of infrastructure as well as the Big Data ecosystem |
For most of the above job roles, two years of hands-on Hadoop experience as part of the overall years of experience suffices.
Hadoop Certifications
Vendors such as Cloudera and Hortonworks not only provide training and support, they also issue certifications that are highly recognized in the industry. Most of the certifications are practice-oriented and test the taker's level of understanding.
Why should one get certified?
- They test your understanding of several Hadoop ecosystem tools
- They instill confidence in individuals while delivering projects
- Certifications can give some traction in the job search process
- They instill confidence when taking interviews
- There are separate certifications for separate roles
- They test both the breadth and the depth of ecosystem tools
- Most of the certifications are no longer objective (multiple-choice); they are scenario-based, simulating real-world problems
Where should one get certified?
- Certifications issued by major Big Data vendors such as Cloudera, Hortonworks, and Databricks are well recognized
- Most of these certifications are online and proctored
- Certifications can be taken from anywhere with a computer and a webcam
- There is no need to visit proctoring centers
What are the certifications that are available?
- Administrators
- CCAH – Cloudera Certified Administrator for Apache Hadoop
- HDPCA – Hortonworks Data Platform Certified Administrator
- Developers
- CCA – Cloudera Certified Associate Spark and Hadoop Developer (HDFS, Sqoop, Flume, Spark with Python, Spark with Scala, Hive, Impala and Avro tools)
- HDPCD – Hortonworks Data Platform Certified Developer (Flume, Hive, Pig and Sqoop)
- HDPCD:Java – Hortonworks Data Platform Certified Developer (Java MapReduce APIs)
- HDPCD:Spark – Hortonworks Data Platform Certified Developer (Spark)
- There is a considerable amount of overlap among CCA, HDPCD, and HDPCD:Spark
- Data Engineers
- CCP DE – Cloudera Certified Professional Data Engineer (Sqoop, Flume, Hive and Oozie)
There are other certifications provided by other vendors as well, but these are the most popular.
How can one prepare for certifications?
- itversity, llc is a startup that runs a YouTube channel called itversity, where video content is published
- Content is developed based on the published curriculum of each certification
- Follow-up videos are added based on feedback from test takers
- Around 40 people have already acknowledged that they got certified following the content on the channel
How can one access the content to prepare for the certification?
Here is the table mapping each certification to its playlist. A few of the certifications are still in progress. Please click on a certification name to be redirected to the respective playlist. There will be more blog posts on each of these Hadoop certifications.
Category | Certification Name | Status |
---|---|---|
Administration | Cloudera Certified Administrator for Apache Hadoop (CCAH) | In progress |
Administration | Hortonworks Data Platform Certified Administrator (HDPCA) | Done |
Developer | Cloudera Certified Associate Spark and Hadoop Developer (CCA) – HDFS, Sqoop, Flume, Spark with Python, Spark with Scala, Avro tools, Hive, Impala etc | Done |
Developer | Hortonworks Data Platform Certified Developer (HDPCD) – Hadoop, Sqoop, Flume, Hive, Pig etc | Done |
Developer | Hortonworks Data Platform Certified Developer – Java (HDPCD:Java) | Almost done |
Developer | Hortonworks Data Platform Certified Developer – Spark (HDPCD:Spark) | Not started |
Data Engineer | Cloudera Certified Professional Data Engineer (CCP DE) | Just started |
- Cloudera Certified Associate Spark and Hadoop Developer (CCA)
- Hortonworks Data Platform Certified Administrator (HDPCA)
- Hortonworks Data Platform Certified Developer (HDPCD)
- Hortonworks Data Platform Certified Developer – Java (HDPCD:Java)
How can one discuss further about Big Data or Certifications?
itversity, llc manages several LinkedIn groups, and here is the list with URLs for Big Data and certifications.
Stay connected with us!!!
Here are the details for staying connected with itversity, llc. Please click on these hyperlinks to stay connected using the platform of your choice.
I want to start training with you.
Hi ,
I am interested in doing a Hadoop certification but not sure which one to consider. I am split between Cloudera Certified Administrator for Apache Hadoop (CCAH) and CCA.
My background is in Software QA, but I want to do these certifications as they are very much in demand, plus I will get to learn new technology.
– How much programming do these certifications require? I am not a very technical person, so which one do you recommend for me?
– What are the job prospects? Are both certifications in demand?
– How does the process go? Can we register for a certification at any time? I want to prepare before I schedule a date for the exam.
Please let me know; any help will be really appreciated.
Thank you
Hi ,
As per the new syllabus, Spark coding can be done in Scala as well as Python. I have practiced in Python, and I just want to know whether that will be enough for taking the CCA Spark and Hadoop Developer certification exam.
Please suggest.
Thank you.
Hi,
I want to join your CCAH (Cloudera Certified Administrator for Apache Hadoop) course and do this certification. I am working as a network administrator. Please suggest how to proceed; I want to join a class if one is available in Pune. I am waiting for your reply.
Thanks.
Sir, are there any courses on your channel for Machine Learning/Deep Learning/Artificial Intelligence?