As part of this topic, we will cover the following objectives from the CCAH curriculum:
- Describe the function of HDFS Daemons
- Describe the normal operation of an Apache Hadoop cluster, both in data storage and in data processing.
- Identify current features of computing systems that motivate a system like Apache Hadoop.
- Classify major goals of HDFS Design
- Given a scenario, identify an appropriate use case for HDFS Federation
- Identify components and daemons of an HDFS HA-Quorum cluster
- Analyze the role of HDFS security (Kerberos)
- Determine the best data serialization choice for a given scenario
- Describe file read and write paths
- Identify the commands to manipulate files in the Hadoop File System Shell
Major Goals for HDFS Design
- Distributed – files are split into fixed-size blocks (default 128 MB) that are spread across the nodes of the cluster
- Hardware Failure – detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS; each block is replicated across DataNodes (default replication factor 3). See the configuration sketch after this list.
- Streaming Data Access
- Applications that run on HDFS need streaming access to their data sets.
- They are not general purpose applications that typically run on general purpose file systems.
- HDFS is designed more for batch processing rather than interactive use by users.
- The emphasis is on high throughput of data access rather than low latency of data access.
- POSIX imposes many hard requirements that are not needed for applications targeted at HDFS.
- POSIX semantics in a few key areas have been traded to increase data throughput rates.
- Large Data Sets – tuned for large files, typically gigabytes to terabytes in size
- Simple Coherency Model – applications follow a write-once-read-many access model; once written and closed, HDFS files are effectively immutable (see the read/write sketch after this list)
- Data Locality – moving computation to the data is cheaper than moving large data sets to the computation
- Portability Across Heterogeneous Hardware and Software Platforms – HDFS is a logical file system implemented in Java on top of each node's native file system, so it runs on a wide range of hardware and operating systems
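
To make the block-size and replication goals concrete, here is a minimal Java sketch using the Hadoop FileSystem API. It assumes a reachable cluster whose core-site.xml/hdfs-site.xml are on the classpath; dfs.blocksize and dfs.replication are the standard hdfs-site.xml keys behind the 128 MB and 3x defaults mentioned above, and the override values used here are arbitrary examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDefaults {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Client-side overrides (example values, not recommendations):
        // these affect only files created through this client.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB
        conf.setInt("dfs.replication", 2);

        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/"); // any path; defaults apply per file system
        System.out.println("Default block size : " + fs.getDefaultBlockSize(p));
        System.out.println("Default replication: " + fs.getDefaultReplication(p));
        fs.close();
    }
}
```

Note that both settings are per-file properties fixed at creation time, which is why a client-side override is enough to change them for new files.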
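And a minimal sketch of the write-once-read-many coherency model using the same client API, again assuming a reachable cluster; the path /tmp/worm-demo.txt is purely hypothetical. Once the output stream is closed the file can only be read (or, on recent Hadoop versions, appended to); there is no API to modify its bytes in place.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/worm-demo.txt"); // hypothetical path

        // Write path: the client obtains target DataNodes from the NameNode
        // and streams the bytes; the file is sealed when the stream closes.
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read path: the client gets block locations from the NameNode and
        // reads directly from DataNodes. Any number of readers can do this
        // concurrently, since the file's contents can no longer change.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```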