As part of this lesson, we will look into the following topics related to Hadoop Cluster Planning.
- Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster.
- Analyze the choices in selecting an OS
- Understand kernel tuning and disk swapping
- Given a scenario and workload pattern, identify a hardware configuration appropriate to the scenario
- Given a scenario, determine the ecosystem components your cluster needs to run in order to fulfill the SLA
- Cluster sizing: given a scenario and frequency of execution, identify the specifics for the workload, including CPU, memory, storage, disk I/O
- Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster
- Network Topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario
Cluster Sizing – Analysis
Following are the criteria for deciding on the capacity of the cluster; a rough sizing sketch follows this list.
- Workload – Identify the workload on the cluster, including all the applications that are part of the Hadoop ecosystem as well as complementary applications.
- CPU – The number of cores configured across the cluster
- Memory – Total memory in the cluster
- Storage – Amount of data that can be stored in the cluster
- Disk I/O – The volume of read and write operations in the cluster
- Cloudera Manager displays all of this information as charts on its home page
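Below is a rough, back-of-the-envelope sizing sketch that ties these criteria together. All of the input numbers (data volume, per-node hardware, overhead factor) are hypothetical assumptions used only for illustration, not recommendations.

```python
import math

# Back-of-the-envelope cluster sizing sketch.
# All input values below are hypothetical assumptions, not recommendations.

raw_data_tb = 100         # data to be stored in HDFS (before replication)
replication_factor = 3    # HDFS default replication factor
temp_overhead = 0.25      # assumed extra space for MapReduce intermediate output

disk_per_node_tb = 12     # e.g. 6 x 2 TB JBOD drives per worker node
cores_per_node = 12
ram_per_node_gb = 48

# Raw storage needed = data * replication, plus temporary/intermediate space
required_storage_tb = raw_data_tb * replication_factor * (1 + temp_overhead)

# Minimum number of worker nodes, driven by storage alone
nodes = math.ceil(required_storage_tb / disk_per_node_tb)

print(f"Required raw storage        : {required_storage_tb:.0f} TB")
print(f"Worker nodes (storage-bound): {nodes}")
print(f"Total cores                 : {nodes * cores_per_node}")
print(f"Total memory                : {nodes * ram_per_node_gb} GB")
```

In practice the node count would also be checked against CPU, memory, and disk I/O requirements of the workload, not just storage.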
Disk Sizing and Configuration
Here are the details about disk sizing and configuration.
- JBOD – Just a Bunch Of Disks
- JBOD should be used to mount storage onto the slave (worker) nodes for HDFS
- RAID should not be used on worker nodes, as fault tolerance is already provided by HDFS replication
- LVM should be disabled
- A RAID configuration might be considered for storing the edit logs and fsimage of the NameNode.
- SAN (network storage)
- A SAN might be used for a copy of the edit logs and fsimage, but not for HDFS block storage
- Virtualization should not be used.
- Disk Sizing Requirements
- One hard drive (1-2 TB), 2 cores, and 6-8 GB RAM works well for most configurations.
- Disk sizing requirements for HDFS = size of data to be stored * average replication factor
- If you want to store 100 TB of data with an average replication factor of 3.5, then 350 TB of storage needs to be provisioned (worked out in the sketch below)
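A minimal sketch of the formula above, using the 100 TB / replication factor 3.5 example from the text:

```python
def hdfs_raw_storage_tb(data_size_tb: float, avg_replication_factor: float) -> float:
    """Disk sizing for HDFS = size of data to be stored * average replication factor."""
    return data_size_tb * avg_replication_factor

# Example from the text: 100 TB of data at an average replication factor of 3.5
print(hdfs_raw_storage_tb(100, 3.5))  # 350.0 -> provision 350 TB of raw storage
```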
Network Usage in Hadoop
- Cluster housekeeping traffic (minimal)
- Client metadata operations on the NameNode (minimal)
- Block data transfer (can be network intensive, e.g., re-replication after a disk/node failure)
- MapReduce
- The shuffle and sort phase between mappers and reducers uses the network
- Network design
- 1 Gb Ethernet – cheaper
- 10 Gb Ethernet – more expensive; HDFS and MapReduce performance might not benefit much, though it might help HBase (see the sketch after this list)
- Fiber optics are usually not necessary
- North/South traffic pattern (between clients outside the cluster and the cluster)
- East/West traffic pattern (between nodes within the cluster; Hadoop predominantly exhibits this pattern)
- Tree topology vs. spine fabric
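To illustrate the 1 Gb versus 10 Gb trade-off, here is a rough sketch of the time a shuffle transfer spends on the wire. The data volume, node count, and link-efficiency figures are hypothetical assumptions, and the sketch models only the network leg; in practice disk and CPU often dominate, which is why 10 Gb Ethernet may not speed up HDFS and MapReduce jobs as much as the raw numbers suggest.

```python
# Rough estimate of shuffle time on the wire for 1 Gb vs 10 Gb Ethernet.
# All input numbers are hypothetical assumptions for illustration only.

shuffle_data_gb = 500     # assumed total map output shuffled across the network
worker_nodes = 20         # assumed nodes participating in the shuffle
link_efficiency = 0.7     # assumed usable fraction of the nominal link rate

def wire_time_seconds(data_gb: float, link_gbps: float) -> float:
    """Seconds to push data_gb through one NIC at the nominal rate * efficiency."""
    bits = data_gb * 8e9                       # GB -> bits (decimal GB for simplicity)
    return bits / (link_gbps * 1e9 * link_efficiency)

per_node_gb = shuffle_data_gb / worker_nodes   # data crossing each node's NIC (rough)
for link_gbps in (1, 10):
    print(f"{link_gbps:>2} GbE: ~{wire_time_seconds(per_node_gb, link_gbps):.0f} s per node")
```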