HDFS commands are used to interact with files stored in HDFS.
- HDFS stands for Hadoop Distributed File System
- There are several commands in HDFS, but the scope of this topic is limited to data ingestion commands
- When files containing data are copied to HDFS, they are divided into 128 MB blocks and those blocks are stored physically across the Hadoop cluster. The component which manages storage of these physical files (for blocks) is called the Datanode. dfs.blocksize is the parameter which controls the size of each block.
- Each block is copied on to multiple nodes for fault tolerance. By default the number of copies is 3, defined by a parameter called dfs.replication (replication factor)
- With a 128 MB block size and a replication factor of 3, a 1 GB file will be divided into 8 blocks, and each block cloned into 3 copies (hence 8 * 3 = 24 physical blocks for a 1 GB file).
- The mapping between file names, block names and block locations is stored in an in-memory namespace managed by the Namenode.
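The block arithmetic above can be sketched with plain shell arithmetic; the file size (1 GB), block size (128 MB) and replication factor (3) are the values from the text, and ceiling division is used because a partial final block still occupies one block entry:

```shell
# Sketch of the block math described above (values taken from the text)
FILE_SIZE_MB=1024      # 1 GB file
BLOCK_SIZE_MB=128      # dfs.blocksize expressed in MB
REPLICATION=3          # dfs.replication (replication factor)

# Ceiling division: a partial final block still counts as one block
BLOCKS=$(( (FILE_SIZE_MB + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB ))
PHYSICAL=$(( BLOCKS * REPLICATION ))

echo "logical blocks:  $BLOCKS"     # 8
echo "physical blocks: $PHYSICAL"   # 24
```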
Here is the video which covers relevant topics to store files in HDFS
Here is the video which covers relevant topics to copy files from HDFS
Run hadoop fs and hit enter to see the list of available commands
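For example, assuming a Hadoop client is installed and on the PATH of the VM or Gateway node, the command list and per-command help can be printed like this (these invocations require a Hadoop installation, so they are shown as a sketch):

```shell
# Print the list of all hadoop fs subcommands (usage goes to stderr)
hadoop fs

# Print detailed help for a single subcommand
hadoop fs -help copyFromLocal
```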
- Make sure to copy the data set to the VM or the Gateway node
Use the hadoop fs -copyFromLocal command to copy the data into HDFS as shown in the video
- It takes 2 parameters
- the first one is the path of the data to be copied in the local file system of the VM or Gateway node
- the second parameter is the path in HDFS to which the data has to be copied
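Putting the two parameters together, a minimal end-to-end sketch might look like this; the local dataset path /data/retail_db and the HDFS target /user/training are hypothetical paths for illustration, not paths from the text, and the commands need a running cluster:

```shell
# Create the target directory in HDFS (hypothetical path)
hadoop fs -mkdir -p /user/training

# First parameter: local path on the VM/Gateway node
# Second parameter: destination path in HDFS
hadoop fs -copyFromLocal /data/retail_db /user/training

# Verify that the files landed in HDFS
hadoop fs -ls /user/training/retail_db
```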