HDFS Introduction and Architecture


Hadoop Distributed File System (HDFS) is a distributed file system layered on top of the local file system of each machine in the cluster. HDFS is an efficient and reliable distributed storage system that runs on commodity hardware. It stores data in the form of blocks and replicates these blocks across the cluster to ensure availability and fault tolerance.
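For example, when a file is copied into HDFS through the HDFS shell, the splitting into blocks and the replication happen transparently. The following is a minimal sketch, assuming Hadoop 2.x or later with the hdfs command on the PATH; the paths are placeholders:

    # create a target directory in HDFS (/user/demo/input is a placeholder path)
    hdfs dfs -mkdir -p /user/demo/input
    # copy a local file into HDFS; it is split into blocks and replicated automatically
    hdfs dfs -put /tmp/sample.txt /user/demo/input/
    # list the directory to confirm that the file is stored
    hdfs dfs -ls /user/demo/input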

By default, the replication factor in Hadoop is 3. HDFS supports horizontal scaling over commodity hardware, so a cluster can keep up with growing resource needs simply by adding more nodes.
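The cluster-wide default comes from the dfs.replication property in hdfs-site.xml, and the replication factor of files already stored in HDFS can be changed from the shell. A short sketch, again using a placeholder path:

    # raise the replication factor of one file to 4 and wait (-w) until the extra replicas exist
    hdfs dfs -setrep -w 4 /user/demo/input/sample.txt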

HDFS has two types of nodes or components, namely the NameNode and the DataNode. In a Hadoop cluster there is a single NameNode and any number of DataNodes. In a single-node cluster, the same machine works as both NameNode and DataNode. In a multi-node cluster, one node works only as the NameNode while the other nodes work as DataNodes.

In a multi-node cluster, data blocks are stored and replicated across the DataNodes according to the replication factor. The placement policy keeps at least one replica of each block on the same rack, so a nearby copy can be read if the first request is slow or fails, while another replica is kept on a separate rack so that the data survives a rack-level hardware or power failure.
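To see where the blocks of a particular file actually ended up, the HDFS file system checker can print the block list together with the DataNodes (and racks, if rack awareness is configured) holding each replica. A minimal sketch with a placeholder path:

    # show the blocks of the file and the DataNodes that store each replica
    hdfs fsck /user/demo/input/sample.txt -files -blocks -locations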

[Figure: HDFS architecture. Image source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html]

  • NameNode: The NameNode works as the single entry point and master node of the cluster. It stores the metadata for every file in HDFS, such as the mapping of file blocks to DataNodes, and it provides the interface through which end users interact with the Hadoop cluster.

  • DataNode: DataNodes work as the worker (slave) nodes of the Hadoop cluster. They store the actual data (files) in the form of blocks. The NameNode keeps track of all DataNodes in the cluster with the help of heartbeat messages; its current view of the DataNodes can be printed as shown below.
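One way to inspect the NameNode's view of its DataNodes (live nodes, capacity, time of the last heartbeat) is the dfsadmin report; a short sketch, assuming it is run by a user with access to the cluster:

    # print cluster capacity plus a per-DataNode summary, including the last heartbeat
    hdfs dfsadmin -report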

In a Hadoop cluster you can see these components running as Java processes. If you have a Hadoop cluster (a machine with Hadoop installed), open a terminal and run:

  • start-all.sh to start all of the HDFS and MapReduce (YARN) daemons

  • jps to see which components are currently running, as illustrated below
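On a single-node cluster with YARN, jps typically prints something like the following; the process IDs are illustrative, and the exact set of daemons depends on the Hadoop version and configuration:

    4201 NameNode
    4378 DataNode
    4590 SecondaryNameNode
    4823 ResourceManager
    5011 NodeManager
    5327 Jps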