HDFS - Data Read Operation
HDFS is a distributed file system that stores data over a network of commodity machines. HDFS follows a streaming data access pattern, which means it supports write-once, read-many semantics. The read operation is essential in HDFS, and it is important to understand how reading is actually performed in HDFS (Hadoop Distributed File System) while working with it. Let's understand how an HDFS data read works.
Reading from HDFS seems simple, but it is not. Whenever a client sends a read request to HDFS, it is not granted direct access to the DataNodes where the actual data is stored, because the client has no information about the data, i.e. on which DataNodes the data is stored or where its replicas are kept. Without this information about the DataNodes, the client can never access or read data from HDFS.
That is why the client first sends the request to the NameNode, since the NameNode holds all the metadata required to perform a read operation on HDFS. Once the NameNode receives the request, it responds with all the relevant information: the DataNodes involved, the locations of the replicas, the number of data blocks and their locations, and so on. With this information from the NameNode, the client can now read the data. The client reads the blocks in parallel, since replicas of the same data are available across the cluster. Once the whole data has been read, the client combines the blocks back into the original file.
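To make this concrete, below is a minimal sketch (not code from this article) of how a client can ask the NameNode for a file's block locations through the Hadoop Java API. The fs.defaultFS address and the file path are placeholders assumed for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // File used later in this article's command example
        Path file = new Path("/dikshant.txt");
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers with one entry per block, including the
        // DataNodes that hold each replica
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}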
Let's understand the HDFS data read with a suitable diagram.
Components we should know before learning the HDFS read operation:
NameNode: The primary purpose of the NameNode is to manage all the metadata. As we know, data is stored in the form of blocks in a Hadoop cluster, so the metadata records on which DataNode, and at which location, each block of a file is stored. The metadata also keeps a log of the transactions happening in the Hadoop cluster: when, and by whom, data was read or written.
DataNode: A DataNode is a program that runs on a slave system; it serves read/write requests from the client and stores data in the form of blocks.
HDFS Client: The HDFS client is an intermediate component between HDFS and the user. It communicates with the DataNodes or the NameNode and fetches the output the user requests.
[Diagram: HDFS read operation - the HDFS client, NameNode, FSDataInputStream, and DataNodes]
In the above image, we can see that we first send the request to our HDFS client, which is a set of programs. This HDFS client then contacts the NameNode, because it has all the information, or metadata, about the file we want to read. The NameNode responds and sends all the metadata back to the HDFS client. Once the HDFS client knows from which locations it has to pick the data blocks, it asks the FSDataInputStream to point to those blocks of data on the DataNodes. The FSDataInputStream then does some processing and makes this data available to the client.
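As a rough illustration of this flow, here is a minimal sketch that opens a file with the Java client API: FileSystem.open() returns an FSDataInputStream, which streams the file's blocks back to the client. The cluster address is again an assumed placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/dikshant.txt"))) {
            // Stream the file's blocks to stdout, 4 KB at a time
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}

Running it prints the contents of /dikshant.txt (the file used in the command example below) to the terminal.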
Let's see how to read data from HDFS.
Using HDFS commands:
With the help of the command below, we can directly read data from HDFS (NOTE: make sure all of your Hadoop daemons are running).
Commands to start the Hadoop daemons:
start-dfs.sh
start-yarn.sh
Syntax For Reading Data From HDFS:
hdfs dfs -get <source-path> <destination-path>
# source-path is the path on HDFS of the file we want to read
# destination-path is where we want to store the read file on the local machine
Command
In our case, we have a file named dikshant.txt with some data in the HDFS root directory. We can use the command below to list the contents of the HDFS root directory.
hdfs dfs -ls /

The command below will read dikshant.txt from the root directory of HDFS and store it in /home/dikshant/Desktop on the local machine.
hdfs dfs -get /dikshant.txt /home/dikshant/Desktop

In the below image, we can observe that the data has been successfully read and stored in the /home/dikshant/Desktop directory; we can now see its contents by opening the file.
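For completeness, the same copy can be done programmatically. Below is a sketch of a Java equivalent of hdfs dfs -get, using FileSystem.copyToLocalFile(); the paths mirror the ones used above, and the cluster address is an assumed placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -get /dikshant.txt /home/dikshant/Desktop
        fs.copyToLocalFile(new Path("/dikshant.txt"),
                           new Path("/home/dikshant/Desktop/dikshant.txt"));
        fs.close();
    }
}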
