Hadoop Installation on Ubuntu (Single Node Cluster)
Step 1: Check Java and Hadoop Versions
Before installing Hadoop, it is important to check whether Java is installed, because
Hadoop is built on Java and requires it to run. The java -version command verifies the
Java installation and reports the installed version. Similarly, hadoop version checks whether
Hadoop is already installed, to avoid conflicts when setting up a new version. If Java is
missing, we must install it before proceeding with the Hadoop installation.
Before installing Hadoop, ensure that Java is installed.
java -version
hadoop version
Step 2: Update and Upgrade the System
Running sudo apt update and sudo apt upgrade -y ensures that all system packages are up
to date. This step prevents dependency issues while installing new software like Java and
Hadoop. Updating the package list ensures we get the latest versions, and upgrading
applies security patches and software improvements.
Updating ensures that all the installed packages are up to date.
sudo apt update
sudo apt upgrade -y
The -y flag automatically confirms updates.
Step 3: Install Java
Hadoop requires Java to execute its processes. OpenJDK 11 is a stable, widely used version
that works well with Hadoop 3.x. By installing it with sudo apt install openjdk-11-jdk -y, we
ensure that Hadoop has the necessary Java runtime environment. This step is crucial
because, without Java, Hadoop will not function.
Hadoop requires Java to run. Install OpenJDK 11 using:
sudo apt install openjdk-11-jdk -y
java -version
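The JAVA_HOME path used in Step 5 depends on where the package installs the JDK; on Ubuntu it is usually /usr/lib/jvm/java-11-openjdk-amd64. An optional way to confirm the actual location on your machine:
readlink -f "$(which java)"
update-alternatives --list java
The first command prints the full path to the java binary (drop the trailing /bin/java to get JAVA_HOME); the second lists all installed JDK alternatives.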
Step 4: Download and Extract Hadoop
Hadoop is downloaded from Apache’s official website using wget. The command fetches
the Hadoop package (hadoop-3.3.6.tar.gz), which is then extracted using tar -xvzf. This
unpacks Hadoop into a directory. Finally, the extracted folder is moved to
/usr/local/hadoop, a common location for system-wide software installations. This makes
Hadoop easily accessible to all users on the system.
Download Hadoop from the official Apache website.
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
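Optionally, the download can be verified before extraction. Apache publishes a .sha512 checksum file alongside the tarball (assuming the same download URL with a .sha512 suffix); compare its value with the locally computed hash:
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum hadoop-3.3.6.tar.gz
cat hadoop-3.3.6.tar.gz.sha512
The two hashes should match; if they do not, re-download the archive.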
Extract the downloaded file:
tar -xvzf hadoop-3.3.6.tar.gz
Move Hadoop to the /usr/local directory for system-wide access:
sudo mv hadoop-3.3.6 /usr/local/hadoop
Step 5: Configure Environment Variables
After installation, we need to configure environment variables to make Hadoop and Java
easily executable from any terminal session. This is done by adding the Hadoop and Java
paths to ~/.bashrc. We define JAVA_HOME, HADOOP_HOME, PATH, and
HADOOP_CONF_DIR, ensuring that the system recognizes Hadoop commands without
requiring full paths.
Edit the ~/.bashrc file to set up Hadoop and Java paths.
nano ~/.bashrc
Add the following lines at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Save and exit (Ctrl + X, then Y, then Enter).
Once the environment variables are added, they need to be applied to the current session.
Running source ~/.bashrc reloads the bash profile so that the changes take effect
immediately, without restarting the terminal. This ensures that Hadoop-related commands
work as expected.
Apply the changes:
source ~/.bashrc
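A quick optional check that the new variables are active in the current shell:
echo $JAVA_HOME
echo $HADOOP_HOME
hadoop version
If hadoop version now runs without a full path, the PATH entries were picked up correctly.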
Step 6: Enable SSH for Hadoop
Hadoop requires SSH (Secure Shell) for communication between nodes in a distributed
environment. Even in a single-node setup, SSH is needed to start and stop Hadoop services
without manually logging in each time. This step is essential because Hadoop’s daemons
interact over SSH.
To enable password-less SSH login, we generate an SSH key pair using ssh-keygen -t rsa -P
"" -f ~/.ssh/id_rsa. The public key is then added to the authorized_keys file using cat
~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys. This setup allows Hadoop daemons to
communicate securely without repeatedly asking for passwords, which is crucial for
automation.
Hadoop requires passwordless SSH access.
ssh localhost
If SSH is not installed, install it using:
sudo apt install ssh -y
Generate SSH keys and configure passwordless SSH:
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Now, verify SSH:
ssh localhost
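If ssh localhost still prompts for a password or fails to connect, an optional check (assuming a systemd-based Ubuntu) is to confirm that the SSH service is enabled and running:
sudo systemctl enable --now ssh
sudo systemctl status ssh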
Step 7: Configure Hadoop Files
Core-Site Configuration:
The core-site.xml file specifies Hadoop’s core settings. The fs.defaultFS property is set to
hdfs://localhost:9000, defining the default Hadoop filesystem as HDFS. The
hadoop.tmp.dir property sets a temporary directory for Hadoop’s intermediate
operations. This configuration is necessary to initialize and manage HDFS correctly.
Edit the core-site.xml file:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following content:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
    <description>A base directory for HDFS and other temporary files.</description>
  </property>
</configuration>
Save and exit.
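Because hadoop.tmp.dir points inside /usr/local/hadoop, which was moved there with sudo and is therefore owned by root, it helps to create the directory now and hand ownership of the Hadoop tree to the user that will run the daemons (a minimal sketch, assuming you run Hadoop as your current non-root user):
sudo mkdir -p /usr/local/hadoop/tmp
sudo chown -R $USER:$USER /usr/local/hadoop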
HDFS-Site Configuration:
The hdfs-site.xml file configures the Hadoop Distributed File System (HDFS). The
dfs.replication property is set to 1, meaning each file block is stored only once, which is
ideal for a single-node setup. The dfs.namenode.name.dir and dfs.datanode.data.dir properties
specify directories for storing NameNode metadata and actual file data, ensuring proper data
organization.
Edit the hdfs-site.xml file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following content:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of replicas for HDFS blocks (set to 1 for single-node cluster).</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hdfs/namenode</value>
    <description>Directory for NameNode metadata.</description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hdfs/datanode</value>
    <description>Directory for DataNode storage.</description>
  </property>
</configuration>
Save and exit
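The NameNode and DataNode directories referenced above do not exist yet; they can be created in advance (assuming ownership of /usr/local/hadoop was already transferred to your user in the previous step):
mkdir -p /usr/local/hadoop/hdfs/namenode /usr/local/hadoop/hdfs/datanode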
MapReduce Configuration:
This file configures the MapReduce framework in Hadoop. The
mapreduce.framework.name property is set to yarn, meaning Hadoop will use YARN to
manage computational resources. The mapreduce.jobhistory.address property is set to
localhost:10020, enabling the job history server to track completed MapReduce jobs. This
configuration is essential for executing and monitoring Hadoop jobs.
Edit the mapred-site.xml file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following content:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>localhost:10020</value>
  </property>
</configuration>
Save and exit
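The job history server configured above is not started by the start-dfs.sh or start-yarn.sh scripts used in Step 9. Once the cluster is running, it can be started separately with the mapred command shipped with Hadoop 3.x:
mapred --daemon start historyserver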
YARN Configuration:
The yarn-site.xml file sets up YARN, the resource management layer of Hadoop. The
yarn.resourcemanager.hostname property is set to localhost, defining where the
ResourceManager will run. The yarn.nodemanager.aux-services property is set to mapreduce_shuffle,
enabling data shuffling for MapReduce jobs. These settings ensure that YARN efficiently
schedules and executes tasks.
Edit the yarn-site.xml file:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Add the following content:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Save and exit
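As an optional sanity check that Hadoop is reading the edited files from HADOOP_CONF_DIR, individual keys can be queried once the environment variables from Step 5 are loaded:
hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication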
Step 8: Format the NameNode
Before starting Hadoop for the first time, the NameNode must be formatted using hdfs
namenode -format. This command initializes the HDFS metadata and clears any previous
data. Without formatting, the system might face inconsistencies, preventing Hadoop from
functioning correctly. This step is only required for the first setup.
Before starting Hadoop, format the HDFS Namenode:
hdfs namenode -format
Step 9: Start Hadoop Services
To launch Hadoop, we run start-dfs.sh to start HDFS services (NameNode and DataNode)
and start-yarn.sh to start YARN (ResourceManager and NodeManager). These scripts
initialize the distributed storage and resource management layers of Hadoop. Running
them ensures that the cluster is up and ready for processing tasks.
Start the HDFS services, followed by YARN:
start-dfs.sh
start-yarn.sh
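To shut the cluster down later, the matching stop scripts from the same sbin directory reverse these commands:
stop-yarn.sh
stop-dfs.sh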
Step 10: Verify Running Services
After setting up Hadoop, we use the jps command to list all running Java processes. This
helps verify if essential Hadoop daemons like NameNode, DataNode, ResourceManager,
and NodeManager are running properly. If any service is missing, troubleshooting is
needed before proceeding.
Check if Hadoop processes are running:
jps
Expected output (each entry is preceded by a process ID, and a Jps entry also appears):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
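As an optional end-to-end test once all daemons are listed, a small HDFS round trip confirms that the NameNode and DataNode are working together (the file and directory names here are arbitrary examples):
hdfs dfs -mkdir -p /user/$USER
echo "hello hadoop" > test.txt
hdfs dfs -put test.txt /user/$USER/
hdfs dfs -ls /user/$USER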
Step 11: Hadoop Web Interfaces
You can access the following Hadoop web UIs:
Service             | URL                    | Description
NameNode UI         | http://localhost:9870/ | Shows HDFS file system status.
ResourceManager UI  | http://localhost:8088/ | Monitors running applications in YARN.
DataNode UI         | http://localhost:9864/ | Displays DataNode status.
NodeManager UI      | http://localhost:8042/ | Shows NodeManager details.
Hadoop provides web interfaces for real-time monitoring:
NameNode UI (http://localhost:9870/): Shows HDFS status, including storage
capacity and active nodes.
ResourceManager UI (http://localhost:8088/): Displays running and completed
YARN applications.
DataNode UI (http://localhost:9864/): Monitors individual DataNode health.
NodeManager UI (http://localhost:8042/): Shows the status of compute nodes.
These web UIs are useful for troubleshooting and observing cluster activity.
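As a final optional check, the example job bundled with Hadoop can be submitted to YARN and then observed in the ResourceManager UI (assuming the examples jar version matches the installed 3.3.6 release; depending on the environment, additional MapReduce classpath settings may be required for the job to complete):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar pi 2 5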