ASSIGNMENT
Hands-on with HDFS
Task: Install Hadoop in pseudo-distributed mode or use an online simulator.
Upload and retrieve a sample file using HDFS commands.
Deliverable: Screenshots of steps + command list.
Evaluation Criteria: Execution, clarity of explanation.
Steps to Install Hadoop in Pseudo-Distributed Mode (Conceptual with
Command Examples):
1. Install Java: Hadoop requires Java (Hadoop 3.3.x runs on Java 8 or 11).
Let's assume you have it installed. You can check with:
```bash
java -version
```
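If the command fails, you can install OpenJDK 11 first. A minimal sketch for a Debian/Ubuntu system (assumed here; package names differ on other distros):
```bash
# Install OpenJDK 11 (Debian/Ubuntu package name)
sudo apt-get update
sudo apt-get install -y openjdk-11-jdk
```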
2. Download and Extract Hadoop:
Let's say you've downloaded the Hadoop binary release (e.g.,
`hadoop-3.3.6.tar.gz`) to your home directory.
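If you don't have the tarball yet, one way to fetch it is from the Apache archive (the URL below follows the archive's standard layout; adjust the version as needed):
```bash
# Download the Hadoop 3.3.6 binary release from the Apache archive
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
```
Then extract it: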
```bash
tar -xzvf hadoop-3.3.6.tar.gz
cd hadoop-3.3.6
```
3. Set Environment Variables: You'll need to configure your `~/.bashrc`
or `~/.zshrc` file. Add the following lines (adjust the path if your Hadoop
directory is different):
```bash
export HADOOP_HOME=/home/$USER/hadoop-3.3.6
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```
Then, apply the changes:
```bash
source ~/.bashrc
```
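To confirm the variables took effect, check that the `hadoop` command is now on your PATH:
```bash
# Prints the Hadoop version banner if HADOOP_HOME and PATH are set correctly
hadoop version
```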
4. Edit the Hadoop Configuration Files: Navigate to the `etc/hadoop`
directory within your Hadoop installation. You'll need to edit a few key
files:
`hadoop-env.sh`: Set the `JAVA_HOME` variable.
```bash
nano etc/hadoop/hadoop-env.sh
```
Add or uncomment a line similar to the following (adjust to your actual Java path):
```bash
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```
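If you're not sure where your JVM lives, one way to find the path (assuming `java` is on your PATH) is:
```bash
# Follow symlinks to the real java binary; JAVA_HOME is this path
# with the trailing /bin/java removed
readlink -f "$(which java)"
```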
`core-site.xml`: Configure the default HDFS NameNode address.
```bash
nano etc/hadoop/core-site.xml
```
Add the following within the `<configuration>` tags:
```xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
```
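After saving, you can sanity-check that Hadoop picks the value up (this reads the config files directly, so the daemons don't need to be running yet):
```bash
# Should print hdfs://localhost:9000
hdfs getconf -confKey fs.defaultFS
```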
`hdfs-site.xml`: Configure the replication factor and the NameNode/DataNode
storage directories. (Note: `/tmp` is wiped on reboot, so pick a persistent
path for anything beyond a quick test.)
```bash
nano etc/hadoop/hdfs-site.xml
```
Add the following within the `<configuration>` tags:
```xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/tmp/hadoop-data</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/tmp/hadoop-name</value>
</property>
```
`mapred-site.xml`: Configure the MapReduce execution framework. In Hadoop
3.x this file already exists; on older 2.x releases you may need to copy the
template first:
```bash
# The cp is only needed if mapred-site.xml does not already exist
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
nano etc/hadoop/mapred-site.xml
```
Add the following within the `<configuration>` tags:
```xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
```
`yarn-site.xml`: Configure YARN's auxiliary services and the ResourceManager
hostname.
```bash
nano etc/hadoop/yarn-site.xml
```
Add the following within the `<configuration>` tags:
```xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
```
5. Format the NameNode: This initializes the HDFS file system. Run it only
once on a fresh installation (re-formatting wipes HDFS metadata):
```bash
hdfs namenode -format
```
6. Start Hadoop Services:
```bash
start-dfs.sh
start-yarn.sh
```
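To verify that all the daemons came up, `jps` (bundled with the JDK) lists the running Java processes:
```bash
# Expect NameNode, DataNode, SecondaryNameNode,
# ResourceManager, and NodeManager in the output
jps
```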
7. Access Hadoop Web UIs (optional but useful for monitoring):
NameNode: `http://localhost:9870` (Hadoop 3.x; older 2.x versions use
`http://localhost:50070`)
ResourceManager: `http://localhost:8088`
Upload and Retrieve a Sample File Using HDFS Commands:
First, create a sample file on your local filesystem:
```bash
echo "This is a sample file for Hadoop HDFS." > sample.txt
```
Now, let's use HDFS commands:
1. Create a directory in HDFS (optional but good practice):
```bash
hdfs dfs -mkdir /user/$USER/input
```
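If the parent directories don't exist yet (typical right after formatting), add `-p` to create them along the way:
```bash
# -p creates /user and /user/$USER as needed, like mkdir -p locally
hdfs dfs -mkdir -p /user/$USER/input
```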
2. Upload the local file to HDFS:
```bash
hdfs dfs -put sample.txt /user/$USER/input/
```
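`-copyFromLocal` is an equivalent alternative that makes the intent explicit (it errors if the source isn't a local file):
```bash
hdfs dfs -copyFromLocal sample.txt /user/$USER/input/
```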
3. List the files in the HDFS directory:
```bash
hdfs dfs -ls /user/$USER/input/
```
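You can also print the file's contents straight from HDFS without downloading it:
```bash
hdfs dfs -cat /user/$USER/input/sample.txt
```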
4. Retrieve the file from HDFS to your local machine:
```bash
hdfs dfs -get /user/$USER/input/sample.txt retrieved_sample.txt
```
5. Verify the contents of the retrieved file:
```bash
cat retrieved_sample.txt
```
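As a final sanity check, confirm the round trip was lossless:
```bash
# No output means the uploaded and retrieved files are identical
diff sample.txt retrieved_sample.txt
```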
6. Stop the Hadoop services when you're finished:
```bash
stop-yarn.sh
stop-dfs.sh
```