Experiment 1
Hadoop
Apache Hadoop is an open-source framework intended to make working with big
data easier. For those who are not acquainted with this technology, the first
question that arises is: what is big data? Big data is a term for data sets that
cannot be processed efficiently with traditional tools such as an RDBMS.
Hadoop has made its place in industries and companies that need to work on
large, sensitive data sets that require efficient handling. Hadoop is a framework
that enables processing of large data sets distributed across clusters of machines.
Being a framework, Hadoop is made up of several modules that are supported by a
large ecosystem of technologies.
1. Hadoop Distributed File System (HDFS): This is the storage component of Hadoop,
designed to store large files across multiple machines in a distributed manner.
The Hadoop ecosystem is continuously evolving, with new tools and technologies
being developed to address different aspects of big data processing, storage,
and analysis. Its flexibility and scalability make it a popular choice for organizations
dealing with large volumes of data.
Experiment 2
Basic HDFS Commands
a. ls: This command is used to list all the files in a directory. Use lsr for a recursive
listing; it is useful when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains executables,
so bin/hdfs means we want the hdfs executable, specifically its dfs (Distributed File
System) commands.
b. copyFromLocal (or) put: Copies files/folders from the local file system to the HDFS store.
This is one of the most important commands. Local file system here means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
(OR)
bin/hdfs dfs -put ../Desktop/AI.txt /geeks
c. copyToLocal (or) get: Copies files/folders from the HDFS store to the local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks ../Desktop/hero
d. cp: This command is used to copy files within HDFS. Let's copy the folder geeks to
geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
e. mv: This command is used to move files within HDFS. Let's cut-paste the file myfile.txt
from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
f. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful
command when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied
This will delete all the content inside the directory and then the directory itself.
Note: There are more commands in HDFS but we discussed the commands which are
commonly used when working with Hadoop. You can check out the list of dfs commands
using the following command: bin/hdfs dfs
Experiment 3
Hadoop filesystem navigation and manipulation using commands
To use HDFS commands, start the Hadoop services with the following command, and then
verify that the daemons are running with jps:
sbin/start-all.sh
jps
The sections below cover several basic HDFS commands; a list of further file system
commands can be displayed by running a command with the -help option.
mkdir:
To create a directory in HDFS, similar to the Unix mkdir command.
Options:
-p : Do not fail if the directory already exists
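A quick example (the path /user/hadoop/demo is only illustrative; use a directory appropriate for your cluster):
$ hadoop fs -mkdir -p /user/hadoop/demo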
ls:
List directories present under a specific directory in HDFS, similar to the Unix ls command.
The -lsr command can be used for a recursive listing of directories and files.
Options:
-d : List the directories as plain files
-h : Format the sizes of files to a human-readable manner instead of number of bytes
-R : Recursively list the contents of directories
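For instance, to list everything under the (illustrative) /user/hadoop directory recursively with human-readable sizes:
$ hadoop fs -ls -h -R /user/hadoop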
copyFromLocal:
Copy files from the local file system to HDFS, similar to the -put command. This command will
fail if the file already exists at the destination. To overwrite the destination if the file already
exists, add the -f flag to the command.
Options:
-f : Overwrite the destination if it already exists
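For example, assuming a local file data.txt and the illustrative HDFS directory /user/hadoop/demo from above:
$ hadoop fs -copyFromLocal -f data.txt /user/hadoop/demo/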
copyToLocal:
Copy files from HDFS to the local file system, similar to the -get command.
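For example (paths are illustrative):
$ hadoop fs -copyToLocal /user/hadoop/demo/data.txt ./data_local.txt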
cat:
Display contents of a file, similar to Unix cat command.
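For example, displaying the illustrative file copied earlier:
$ hadoop fs -cat /user/hadoop/demo/data.txt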
cp:
Copy files from one directory to another within HDFS, similar to Unix cp command.
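For example, copying the illustrative file into a backup directory:
$ hadoop fs -cp /user/hadoop/demo/data.txt /user/hadoop/backup/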
mv:
Move files from one directory to another within HDFS, similar to Unix mv command.
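For example (the archive directory is only illustrative):
$ hadoop fs -mv /user/hadoop/demo/data.txt /user/hadoop/archive/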
rm:
Remove a file from HDFS, similar to the Unix rm command. This command does not delete
directories. For a recursive delete, use -rm -r.
Options:
-r : Recursively remove directories and files
-skipTrash : Bypass the trash and immediately delete the file
-f : Do not report an error if the file does not exist
-R : Same as -r; recursively delete directories
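For example, removing the illustrative backup directory recursively while bypassing the trash:
$ hadoop fs -rm -r -skipTrash /user/hadoop/backup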
getmerge:
Merge a list of files in one directory on HDFS into a single file on the local file system. This is
one of the most useful commands when trying to read the output files of a MapReduce or
Pig job.
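For example, assuming a job wrote its output to the illustrative directory /user/hadoop/output:
$ hadoop fs -getmerge /user/hadoop/output ./merged_output.txt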
setrep:
Change the replication factor of a file to a specific value instead of the default replication
factor used for the rest of HDFS. If the path is a directory, the command recursively changes
the replication factor of all files in the directory tree rooted at the given path.
Options:
-w : Request that the command wait for the replication to be completed (potentially takes a long
time)
-r : Accepted for backwards compatibility; has no effect
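For example, setting the replication factor of the illustrative file to 2 and waiting for replication to finish:
$ hadoop fs -setrep -w 2 /user/hadoop/demo/data.txt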
touchz:
Creates an empty file in HDFS.
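For example:
$ hadoop fs -touchz /user/hadoop/demo/empty.txt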
test:
Test whether an HDFS path exists, whether it is a directory, or whether it is an empty file.
Options:
-e : Return 0 if the path exists
-d : Return 0 if the path is a directory
-f : Return 0 if the path is a regular file
-z : Return 0 if the file is zero length
-s : Return 0 if the path is not empty
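For example, checking whether the illustrative file exists (the command reports its result through the exit code):
$ hadoop fs -test -e /user/hadoop/demo/data.txt
$ echo $?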
appendToFile:
Appends the contents of all given local files to the provided destination file on HDFS. The
destination file will be created if it doesn’t already exist.
$ hadoop fs -appendToFile <localsrc> ... <dst>
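For example, appending two local files to an HDFS file (all file names are illustrative):
$ hadoop fs -appendToFile local1.txt local2.txt /user/hadoop/demo/data.txt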
chmod:
Change the permissions of a file, similar to the Linux shell's chmod command but with a few exceptions.
<MODE> : Same as the mode used for the shell's command; the only letters recognized are
'rwxXt'.
<OCTALMODE> : Mode specified in 3 or 4 digits. Unlike the shell command, it is not possible
to specify only part of the mode.
Options:
-R : Modify the files recursively
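For example, recursively granting rwx to the owner and r-x to everyone else on the illustrative directory:
$ hadoop fs -chmod -R 755 /user/hadoop/demo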
chown:
Change the owner and group of a file, similar to the Linux shell's chown command but with a
few exceptions.
Options:
-R : Modify the files recursively
$ hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH
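For example, assuming a user and group named hadoop exist on the cluster:
$ hadoop fs -chown -R hadoop:hadoop /user/hadoop/demo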
df:
Show the capacity (free and used space) of the file system. If the file system has multiple
partitions and no path is specified, the status of the root partitions is provided.
Options:
-h : Format the sizes of files to a human-readable manner instead of number of bytes
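For example:
$ hadoop fs -df -h /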
du:
Show size of each file in the directory.
Options:
-s : Show total summary size
-h : Format the sizes of files to a human-readable manner instead of number of bytes
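For example, summarizing the total size of the illustrative home directory:
$ hadoop fs -du -s -h /user/hadoop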
tail:
Show the last 1KB of the file.
Options:
-f : Show appended data as the file grows
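For example:
$ hadoop fs -tail /user/hadoop/demo/data.txt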
In the Hadoop Distributed File System (HDFS), you can perform various file management tasks
using command-line interfaces or programming languages like Java. Let's cover how you
can achieve these tasks:
A) Adding Files and Directories:
1. Add File:
Use the hdfs dfs -put command to copy a file from your local file system into HDFS:
bash
hdfs dfs -put localfile.txt /user/hadoop/destination_directory/
Replace localfile.txt with the local file path and /user/hadoop/destination_directory/ with the
HDFS directory where you want to place the file.
2. Create Directory:
Use the hdfs dfs -mkdir command to create a new directory in HDFS:
bash
hdfs dfs -mkdir /user/hadoop/new_directory
This will create a new directory named new_directory under the /user/hadoop/ directory.
B) Retrieving Files:
Use the hdfs dfs -get command to retrieve files from HDFS to your local file system:
bash
hdfs dfs -get /user/hadoop/source_directory/file.txt localfile.txt
Replace /user/hadoop/source_directory/file.txt with the HDFS file path and localfile.txt with
the destination path in your local file system.
C) Deleting Files:
For deleting files or directories in HDFS, you can use the hdfs dfs -rm command for files or
hdfs dfs -rm -r for directories:
1. Delete File:
bash
hdfs dfs -rm /user/hadoop/file_to_delete.txt
Replace /user/hadoop/file_to_delete.txt with the path of the file you want to delete.
2. Delete Directory:
bash
hdfs dfs -rm -r /user/hadoop/directory_to_delete
Replace /user/hadoop/directory_to_delete with the directory path you want to delete along
with its contents.
Make sure to exercise caution while performing delete operations, especially for directories,
as the -r flag removes them recursively.
These commands can be executed in the terminal or command prompt when connected to a
machine with Hadoop installed and configured, and the appropriate permissions are granted
for file manipulation in HDFS. Adjust paths and filenames as per your specific HDFS
directory structure and file names.
Experiment 6
Process different datasets using pig.
Pig is a powerful tool for processing various datasets using its data flow language, Pig Latin.
Let's consider a scenario where you have multiple datasets, and you want to perform
operations on them using Pig.
Sample Datasets:
Let's say you have two datasets, users.csv and transactions.csv:
pig
-- Load the users data
users = LOAD 'users.csv' USING PigStorage(',') AS (user_id:int, name:chararray, age:int,
gender:chararray);
-- Load the transactions data
transactions = LOAD 'transactions.csv' USING PigStorage(',') AS (user_id:int,
product:chararray, amount:int);
-- Join the datasets on user_id
joined_data = JOIN users BY user_id, transactions BY user_id;
-- Group the joined data by user_id and calculate the total amount spent by each user
grouped = GROUP joined_data BY users::user_id;
total_spent = FOREACH grouped GENERATE
    group AS user_id,
    SUM(joined_data.transactions::amount) AS total_amount_spent;
-- Store the results back to HDFS
STORE total_spent INTO 'output/total_spent_by_user' USING PigStorage(',');
Running the Pig Script:
To execute this Pig script, save it as data_processing.pig and use the following command in
the terminal or command prompt:
bash
pig data_processing.pig
This script will join the two datasets based on the user_id, calculate the total amount spent
by each user, and store the results in the HDFS directory output/total_spent_by_user.
Adjust file paths, delimiter, and column names in the script according to your actual dataset
structure and locations in HDFS.
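Once the script completes, you can inspect the result directly in HDFS; a part-* glob is used here because the exact part file names depend on how the job ran:
bash
hdfs dfs -cat 'output/total_spent_by_user/part-*'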