
Experiment 1

Study of Hadoop ecosystem

Apache Hadoop is an open-source framework intended to make working with big data easier. For those who are not acquainted with this technology, the first question is: what is big data? Big data is a term given to data sets that cannot be processed efficiently with traditional methodologies such as RDBMS. Hadoop has made its place in industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

The Hadoop ecosystem is a platform, or suite, which provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools and solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.

Here are some key components of the Hadoop ecosystem:

1. Hadoop Distributed File System (HDFS): This is the storage component of Hadoop, designed to store large files across multiple machines in a distributed manner.

2. MapReduce: It's a programming model and processing engine used to process vast amounts of data in parallel across a distributed cluster.

3. YARN (Yet Another Resource Negotiator): This is the resource management layer of Hadoop that manages resources and schedules tasks across the cluster.

4. Hadoop Common: This includes libraries and utilities needed by other modules within the Hadoop ecosystem.

5. Hive: A data warehouse infrastructure built on top of Hadoop for querying and analyzing large datasets stored in HDFS using a SQL-like language called HiveQL.

6. Pig: Another high-level platform for analyzing large datasets. It provides a simple language called Pig Latin, which is used to perform data manipulation tasks.

7. HBase: It's a NoSQL database that runs on top of Hadoop and provides real-time read/write access to large datasets.

8. Spark: While not part of the core Hadoop ecosystem, Spark works seamlessly with Hadoop and provides a faster and more general-purpose alternative to MapReduce for data processing.

9. Mahout: It's a machine learning library built on top of Hadoop for scalable and distributed machine learning algorithms.

10. ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and group services.

11. Sqoop: A tool designed for efficiently transferring bulk data between Hadoop and structured data stores like relational databases.

12. Flume and Kafka: These are used for streaming data into Hadoop from various sources.

The Hadoop ecosystem is continuously evolving, with new tools and technologies
being developed to address different aspects of big data processing, storage,
and analysis. Its flexibility and scalability make it a popular choice for organizations
dealing with large volumes of data.
Experiment 2
Basic HDFS Commands

a. ls: This command is used to list all the files. Use lsr for a recursive listing; it is useful when we want the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains the executables, so bin/hdfs means we want the hdfs executable, in particular its dfs (Distributed File System) commands.

b. mkdir: To create a directory. In Hadoop dfs there is no home directory by default, so let's first create it.
Syntax:
bin/hdfs dfs -mkdir <folder name>

Creating the home directory:

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> use the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be created relative to the home directory.

c. touchz: It creates an empty file.


Syntax:
bin/hdfs dfs -touchz <file_path>
Example:
bin/hdfs dfs -touchz /geeks/myfile.txt

d. copyFromLocal (or) put: To copy files/folders from local file system to hdfs store. This
is the most important command. Local filesystem means the files present on the OS.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example: Let’s suppose we have a file AI.txt on Desktop which we want to copy to
folder geeks present on hdfs.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks

(OR)

bin/hdfs dfs -put ../Desktop/AI.txt /geeks

e. cat: To print file contents.


Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside geeks folder.
bin/hdfs dfs -cat /geeks/AI.txt

f. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero

(OR)

bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero


myfile.txt from geeks folder will be copied to folder hero present on Desktop.
Note: Observe that we don’t write bin/hdfs while checking the things present on local
filesystem.

g. moveFromLocal: This command will move a file from the local file system to HDFS.


Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks

h. cp: This command is used to copy files within HDFS. Let's copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied

i. mv: This command is used to move files within HDFS. Let's cut-paste the file myfile.txt from the geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
j. rmr: This command deletes a file from HDFS recursively. It is a very useful command
when you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the
directory then the directory itself.

k. du: It will give the size of each file in the directory.


Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks

l. dus: This command will give the total size of a directory/file.


Syntax:
bin/hdfs dfs -dus <dirName>
Example:
bin/hdfs dfs -dus /geeks
m. stat: It will give the last modified time of a directory or file; in short, it gives the stats of the directory or file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geeks

n. setrep: This command is used to change the replication factor of a file/directory in HDFS. By default it is 3 for anything stored in HDFS (as set by dfs.replication in hdfs-site.xml).

Example 1: To change the replication factor to 6 for geeks.txt stored in HDFS.

bin/hdfs dfs -setrep -R -w 6 geeks.txt

Example 2: To change the replication factor to 4 for the directory /geeks stored in HDFS.

bin/hdfs dfs -setrep -R 4 /geeks


Note: -w means wait until the replication is completed, and -R means recursive; we use it for directories since they may contain many files and folders inside them.

Note: There are more commands in HDFS but we discussed the commands which are
commonly used when working with Hadoop. You can check out the list of dfs commands
using the following command: bin/hdfs dfs
Experiment 3
Hadoop filesystem navigation and manipulation using commands

Hadoop FS Command Line


The Hadoop FS command line is a simple way to access and interface with HDFS. Below
are some basic HDFS commands in Linux, including operations like creating directories,
moving files, deleting files, reading files, and listing directories.

To use HDFS commands, start the Hadoop services using the following command:

sbin/start-all.sh

To check if Hadoop is up and running:

jps

The sections below cover several basic HDFS commands; a full list of file system commands can be displayed with the -help option (hadoop fs -help).

mkdir:
To create a directory, similar to the Unix mkdir command.

Options:
-p : Do not fail if the directory already exists

$ hadoop fs -mkdir [-p] <paths>
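For example, to create a nested directory in one step (the path is illustrative):

$ hadoop fs -mkdir -p /user/data/input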

ls:
List directories and files present under a specific directory in HDFS, similar to the Unix ls command. The -R option (or the older lsr command) can be used for recursive listing of directories and files.

Options:
-d : List the directories as plain files
-h : Format the sizes of files in a human-readable manner instead of a number of bytes
-R : Recursively list the contents of directories

$ hadoop fs -ls [-d] [-h] [-R] [<path> ...]
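For example, to recursively list a directory with human-readable sizes (the path is illustrative):

$ hadoop fs -ls -h -R /user/data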

copyFromLocal:
Copy files from the local file system to HDFS, similar to the -put command. This command will not work if the file already exists. To overwrite the destination if the file already exists, add the -f flag to the command.

Options:
-p : Preserves access and modification time, ownership and the mode
-f : Overwrites the destination

$ hadoop fs -copyFromLocal [-f] [-p] <localsrc> ... <dst>
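For example, to copy a local file into an HDFS directory, overwriting any existing copy (the file and directory names are illustrative):

$ hadoop fs -copyFromLocal -f sample1.txt /user/data/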

copyToLocal:
Copy files from HDFS to local file system, similar to -get command.

$ hadoop fs -copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>
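For example, to copy an HDFS file into the current local directory (the path is illustrative):

$ hadoop fs -copyToLocal /user/data/sampletext.txt .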

cat:
Display contents of a file, similar to Unix cat command.

$ hadoop fs -cat /user/data/sampletext.txt

cp:
Copy files from one directory to another within HDFS, similar to Unix cp command.

$ hadoop fs -cp /user/data/sample1.txt /user/hadoop1


$ hadoop fs -cp /user/data/sample2.txt /user/test/in1

mv:
Move files from one directory to another within HDFS, similar to Unix mv command.

$ hadoop fs -mv /user/hadoop/sample1.txt /user/text/

rm:
Remove a file from HDFS, similar to the Unix rm command. This command does not delete directories. For a recursive delete, use the -rm -r command.

Options:
-r : Recursively remove directories and files
-skipTrash : Bypass the trash and immediately delete the source
-f : Do not report an error if the file does not exist
-R : Same as -r; recursively delete directories and files

$ hadoop fs -rm /user/hadoop/sample1.txt
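To delete a non-empty directory recursively (the directory path is illustrative):

$ hadoop fs -rm -r /user/hadoop/old_output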

getmerge:
Merge the files in one directory on HDFS into a single file on the local file system. This is one of the most important and useful commands when trying to read the contents of a MapReduce or Pig job's output files.

$ hadoop fs -getmerge <src> <localdst>
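For example, to merge all files under an HDFS directory into one local file (the local file name is illustrative):

$ hadoop fs -getmerge /user/data merged_output.txt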

setrep:
Change the replication factor of a file to a specific value instead of the default replication factor used for the rest of HDFS. If the path is a directory, the command recursively changes the replication factor of all the files under the directory tree.
Options:
-w : Request that the command wait for the replication to be completed (potentially takes a long time)
-R : Accepted for backwards compatibility; it has no effect

$ hadoop fs -setrep [-R] [-w] <numReplicas> <path>
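For example, to set the replication factor of a file to 3 and wait for re-replication to complete (the path is illustrative):

$ hadoop fs -setrep -w 3 /user/data/sampletext.txt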

touchz:
Creates an empty file in HDFS.

$ hadoop fs -touchz URI
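For example, to create an empty file in an HDFS directory (the path is illustrative):

$ hadoop fs -touchz /user/data/empty.txt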

test:
Test whether an HDFS path exists, whether it is an empty (zero-length) file, or whether it is a directory. The command returns 0 if the test succeeds.

Options:
-d : Return 0 if the path is a directory
-e : Return 0 if the path exists
-f : Return 0 if the path is a file
-s : Return 0 if the path is not empty
-z : Return 0 if the file is zero bytes in length

$ hadoop fs -test -[defsz] URI
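For example, to check whether a file exists and then inspect the exit code (the path is illustrative):

$ hadoop fs -test -e /user/data/sampletext.txt
$ echo $?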

appendToFile:
Appends the contents of all given local files to the provided destination file on HDFS. The
destination file will be created if it doesn’t already exist.

$ hadoop fs -appendToFile <localsrc> ... <dst>
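For example, to append a local file to an existing HDFS file (the file names are illustrative):

$ hadoop fs -appendToFile notes.txt /user/data/sampletext.txt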

chmod:
Change the permission of a file, similar to Linux shell’s command but with a few exceptions.

<MODE> : Same as the mode used for the shell's chmod command; the only letters recognized are 'rwxXt'.

<OCTALMODE> : The mode specified in 3 or 4 digits. Unlike the shell command, it is not possible to specify only part of the mode.

Options:
-R : Modify the files recursively

$ hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH
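For example, to recursively give the owner full permissions and everyone else read/execute permission (the path is illustrative):

$ hadoop fs -chmod -R 755 /user/data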

chown:
Change owner and group of a file, similar to Linux shell’s command but with a few
exceptions.

Options:
-R : Modify the files recursively
$ hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH
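For example, to recursively change the owner and group of a directory (the owner, group, and path are illustrative):

$ hadoop fs -chown -R hadoop:hadoop /user/data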

df:
Show the capacity (free and used space) of the file system. The status of the root partitions is provided if the file system has multiple partitions and no path is specified.

Options:
-h : Format the sizes of files to a human-readable manner instead of number of bytes

$ hadoop fs -df [-h] [<path> ...]
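For example, to show the capacity of the whole file system in a human-readable format:

$ hadoop fs -df -h /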

du:
Show the size of each file in the directory.

Options:
-s : Show the total (summary) size
-h : Format the sizes of files in a human-readable manner instead of a number of bytes

$ hadoop fs -du [-s] [-h] [<path> ...]
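For example, to show the total size of a directory in a human-readable format (the path is illustrative):

$ hadoop fs -du -s -h /user/data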

tail:
Show the last 1KB of the file.

Options:
-f : Show appended data as the file grows

$ hadoop fs -tail [-f] <file>
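For example, to print the last kilobyte of a file (the path is illustrative):

$ hadoop fs -tail /user/data/sampletext.txt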


Experiment 4
Implement the following file management tasks in Hadoop

In Hadoop Distributed File System (HDFS), you can perform various file management tasks
using command-line interfaces or programming languages like Java. Let's cover how you
can achieve these tasks:

A) Adding Files and Directories:

1. Adding Files to HDFS:


Use the hdfs dfs -put command to add files from your local file system to HDFS:

hdfs dfs -put localfile.txt /user/hadoop/destination_directory/

Replace localfile.txt with the local file path and /user/hadoop/destination_directory/ with the
HDFS directory where you want to place the file.

2. Creating Directories in HDFS:


Use the hdfs dfs -mkdir command to create directories in HDFS:

hdfs dfs -mkdir /user/hadoop/new_directory

This will create a new directory named new_directory under the /user/hadoop/ directory.

B) Retrieving Files:
Use the hdfs dfs -get command to retrieve files from HDFS to your local file system:

hdfs dfs -get /user/hadoop/source_directory/file.txt localfile.txt
Replace /user/hadoop/source_directory/file.txt with the HDFS file path and localfile.txt with
the destination path in your local file system.

C) Deleting Files:
For deleting files or directories in HDFS, you can use the hdfs dfs -rm command for files or
hdfs dfs -rm -r for directories:

1. Delete File:

hdfs dfs -rm /user/hadoop/file_to_delete.txt

Replace /user/hadoop/file_to_delete.txt with the path of the file you want to delete.

2. Delete Directory and Its Contents Recursively:

hdfs dfs -rm -r /user/hadoop/directory_to_delete
Replace /user/hadoop/directory_to_delete with the directory path you want to delete along
with its contents.

Make sure to exercise caution while performing delete operations, especially for directories,
as the -r flag removes them recursively.

These commands can be executed in the terminal or command prompt when connected to a
machine with Hadoop installed and configured, and the appropriate permissions are granted
for file manipulation in HDFS. Adjust paths and filenames as per your specific HDFS
directory structure and file names.
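A minimal end-to-end session tying these tasks together might look like the following (all paths and file names are illustrative):

hdfs dfs -mkdir -p /user/hadoop/demo
hdfs dfs -put localfile.txt /user/hadoop/demo/
hdfs dfs -ls /user/hadoop/demo
hdfs dfs -get /user/hadoop/demo/localfile.txt copy_of_localfile.txt
hdfs dfs -rm -r /user/hadoop/demo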
Experiment 6
Process different datasets using Pig.

Pig is a powerful tool for processing various datasets using its data flow language, Pig Latin.
Let's consider a scenario where you have multiple datasets, and you want to perform
operations on them using Pig.

Sample Datasets:
Let's say you have two datasets:

Users Dataset (users.csv):


user_id,name,age,gender
1,Alice,28,Female
2,Bob,35,Male
3,Charlie,22,Male
4,Diana,30,Female

Transactions Dataset (transactions.csv):


user_id,product,amount
1,Apple,10
2,Orange,15
3,Banana,8
1,Grapes,12
4,Apple,11

Pig Script to Process Datasets:


Here's a Pig script (data_processing.pig) that joins these datasets based on the user_id and
calculates the total amount spent by each user:

-- Load the users data (this assumes the header row has been removed from each CSV file,
-- or that records with a null user_id are filtered out after loading)
users = LOAD 'users.csv' USING PigStorage(',') AS (user_id:int, name:chararray, age:int, gender:chararray);
-- Load the transactions data
transactions = LOAD 'transactions.csv' USING PigStorage(',') AS (user_id:int, product:chararray, amount:int);
-- Join the datasets on user_id
joined_data = JOIN users BY user_id, transactions BY user_id;
-- Group the joined data by user_id
grouped_data = GROUP joined_data BY users::user_id;
-- Calculate the total amount spent by each user
total_spent = FOREACH grouped_data GENERATE group AS user_id, SUM(joined_data.transactions::amount) AS total_amount_spent;
-- Store the results back to HDFS
STORE total_spent INTO 'output/total_spent_by_user' USING PigStorage(',');
Running the Pig Script:

To execute this Pig script, save it as data_processing.pig and use the following command in
the terminal or command prompt:

pig data_processing.pig
This script will join the two datasets based on the user_id, calculate the total amount spent
by each user, and store the results in the HDFS directory output/total_spent_by_user.

Adjust file paths, delimiter, and column names in the script according to your actual dataset
structure and locations in HDFS.
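If you want to test the script without a running Hadoop cluster, Pig can also be run in local mode, which reads and writes the local file system instead of HDFS (this assumes users.csv and transactions.csv are in the current working directory):

pig -x local data_processing.pig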
