Sub: Big Data Analytics (BCS714D)

MODULE – II
Introduction to Hadoop

Prepared by

Dr Bindiya MK, Professor, Dept of CSE
Chetana K.N, [Link], Dept of CSE


Module-II
Hadoop Distributed File System
Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is the storage component of Hadoop, designed to
handle large datasets efficiently. It is inspired by the Google File System (GFS) and is
optimized for high-throughput operations.

Key Features of HDFS

1. Distributed Storage – HDFS spreads data across multiple machines, ensuring high availability.

2. Large Block Size – Instead of small file chunks, HDFS uses large blocks (default: 64 MB or 128 MB) to minimize disk seek time.

3. Fault Tolerance – Data is replicated across different nodes to prevent data loss.

4. Data Locality – Processing happens where the data is stored to improve efficiency.

5. Compatible with Various OS File Systems – It runs on ext3, ext4, or other native file systems.

HDFS Storage Example

If a file named [Link] is 192MB in size and the default block size is 64MB, HDFS will
divide it into three blocks and distribute them across different nodes. Each block is replicated
based on the default replication factor (3).
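
A quick way to see this block layout in practice is through the HDFS Java API. The following is a minimal sketch (the class name and the path /sample/data are invented for illustration) that lists the blocks of a file and the DataNodes holding each replica:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // handle to the cluster's file system
        Path file = new Path("/sample/data");            // hypothetical 192 MB file
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block; each lists the hosts storing a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block);                   // prints offset, length and hosts
        }
        fs.close();
    }
}

For a 192 MB file with a 64 MB block size, this loop would print three entries, each listing up to three hosts (one per replica).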

HDFS Components (Daemons)


1. Name Node (Master Node)

 Manages the file system namespace (metadata about files, directories, and block locations).

 Stores metadata in memory for fast access.

 Uses two files to keep track of data:

o FsImage – Stores the entire file system structure.

o Edit Log – Tracks changes like file creation, deletion, or renaming.

 Uses a rack-aware placement strategy to optimize performance and reliability.


2. Data Node (Worker Node)

 Stores the actual data blocks.

 Communicates with the NameNode through heartbeat messages (every 3 seconds) to
confirm that it is active.

 If a DataNode stops sending heartbeats, the NameNode automatically replicates the data to
another node.

3. Secondary Name Node (Checkpoint Node)

 Not a backup NameNode, but it helps by periodically saving the NameNode's metadata.

 Takes snapshots of the FsImage and Edit Log to prevent NameNode memory overload.

 If the NameNode fails, the Secondary NameNode can be used to manually restore the cluster.

How Data is Read from HDFS / Anatomy of a File Read Operation

Steps in Reading a File from HDFS

1. The client sends a read request to the NameNode.

2. The NameNode responds with the list of DataNodes where the blocks are stored.

3. The client reads from the nearest DataNode (for efficiency).

4. If a block is corrupted, the client reads from another replica.

5. After reading all blocks, the client assembles the complete file.


Example:
If you open a 500 MB video file, it will be read in 64 MB chunks from different DataNodes and merged
to play smoothly.
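
The same read path can be exercised from a client program. Below is a minimal sketch using the HDFS Java API (the class name and the path /sample/input.txt are chosen only for illustration); open() first consults the NameNode for block locations, and the returned stream then pulls blocks from the nearest DataNodes:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // open() asks the NameNode where the blocks live; reads go to the closest replica.
        FSDataInputStream in = fs.open(new Path("/sample/input.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);   // block boundaries are invisible to the client
        }
        reader.close();
        fs.close();
    }
}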


Anatomy of a File Write Operation / How Data is Written to HDFS

Steps in Writing a File to HDFS

1. The client requests file creation from the NameNode.


2. The NameNode checks whether the file already exists. If not, it allocates blocks.

3. Data is divided into packets and sent to Data Nodes in a pipeline.

4. The first DataNode stores the packet and forwards it to the next Data Node.


5. Each DataNode sends an acknowledgment back to confirm successful storage.

6. Once all blocks are stored, the client closes the connection.

Example:
If you upload a 2 GB dataset, HDFS will divide it into 64 MB blocks, replicate them, and store
them in different nodes.
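
Programmatically, the same write pipeline is triggered through FileSystem.create(). The sketch below is illustrative only (the output path and the text written are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode to allocate blocks; data then flows through the DataNode pipeline.
        FSDataOutputStream out = fs.create(new Path("/sample/output.txt"));
        out.writeBytes("hello hdfs\n");
        out.close();   // close() returns only after the pipeline acknowledgments arrive
        fs.close();
    }
}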

Replica Placement Strategy in HDFS

Default Replication Strategy (Replication Factor: 3)

1. First Replica – Stored on the same node as the client.

2. Second Replica – Stored on a different rack for redundancy.

3. Third Replica – Stored on the same rack as the second but on a different node.

This ensures high availability and fault tolerance while reducing network congestion.
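
The replication factor itself is configurable. As a small sketch (the property value and the file path are illustrative), it can be set in the client configuration for new files or changed for an existing file through the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);               // default factor for files created by this client
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an already stored file (hypothetical path).
        fs.setReplication(new Path("/sample/data"), (short) 3);
        fs.close();
    }
}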
Common HDFS Commands

Command                                        Action
hadoop fs -ls /                                Lists all directories and files at the root of HDFS.
hadoop fs -mkdir /sample                       Creates a directory named sample in HDFS.
hadoop fs -put [Link] /sample/                 Copies a file from local storage to HDFS.
hadoop fs -get /sample/[Link] /                Retrieves a file from HDFS to the local system.
hadoop fs -copyFromLocal [Link] /sample/       Copies a file from local to HDFS.
hadoop fs -copyToLocal /sample/[Link] [Link]   Copies a file from HDFS to local.
hadoop fs -cat /sample/[Link]                  Displays the content of an HDFS file.
hadoop fs -rm -r /sample/                      Deletes a directory from HDFS.

Special Features of HDFS

1. Data Replication – Ensures redundancy by storing multiple copies of data.

2. Data Pipeline – Efficient writing of data using a pipeline mechanism.

3. Fault Tolerance – Automatic data recovery in case of failure.

4. Scalability – Easily handles petabytes of data by adding new nodes.

Processing Data with Hadoop

MapReduce Daemons

MapReduce operates using two key daemons: JobTracker and TaskTracker. These components
work together to manage and execute tasks in a distributed environment.

JobTracker

The JobTracker is the central daemon that coordinates the execution of a MapReduce job. It
functions as the master node in a Hadoop cluster and is responsible for assigning tasks to
various nodes in the system. When a user submits a job, the JobTracker first creates an
execution plan by deciding how to split the input data and distribute tasks among available
TaskTrackers.

It continuously monitors the status of each task and takes corrective measures if a failure
occurs. For instance, if a TaskTracker stops responding, the JobTracker assumes that the task
has failed and reassigns it to another TaskTracker. In a Hadoop cluster, there is only one
JobTracker.

TaskTracker

The TaskTracker is a daemon that runs on each worker node in a Hadoop cluster. It is
responsible for executing the tasks assigned to it by the JobTracker. Each node in the
cluster has a single TaskTracker, which manages multiple JVM instances to process multiple
tasks simultaneously.

The TaskTracker continuously sends heartbeat signals to the JobTracker to indicate its health and
availability. If the JobTracker fails to receive a heartbeat from a TaskTracker within a certain
time frame, it assumes that the node has failed and reschedules the task on another
node. This mechanism helps in maintaining fault tolerance and
ensuring efficient use of cluster resources.

How Does MapReduce Work?


MapReduce follows a two-step process: map and reduce. This approach enables the parallel
processing of large datasets by dividing the workload across multiple nodes.


Workflow of MapReduce

MapReduce works by splitting a large dataset into multiple smaller chunks, which are
processed independently by different worker nodes. Each mapper operates on a subset of the
data and emits intermediate key-value pairs, which are then sent to the reducer. The reducer
aggregates and processes them to generate the final output.

Steps in MapReduce Execution

The execution of a MapReduce job follows these steps:

1. The input dataset is divided into multiple smaller pieces. These data chunks are
processed in parallel by different nodes.

2. A master process creates multiple worker processes and assigns them to different
nodes in the cluster.

3. Each mapper processes its assigned data chunk, applies the map function, and
generates key-value pairs. For example, in a word count program, the map function
converts a sentence into key-value pairs like (word, 1).

4. The partitioner function then distributes the mapped data into regions. It determines
which reducer should process a particular key-value pair.


5. Once all mappers complete their tasks, the master node instructs the reducers to
start processing. The reducers first retrieve, shuffle, and sort the mapped key-value
pairs based on the key.

6. The reducers then apply the reduce function, which aggregates values for each
unique key and writes the final result to a file.

7. Once all reducers complete their tasks, the system returns control to the user, and the
job is marked as complete.

By following this approach, Map Reduce enables efficient distributed computing, allowing
massive datasets to be processed quickly.

MapReduce Example: Word Count

A classic example of Map Reduce programming is counting the occurrences of words across
multiple files. This is achieved by using three main classes: the Driver class, the Mapper
class, and the Reducer class.


Implementation of Word Count in Java

In this example, we will write a Map Reduce program to count how many times each word
appears in a collection of text files.

Driver Class (WordCounter.java)

The Driver class is responsible for setting up the Map Reduce job configuration. It defines the
mapper and reducer classes, specifies the input and output paths, and submits the job for
execution.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCounter {

    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJobName("wordcounter");
        job.setJarByClass(WordCounter.class);
        job.setMapperClass(WordCounterMap.class);
        job.setReducerClass(WordCounterRed.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input file and output directory in HDFS
        FileInputFormat.addInputPath(job, new Path("/sample/[Link]"));
        FileOutputFormat.setOutputPath(job, new Path("/sample/wordcount"));

        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Mapper Class (WordCounterMap.java)

The Mapper class takes an input file and splits each line into individual words. For each
word encountered, it emits a (word, 1) key-value pair.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCounterMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words and emit (word, 1) for each one
        String[] words = value.toString().split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
Reducer Class (WordCounterRed.java)

The Reducer class takes the mapped key-value pairs and aggregates the counts for each unique
word.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCounterRed extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted for this word
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        context.write(word, new IntWritable(count));
    }
}

By running this program, we can efficiently count the occurrences of each word in a large
dataset.
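
To make the data flow concrete, here is a small worked trace (the input line is invented for illustration). Assuming the classes above are packaged into a jar and submitted with the hadoop jar command, the output directory /sample/wordcount would contain the final counts:

Input line:            the cat sat on the mat
Mapper output:         (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
After shuffle & sort:  (cat,[1]) (mat,[1]) (on,[1]) (sat,[1]) (the,[1,1])
Reducer output:        cat 1   mat 1   on 1   sat 1   the 2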

Managing Resources and Applications with Hadoop YARN (Yet Another Resource
Negotiator)

Introduction to Hadoop YARN

Hadoop YARN is a core component of Hadoop 2.x. It is an advanced, flexible, and scalable
resource management framework that enhances Hadoop beyond just Map Reduce. Unlike
Hadoop 1.0, which was strictly bound to MapReduce, YARN allows multiple applications to
share resources efficiently. This means that different types of data processing—such as batch
processing, interactive queries, streaming, and graph processing—can be performed in the
same Hadoop ecosystem.
Limitations of Hadoop 1.0 Architecture

Hadoop 1.0 had several limitations that led to inefficiencies in resource management and data
processing. The main issues included:

1. Single NameNode Bottleneck: In Hadoop 1.0, a single Name Node managed the
entire namespace of the Hadoop cluster, creating a single point of failure and limiting
scalability.

2. Limited Processing Model: The system primarily supported batch-oriented MapReduce
jobs, making it unsuitable for interactive or real-time data analysis.

3. Not Ideal for Advanced Analytics: Hadoop 1.0 struggled with workloads such as
machine learning, graph processing, and memory-intensive computations.

4. Inefficient Resource Management: The system allocated separate slots for map and
reduce tasks. This led to situations where map slots were full while reduce slots
remained idle, or vice versa, leading to poor resource utilization.


These issues demonstrated the need for an improved architecture, which was introduced in
Hadoop 2.x with YARN.
HDFS Limitation in Hadoop 1.0

The Hadoop Distributed File System (HDFS) faced a major challenge in its architecture:

 The NameNode stored all metadata in its main memory. While modern memory is
larger and cheaper than before, there is still a limit to how many files and objects a
single NameNode can manage.

 As the number of files in a cluster increased, the NameNode became overloaded,


leading to performance issues and a risk of failure.

Solution in Hadoop 2.x: HDFS Federation

To address these limitations, Hadoop 2.x introduced HDFS Federation, which allowed multiple
NameNodes to manage different portions of the file system namespace,
reducing the burden on any single NameNode and improving scalability.

HDFS 2 Features

Hadoop 2.x introduced the following key features to HDFS:

1. Horizontal Scalability: By allowing multiple Name Nodes, Hadoop 2.x could scale
efficiently as more data was added.

2. High Availability: A new feature called Active-Passive Standby NameNode was
introduced. In case of a failure of the Active NameNode, the Passive NameNode
would automatically take over, ensuring uninterrupted operation.

These improvements made Hadoop much more robust and capable of handling large-scale
enterprise workloads.

YARN: Expanding Hadoop Beyond Batch Processing

YARN (Yet Another Resource Negotiator) is a core component of Hadoop 2.x that
enhances resource management. Unlike Hadoop 1.0, where MapReduce handled both resource
management and data processing, YARN separates these concerns, making Hadoop more
flexible.

Key Advantages of YARN

 Allows Hadoop to support multiple data processing frameworks beyond Map Reduce,
such as Spark, Tez, and Storm.

 Provides better resource utilization by dynamically allocating resources based on demand.

 Improves overall system efficiency and enables real-time and interactive data processing.


Architecture of YARN

YARN introduces the following key components:

1. Resource Manager (Global)

o The Resource Manager is responsible for allocating cluster resources across all
applications.

o It consists of:

 Scheduler: Allocates resources based on demand but does not track the
progress of applications.

 Application Manager: Manages job submissions, resource negotiations,
and restarts failed applications.
2. Node Manager (Per-Machine Slave Daemon)

o Runs on each node in the cluster.

o Monitors resource usage (CPU, memory, disk, network) and reports to the
Resource Manager.

o Launches and tracks the execution of application containers.

3. Application Master (Per-Application Manager)

o Manages the execution of a specific application.

o Negotiates required resources with the Resource Manager.

o Works with the Node Manager to launch tasks.

Basic Concepts in YARN

Applications

 An application in YARN refers to a job submitted for execution.

 Example: A MapReduce job is an application that requires resources to execute.

Containers


 Containers are the basic units of resource allocation in YARN.

 They allow fine-grained resource allocation for different types of processing.

 Example:

o container_0 = 2 GB RAM, 1 CPU

o container_1 = 1 GB RAM, 6 CPUs

 This dynamic allocation replaces the fixed map/reduce slots from Hadoop 1.0.
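
As a hedged illustration of how such container requests look in code (the class name and resource values are arbitrary; the API shown is the YARN AMRMClient used by Application Masters), a request for one container could be built as follows:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
    public static void main(String[] args) {
        // Ask for 2048 MB of memory and 1 virtual core (illustrative values).
        Resource capability = Resource.newInstance(2048, 1);
        Priority priority = Priority.newInstance(0);
        // null node and rack lists mean "any node"; the scheduler decides the placement.
        ContainerRequest request = new ContainerRequest(capability, null, null, priority);
        System.out.println("Requesting container: " + request);
    }
}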
Working of YARN (Step-by-Step)

1. A client submits an application to the Resource Manager.

2. The Resource Manager allocates a container to launch the Application Master.

3. The Application Master registers with the Resource Manager, enabling the client to
track its progress.

4. The Application Master requests additional containers for running the actual tasks.

5. Upon successful allocation, the Application Master assigns tasks to the Node
Manager.

6. The Node Manager executes the tasks and reports back to the Application Master.
7. The client communicates with the Application Master for status updates.
8. Once the job is completed, the Application Master shuts down, and the allocated resources are
released for reuse.
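
A minimal sketch of step 1 using the YARN client API (the class name is illustrative; a running Resource Manager configured in yarn-site.xml is assumed) looks like this:

import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationResponse;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step 1: ask the Resource Manager to register a new application.
        YarnClientApplication app = yarnClient.createApplication();
        GetNewApplicationResponse response = app.getNewApplicationResponse();
        System.out.println("Granted application id: " + response.getApplicationId());

        // A complete client would now fill an ApplicationSubmissionContext (launch command and
        // resources for the Application Master) and call yarnClient.submitApplication(...).
        yarnClient.stop();
    }
}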


Interaction with the Hadoop Ecosystem


Pig: Pig is a data flow system for Hadoop. It uses Pig Latin to specify data flow. Pig is an
alternative to MapReduce Programming. It abstracts some details and allows you to focus on
data processing. It consists of two components. 1. Pig Latin: The data processing language.
2. Compiler: To translate Pig Latin to MapReduce Programming.
Hive: Hive is a Data Warehousing Layer on top of Hadoop. Analysis and queries can be done
using an SQL-like language. Hive can be used to do ad-hoc queries, summarization, and data
analysis. Figure 5.31 depicts Hive in the Hadoop ecosystem.
Sqoop: Sqoop is a tool which helps to transfer data between Hadoop and Relational
Databases. With the help of Sqoop, you can import data from RDBMS to HDFS and vice-
versa. Figure 5.32 depicts the Sqoop in Hadoop ecosystem.
HBase: HBase is a NoSQL database for Hadoop. HBase is column-oriented NoSQL
database. HBase is used to store billions of rows and millions of columns. HBase provides
random read/write operations. It also supports record-level updates, which are not possible with
HDFS.


MAP REDUCE
In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce
tasks. These tasks are then executed in a distributed fashion on the Hadoop cluster. Each task
processes a small subset of the data that has been assigned to it. This way, Hadoop distributes the
load across the cluster. A MapReduce job takes a set of files stored in HDFS (Hadoop
Distributed File System) as input. The map task takes care of loading, parsing, transforming, and
filtering. The responsibility of the reduce task is grouping and aggregating the data produced
by the map tasks to generate the final output. Each map task is broken into the following phases:
1. RecordReader.
2. Mapper.
3. Combiner.
4. Partitioner.
The output produced by a map task is known as intermediate keys and values. These
intermediate keys and values are sent to the reducer. Each reduce task is broken into the
following phases:
1. Shuffle.
2. Sort.
3. Reducer.
4. Output Format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides.
This way, Hadoop ensures data locality. Data locality means that data is not moved over the
network; instead, the computation is moved to the node where the data resides.

Mapper
A mapper maps the input key−value pairs into a set of intermediate key–value pairs. Maps are
individual tasks that have the responsibility of transforming input records into intermediate
key–value pairs.
1. RecordReader: The RecordReader converts a byte-oriented view of the input (as generated by
the InputSplit) into a record-oriented view and presents it to the Mapper tasks. It presents the
tasks with keys and values. Generally the key is the positional information and value is a
chunk of data that constitutes the record.


2. Map: Map function works on the key–value pair produced by RecordReader and
generates zero or more intermediate key–value pairs. The Map Reduce decides the key–value
pair based on the context.
3. Combiner: It is an optional function but provides high performance in terms of network
bandwidth and disk space. It takes intermediate key–value pair provided by mapper and
applies user-specific aggregate function to only that mapper. It is also known as local reducer.
4. Partitioner: The partitioner takes the intermediate key–value pairs produced by the
mapper, splits them into shards, and sends each shard to a particular reducer as per the user-
specific code. Usually, pairs with the same key go to the same reducer. The partitioned
data is written to the local disk of the mapper node, from where it is later fetched by the reducers.

Reducer
The primary chore of the Reducer is to reduce a set of intermediate values (the ones that share
a common key) to a smaller set of values. The Reducer has three primary phases: Shuffle and
Sort, Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and downloads it
onto the local machine where the reducer is running. These individual data pipes are then
sorted by key, which produces a larger data list. The main purpose of this sort is grouping
similar words so that their values can be easily iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase,
applies reduce function, and processes one group at a time. The reduce function iterates all
the values associated with that key. Reducer function provides various operations such as
aggregation, filtering, and combining data. Once it is done, the output (zero or more key–
value pairs) of reducer is sent to the output format.
3. Output Format: The output format separates each key–value pair with a tab (by default) and writes
it out to a file using a record writer. Figure 8.1 describes the chores of Mapper, Combiner,
Partitioner, and Reducer for the word count problem. The Word Count problem has been
discussed under “Combiner” and “Partitioner”.


Combiner
It is an optimization technique for a MapReduce job. Generally, the reducer class is set to be
the combiner class. The difference between the combiner class and the reducer class is as follows:
1. The output generated by the combiner is intermediate data, and it is passed to the reducer.
2. The output of the reducer is passed to the output file on the disk.
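
In the word count driver shown earlier, enabling a combiner is a one-line change; the sketch below (helper class invented for illustration) reuses WordCounterRed as the combiner, which is safe because word-count aggregation is associative and commutative:

import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    // Called from the driver after the mapper and reducer classes are set (sketch only).
    static void enableCombiner(Job job) {
        // The reducer doubles as a local reducer applied to each mapper's output.
        job.setCombinerClass(WordCounterRed.class);
    }
}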
Partitioner
The partitioning phase happens after the map phase and before the reduce phase. Usually, the number
of partitions is equal to the number of reducers. The default partitioner is a hash function applied to the key.
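
A custom partitioner can replace the default when finer control over key-to-reducer assignment is needed. The sketch below (class name invented; it simply mimics the default hash behaviour for the word-count types) shows the shape of such a class; it would be registered in the driver with job.setPartitionerClass(WordPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Pairs with the same key always map to the same reducer; the mask keeps the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}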
Searching
Sorting
Compression
In MapReduce programming, you can compress the MapReduce output file. Compression
provides two benefits as follows:


1. Reduces the space to store files.


2. Speeds up data transfer across the network.
You can specify the compression format in the Driver Program as shown below:

conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

Here, the codec is the implementation of a compression and decompression algorithm. GzipCodec is the
compression codec used for the gzip format.