Sub: Big Data Analytics (BCS714D)

MODULE – II
Introduction to Hadoop

Prepared by

Dr Bindiya MK, Professor, Dept of CSE
Chetana K.N, [Link], Dept of CSE


Module-II
Hadoop Distributed File System
Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is the storage component of Hadoop, designed to
handle large datasets efficiently. It is inspired by the Google File System (GFS) and is
optimized for high-throughput operations.

Key Features of HDFS

1. Distributed Storage – HDFS spreads data across multiple machines, ensuring high availability.

2. Large Block Size – Instead of small file chunks, HDFS uses large blocks (default: 64 MB or 128 MB) to minimize disk seek time.

3. Fault Tolerance – Data is replicated across different nodes to prevent data loss.

4. Data Locality – Processing happens where the data is stored to improve efficiency.

5. Compatible with Various OS File Systems – It runs on ext3, ext4, or other native file systems.

HDFS Storage Example

If a file named [Link] is 192MB in size and the default block size is 64MB, HDFS will
divide it into three blocks and distribute them across different nodes. Each block is replicated
based on the default replication factor (3).
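
A quick way to see this block layout in practice is through the HDFS Java API. The following is a minimal sketch (the class name and the path /sample/data are invented for illustration) that lists the blocks of a file and the DataNodes holding each replica:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // handle to the cluster's file system
        Path file = new Path("/sample/data");            // hypothetical 192 MB file
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block; each lists the hosts storing a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block);                   // prints offset, length and hosts
        }
        fs.close();
    }
}

For a 192 MB file with a 64 MB block size, this loop would print three entries, each listing up to three hosts (one per replica).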

HDFS Components (Daemons)


1. Name Node (Master Node)

 Manages the file system namespace (metadata about files, directories, and block locations).

 Stores metadata in memory for fast access.

 Uses two files to keep track of data:

o FsImage – Stores the entire file system structure.

o Edit Log – Tracks changes like file creation, deletion, or renaming.

 Uses a rack-aware placement strategy to optimize performance and reliability.


2. Data Node (Worker Node)

 Stores the actual data blocks.

 Communicates with the NameNode through heartbeat messages (every 3 seconds) to
confirm that it is active.

 If a DataNode stops sending heartbeats, the NameNode automatically replicates the data to
another node.

3. Secondary Name Node (Checkpoint Node)

 Not a backup NameNode, but it helps by periodically saving the NameNode's metadata.

 Takes snapshots of the FsImage and Edit Log to prevent NameNode memory overload.

 If the NameNode fails, the Secondary NameNode can be used to manually restore the cluster.

How Data is Read from HDFS / Anatomy of a File Read Operation

Steps in Reading a File from HDFS

1. The client sends a read request to the NameNode.

2. The NameNode responds with the list of DataNodes where the blocks are stored.

3. The client reads from the nearest DataNode (for efficiency).

4. If a block is corrupted, the client reads from another replica.

5. After reading all blocks, the client assembles the complete file.


Example:
If you open a 500 MB video file, it will be read in 64 MB chunks from different DataNodes and merged
to play smoothly.
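
The same read path can be exercised from a client program. Below is a minimal sketch using the HDFS Java API (the class name and the path /sample/input.txt are chosen only for illustration); open() first consults the NameNode for block locations, and the returned stream then pulls blocks from the nearest DataNodes:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // open() asks the NameNode where the blocks live; reads go to the closest replica.
        FSDataInputStream in = fs.open(new Path("/sample/input.txt"));
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);   // block boundaries are invisible to the client
        }
        reader.close();
        fs.close();
    }
}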


Anatomy of a File Write Operation / How Data is Written to HDFS

Steps in Writing a File to HDFS

1. The client requests file creation from the NameNode.


2. The NameNode checks whether the file already exists. If not, it allocates blocks.

3. Data is divided into packets and sent to Data Nodes in a pipeline.

4. The first DataNode stores the packet and forwards it to the next Data Node.


5. Each DataNode sends an acknowledgment back to confirm successful storage.

6. Once all blocks are stored, the client closes the connection.

Example:
If you upload a 2 GB dataset, HDFS will divide it into 64 MB blocks, replicate them, and store
them in different nodes.
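
Programmatically, the same write pipeline is triggered through FileSystem.create(). The sketch below is illustrative only (the output path and the text written are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // create() asks the NameNode to allocate blocks; data then flows through the DataNode pipeline.
        FSDataOutputStream out = fs.create(new Path("/sample/output.txt"));
        out.writeBytes("hello hdfs\n");
        out.close();   // close() returns only after the pipeline acknowledgments arrive
        fs.close();
    }
}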

Replica Placement Strategy in HDFS

Default Replication Strategy (Replication Factor: 3)

1. First Replica – Stored on the same node as the client.

2. Second Replica – Stored on a different rack for redundancy.

3. Third Replica – Stored on the same rack as the second but on a different node.

This ensures high availability and fault tolerance while reducing network congestion.
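
The replication factor itself is configurable. As a small sketch (the property value and the file path are illustrative), it can be set in the client configuration for new files or changed for an existing file through the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);               // default factor for files created by this client
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an already stored file (hypothetical path).
        fs.setReplication(new Path("/sample/data"), (short) 3);
        fs.close();
    }
}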
Common HDFS Commands

Command                                        Action
hadoop fs -ls /                                Lists all directories and files at the root of HDFS.
hadoop fs -mkdir /sample                       Creates a directory named sample in HDFS.
hadoop fs -put [Link] /sample/                 Copies a file from local storage to HDFS.
hadoop fs -get /sample/[Link] /                Retrieves a file from HDFS to the local system.
hadoop fs -copyFromLocal [Link] /sample/       Copies a file from local to HDFS.
hadoop fs -copyToLocal /sample/[Link] [Link]   Copies a file from HDFS to local.
hadoop fs -cat /sample/[Link]                  Displays the content of an HDFS file.
hadoop fs -rm -r /sample/                      Deletes a directory from HDFS.

Special Features of HDFS

1. Data Replication – Ensures redundancy by storing multiple copies of data.

2. Data Pipeline – Efficient writing of data using a pipeline mechanism.

3. Fault Tolerance – Automatic data recovery in case of failure.

4. Scalability – Easily handles petabytes of data by adding new nodes.

Processing Data with Hadoop

MapReduce Daemons

MapReduce operates using two key daemons: JobTracker and TaskTracker. These components
work together to manage and execute tasks in a distributed environment.

JobTracker

The JobTracker is the central daemon that coordinates the execution of a MapReduce job. It
functions as the master node in a Hadoop cluster and is responsible for assigning tasks to
various nodes in the system. When a user submits a job, the JobTracker first creates an
execution plan by deciding how to split the input data and distribute tasks among available
TaskTrackers.

It continuously monitors the status of each task and takes corrective measures if a failure
occurs. For instance, if a TaskTracker stops responding, the JobTracker assumes that the task
has failed and reassigns it to another TaskTracker. In a Hadoop cluster, there is only one
JobTracker.

TaskTracker

The TaskTracker is a daemon that runs on each worker node in a Hadoop cluster. It is
responsible for executing the tasks assigned to it by the JobTracker. Each node in the
cluster has a single TaskTracker, which manages multiple JVM instances to process multiple
tasks simultaneously.

The TaskTracker continuously sends heartbeat signals to the JobTracker to indicate its health and
availability. If the JobTracker fails to receive a heartbeat from a TaskTracker within a certain
time frame, it assumes that the node has failed and reschedules the task on another
node. This mechanism helps in maintaining fault tolerance and
ensuring efficient use of cluster resources.

How Does MapReduce Work?


MapReduce follows a two-step process: map and reduce. This approach enables the parallel
processing of large datasets by dividing the workload across multiple nodes.


Workflow of MapReduce

MapReduce works by splitting a large dataset into multiple smaller chunks, which are
processed independently by different worker nodes. Each mapper operates on a subset of the
data and emits intermediate key-value pairs, which are then sent to the reducer. The reducer
aggregates and processes them to generate the final output.

Steps in MapReduce Execution

The execution of a MapReduce job follows these steps:

1. The input dataset is divided into multiple smaller pieces. These data chunks are
processed in parallel by different nodes.

2. A master process creates multiple worker processes and assigns them to different
nodes in the cluster.

3. Each mapper processes its assigned data chunk, applies the map function, and
generates key-value pairs. For example, in a word count program, the map function
converts a sentence into key-value pairs like (word, 1).

4. The partitioner function then distributes the mapped data into regions. It determines
which reducer should process a particular key-value pair.


5. Once all mappers complete their tasks, the master node instructs the reducers to
start processing. The reducers first retrieve, shuffle, and sort the mapped key-value
pairs based on the key.

6. The reducers then apply the reduce function, which aggregates values for each
unique key and writes the final result to a file.

7. Once all reducers complete their tasks, the system returns control to the user, and the
job is marked as complete.

By following this approach, Map Reduce enables efficient distributed computing, allowing
massive datasets to be processed quickly.

MapReduce Example: Word Count

A classic example of Map Reduce programming is counting the occurrences of words across
multiple files. This is achieved by using three main classes: the Driver class, the Mapper
class, and the Reducer class.


Implementation of Word Count in Java

In this example, we will write a Map Reduce program to count how many times each word
appears in a collection of text files.

Driver Class (WordCounter.java)

The Driver class is responsible for setting up the Map Reduce job configuration. It defines the
mapper and reducer classes, specifies the input and output paths, and submits the job for
execution.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCounter {

    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJobName("wordcounter");
        job.setJarByClass(WordCounter.class);
        job.setMapperClass(WordCounterMap.class);
        job.setReducerClass(WordCounterRed.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input file and output directory in HDFS
        FileInputFormat.addInputPath(job, new Path("/sample/[Link]"));
        FileOutputFormat.setOutputPath(job, new Path("/sample/wordcount"));

        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Mapper Class (WordCounterMap.java)

The Mapper class takes an input file and splits each line into individual words. For each
word encountered, it emits a (word, 1) key-value pair.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCounterMap extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into words and emit (word, 1) for each one
        String[] words = value.toString().split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
Reducer Class (WordCounterRed.java)

The Reducer class takes the mapped key-value pairs and aggregates the counts for each unique
word.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCounterRed extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted for this word
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        context.write(word, new IntWritable(count));
    }
}

By running this program, we can efficiently count the occurrences of each word in a large
dataset.
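
To make the data flow concrete, here is a small worked trace (the input line is invented for illustration). Assuming the classes above are packaged into a jar and submitted with the hadoop jar command, the output directory /sample/wordcount would contain the final counts:

Input line:            the cat sat on the mat
Mapper output:         (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
After shuffle & sort:  (cat,[1]) (mat,[1]) (on,[1]) (sat,[1]) (the,[1,1])
Reducer output:        cat 1   mat 1   on 1   sat 1   the 2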

Managing Resources and Applications with Hadoop YARN (Yet Another Resource
Negotiator)

Introduction to Hadoop YARN

Hadoop YARN is a core component of Hadoop 2.x. It is an advanced, flexible, and scalable
resource management framework that enhances Hadoop beyond just Map Reduce. Unlike
Hadoop 1.0, which was strictly bound to MapReduce, YARN allows multiple applications to
share resources efficiently. This means that different types of data processing—such as batch
processing, interactive queries, streaming, and graph processing—can be performed in the
same Hadoop ecosystem.
Limitations of Hadoop 1.0 Architecture

Hadoop 1.0 had several limitations that led to inefficiencies in resource management and data
processing. The main issues included:

1. Single NameNode Bottleneck: In Hadoop 1.0, a single Name Node managed the
entire namespace of the Hadoop cluster, creating a single point of failure and limiting
scalability.

2. Limited Processing Model: The system primarily supported batch-oriented MapReduce
jobs, making it unsuitable for interactive or real-time data analysis.

3. Not Ideal for Advanced Analytics: Hadoop 1.0 struggled with workloads such as
machine learning, graph processing, and memory-intensive computations.

4. Inefficient Resource Management: The system allocated separate slots for map and
reduce tasks. This led to situations where map slots were full while reduce slots
remained idle, or vice versa, leading to poor resource utilization.


These issues demonstrated the need for an improved architecture, which was introduced in
Hadoop 2.x with YARN.
HDFS Limitation in Hadoop 1.0

The Hadoop Distributed File System (HDFS) faced a major challenge in its architecture:

 The NameNode stored all metadata in its main memory. While modern memory is
larger and cheaper than before, there is still a limit to how many files and objects a
single NameNode can manage.

 As the number of files in a cluster increased, the NameNode became overloaded,


leading to performance issues and a risk of failure.

Solution in Hadoop 2.x: HDFS Federation

To address these limitations, Hadoop 2.x introduced HDFS Federation, which allowed multiple
NameNodes to manage different portions of the file system namespace,
reducing the burden on any single NameNode and improving scalability.

HDFS 2 Features

Hadoop 2.x introduced the following key features to HDFS:

1. Horizontal Scalability: By allowing multiple Name Nodes, Hadoop 2.x could scale
efficiently as more data was added.

2. High Availability: A new feature called Active-Passive Standby NameNode was
introduced. In case of a failure of the Active NameNode, the Passive NameNode
would automatically take over, ensuring uninterrupted operation.

These improvements made Hadoop much more robust and capable of handling large-scale
enterprise workloads.

YARN: Expanding Hadoop Beyond Batch Processing

YARN (Yet Another Resource Negotiator) is a core component of Hadoop 2.x that
enhances resource management. Unlike Hadoop 1.0, where MapReduce handled both resource
management and data processing, YARN separates these concerns, making Hadoop more
flexible.

Key Advantages of YARN

 Allows Hadoop to support multiple data processing frameworks beyond Map Reduce,
such as Spark, Tez, and Storm.

 Provides better resource utilization by dynamically allocating resources based on demand.

 Improves overall system efficiency and enables real-time and interactive data processing.


Architecture of YARN

YARN introduces the following key components:

1. Resource Manager (Global)

o The Resource Manager is responsible for allocating cluster resources across all
applications.

o It consists of:

 Scheduler: Allocates resources based on demand but does not track the
progress of applications.

 Application Manager: Manages job submissions, resource negotiations,
and restarts failed applications.
2. Node Manager (Per-Machine Slave Daemon)

o Runs on each node in the cluster.

o Monitors resource usage (CPU, memory, disk, network) and reports to the
Resource Manager.

o Launches and tracks the execution of application containers.

3. Application Master (Per-Application Manager)

o Manages the execution of a specific application.

o Negotiates required resources with the Resource Manager.

o Works with the Node Manager to launch tasks.

Basic Concepts in YARN

Applications

 An application in YARN refers to a job submitted for execution.

 Example: A MapReduce job is an application that requires resources to execute.

Containers


 Containers are the basic units of resource allocation in YARN.

 They allow fine-grained resource allocation for different types of processing.

 Example:

o container_0 = 2 GB RAM, 1 CPU

o container_1 = 1 GB RAM, 6 CPUs

 This dynamic allocation replaces the fixed map/reduce slots from Hadoop 1.0.
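
As a hedged illustration of how such container requests look in code (the class name and resource values are arbitrary; the API shown is the YARN AMRMClient used by Application Masters), a request for one container could be built as follows:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
    public static void main(String[] args) {
        // Ask for 2048 MB of memory and 1 virtual core (illustrative values).
        Resource capability = Resource.newInstance(2048, 1);
        Priority priority = Priority.newInstance(0);
        // null node and rack lists mean "any node"; the scheduler decides the placement.
        ContainerRequest request = new ContainerRequest(capability, null, null, priority);
        System.out.println("Requesting container: " + request);
    }
}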
Working of YARN (Step-by-Step)

1. A client submits an application to the Resource Manager.

2. The Resource Manager allocates a container to launch the Application Master.

3. The Application Master registers with the Resource Manager, enabling the client to
track its progress.

4. The Application Master requests additional containers for running the actual tasks.

5. Upon successful allocation, the Application Master assigns tasks to the Node
Manager.

6. The Node Manager executes the tasks and reports back to the Application Master.
7. The client communicates with the Application Master for status updates.
8. Once the job is completed, the Application Master shuts down, and the allocated resources are
released for reuse.
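
A minimal sketch of step 1 using the YARN client API (the class name is illustrative; a running Resource Manager configured in yarn-site.xml is assumed) looks like this:

import org.apache.hadoop.yarn.api.protocolrecords.GetNewApplicationResponse;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the Resource Manager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step 1: ask the Resource Manager to register a new application.
        YarnClientApplication app = yarnClient.createApplication();
        GetNewApplicationResponse response = app.getNewApplicationResponse();
        System.out.println("Granted application id: " + response.getApplicationId());

        // A complete client would now fill an ApplicationSubmissionContext (launch command and
        // resources for the Application Master) and call yarnClient.submitApplication(...).
        yarnClient.stop();
    }
}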


Interaction with the Hadoop Ecosystem


Pig: Pig is a data flow system for Hadoop. It uses Pig Latin to specify data flow. Pig is an
alternative to MapReduce Programming. It abstracts some details and allows you to focus on
data processing. It consists of two components. 1. Pig Latin: The data processing language.
2. Compiler: To translate Pig Latin to MapReduce Programming.
Hive: Hive is a Data Warehousing Layer on top of Hadoop. Analysis and queries can be done
using an SQL-like language. Hive can be used to do ad-hoc queries, summarization, and data
analysis. Figure 5.31 depicts Hive in the Hadoop ecosystem.
Sqoop: Sqoop is a tool which helps to transfer data between Hadoop and Relational
Databases. With the help of Sqoop, you can import data from RDBMS to HDFS and vice-
versa. Figure 5.32 depicts the Sqoop in Hadoop ecosystem.
HBase: HBase is a NoSQL database for Hadoop. HBase is column-oriented NoSQL
database. HBase is used to store billions of rows and millions of columns. HBase provides
random read/write operations. It also supports record-level updates, which are not possible with
HDFS.


MAP REDUCE
In MapReduce programming, jobs (applications) are split into a set of map tasks and reduce
tasks. These tasks are then executed in a distributed fashion on the Hadoop cluster. Each task
processes a small subset of the data that has been assigned to it. This way, Hadoop distributes the
load across the cluster. A MapReduce job takes a set of files stored in HDFS (Hadoop
Distributed File System) as input. The map task takes care of loading, parsing, transforming, and
filtering. The responsibility of the reduce task is grouping and aggregating the data produced
by the map tasks to generate the final output. Each map task is broken into the following phases:
1. RecordReader.
2. Mapper.
3. Combiner.
4. Partitioner.
The output produced by a map task is known as intermediate keys and values. These
intermediate keys and values are sent to the reducer. Each reduce task is broken into the
following phases:
1. Shuffle.
2. Sort.
3. Reducer.
4. Output Format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed resides.
This way, Hadoop ensures data locality. Data locality means that data is not moved over the
network; instead, the computation is moved to the node where the data resides.

Mapper
A mapper maps the input key−value pairs into a set of intermediate key–value pairs. Maps are
individual tasks that have the responsibility of transforming input records into intermediate
key–value pairs.
1. RecordReader: The RecordReader converts a byte-oriented view of the input (as generated by
the InputSplit) into a record-oriented view and presents it to the Mapper tasks. It presents the
tasks with keys and values. Generally the key is the positional information and value is a
chunk of data that constitutes the record.


2. Map: Map function works on the key–value pair produced by RecordReader and
generates zero or more intermediate key–value pairs. The Map Reduce decides the key–value
pair based on the context.
3. Combiner: It is an optional function but provides high performance in terms of network
bandwidth and disk space. It takes intermediate key–value pair provided by mapper and
applies user-specific aggregate function to only that mapper. It is also known as local reducer.
4. Partitioner: The partitioner takes the intermediate key–value pairs produced by the
mapper, splits them into shards, and sends each shard to a particular reducer as per the user-
specific code. Usually, pairs with the same key go to the same reducer. The partitioned
data is written to the local disk of the mapper node, from where it is later fetched by the reducers.

Reducer
The primary chore of the Reducer is to reduce a set of intermediate values (the ones that share
a common key) to a smaller set of values. The Reducer has three primary phases: Shuffle and
Sort, Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and downloads it
onto the local machine where the reducer is running. These individual data pipes are then
sorted by key, which produces a larger data list. The main purpose of this sort is grouping
similar words so that their values can be easily iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase,
applies reduce function, and processes one group at a time. The reduce function iterates all
the values associated with that key. Reducer function provides various operations such as
aggregation, filtering, and combining data. Once it is done, the output (zero or more key–
value pairs) of reducer is sent to the output format.
3. Output Format: The output format separates each key–value pair with a tab (by default) and writes
it out to a file using a record writer. Figure 8.1 describes the chores of Mapper, Combiner,
Partitioner, and Reducer for the word count problem. The Word Count problem has been
discussed under “Combiner” and “Partitioner”.


Combiner
It is an optimization technique for a MapReduce job. Generally, the reducer class is set to be
the combiner class. The difference between the combiner class and the reducer class is as follows:
1. The output generated by the combiner is intermediate data, and it is passed to the reducer.
2. The output of the reducer is passed to the output file on the disk.
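
In the word count driver shown earlier, enabling a combiner is a one-line change; the sketch below (helper class invented for illustration) reuses WordCounterRed as the combiner, which is safe because word-count aggregation is associative and commutative:

import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    // Called from the driver after the mapper and reducer classes are set (sketch only).
    static void enableCombiner(Job job) {
        // The reducer doubles as a local reducer applied to each mapper's output.
        job.setCombinerClass(WordCounterRed.class);
    }
}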
Partitioner
The partitioning phase happens after the map phase and before the reduce phase. Usually, the number
of partitions is equal to the number of reducers. The default partitioner is a hash function applied to the key.
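
A custom partitioner can replace the default when finer control over key-to-reducer assignment is needed. The sketch below (class name invented; it simply mimics the default hash behaviour for the word-count types) shows the shape of such a class; it would be registered in the driver with job.setPartitionerClass(WordPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Pairs with the same key always map to the same reducer; the mask keeps the result non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}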
Searching
Sorting
Compression
In MapReduce programming, you can compress the MapReduce output file. Compression
provides two benefits as follows:


1. Reduces the space to store files.


2. Speeds up data transfer across the network.
You can specify the compression format in the Driver Program as shown below:

conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

Here, the codec is the implementation of a compression and decompression algorithm. GzipCodec is the
compression codec used for the gzip format.