
Hadoop Architecture

LECTURE 12

Zaeem Anwaar
Assistant Director IT
Hadoop Architecture
HDFS
HDFS (Hadoop Distributed File System) is an open-source component of the
Apache Software Foundation that stores and manages data
Name Nodes, Secondary Name Nodes, Data Nodes, and blocks make up the
architecture of HDFS
Key features of HDFS:
Scalability
Availability
Fault Tolerance
Replication
Rack awareness
Name Node
Monitors and controls all Data Node instances.
Permits the user to access (read and write) files.
Stores the block records for every Data Node instance.
Edit Logs are committed to disk after every write operation to the Name Node's data storage.
The data is then replicated across the other Data Nodes, including backup Data Nodes.
There are two kinds of files in the Name Node:
FsImage: Contains all the details about the filesystem (metadata), including all the
directories and files, in a hierarchical format. It is called a file-system image because it is a
point-in-time snapshot of the filesystem's state.
EditLogs: Keeps track of every modification made to the files of the filesystem since the
last FsImage was written.
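The relationship between the two files can be shown with a toy sketch. This is only an illustration of the idea (real FsImage and EditLogs are binary formats, and the paths, sizes, and operations below are invented):

```python
# Toy model: the FsImage is a snapshot of filesystem metadata, and the
# EditLogs record every change made after that snapshot. Replaying the
# logs over the image reconstructs the current state of the namespace.

fsimage = {"/data/file1.txt": 128, "/data/file2.txt": 64}  # path -> size in MB

edit_logs = [
    ("create", "/data/file3.txt", 32),    # a file created after the snapshot
    ("delete", "/data/file2.txt", None),  # a file deleted after the snapshot
]

def replay(image, logs):
    """Apply each logged operation to a copy of the snapshot."""
    state = dict(image)
    for op, path, size in logs:
        if op == "create":
            state[path] = size
        elif op == "delete":
            state.pop(path, None)
    return state

current = replay(fsimage, edit_logs)
print(sorted(current))  # ['/data/file1.txt', '/data/file3.txt']
```

Merging the EditLogs into the FsImage like this (checkpointing) is exactly the job of the Secondary NameNode described next.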
Secondary NameNode
Periodically merges the EditLogs into the FsImage to produce a checkpoint, so the
EditLogs do not grow without bound.
If the Name Node becomes unavailable, its state can be restored from this checkpoint;
the Secondary NameNode acts like a backup image (similar to an ISO file) for the
Name Node, not a hot standby.
Data Nodes and Blocks
Every slave machine that contains/stores data organizes the data blocks and replica
blocks on it
It handles the requested operations on files, such as reading file content, writing files,
and creating new data
Data blocks are replicated across Data Nodes to ensure data consistency and fault tolerance.
The Name Node periodically (every 3 seconds) receives a Heartbeat and a Block report
from each Data Node in the cluster, confirming that it is functioning properly
If a Node fails, the system automatically recovers the data from a replica and re-replicates
it across the remaining healthy Nodes.
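The heartbeat check can be sketched in a few lines. This is a hypothetical simplification (node names and the 10-second timeout are invented; real HDFS sends heartbeats every 3 seconds but waits much longer before declaring a node dead):

```python
import time

# The Name Node records when each Data Node last reported in, and treats
# a node as dead once no heartbeat has arrived within the timeout.

HEARTBEAT_TIMEOUT = 10  # seconds without a heartbeat before a node is considered dead

last_heartbeat = {}  # data node id -> timestamp of its last heartbeat

def receive_heartbeat(node_id, now=None):
    last_heartbeat[node_id] = now if now is not None else time.time()

def dead_nodes(now=None):
    now = now if now is not None else time.time()
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

receive_heartbeat("datanode-1", now=100.0)
receive_heartbeat("datanode-2", now=108.0)
print(dead_nodes(now=112.0))  # ['datanode-1'] -- silent for 12 s, past the timeout
```

Once a node appears in the dead list, the Name Node schedules re-replication of its blocks onto the healthy nodes.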
HDFS splits a file into fixed-size blocks (128 MB by default). When the file is larger
than the block size, the data that does not fit is placed in the next block.
For example, if a file is 135 MB and the block size is 128 MB, two blocks will be created.
The first block holds 128 MB, while the second block stores the remaining 7 MB. The
second block occupies only 7 MB of underlying storage; the unused 121 MB is not wasted,
but it is not shared with other files either, since a block belongs to a single file.
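The block-splitting arithmetic above can be sketched directly (a minimal illustration, assuming sizes in whole megabytes):

```python
# Cut a file into fixed-size blocks; the last block holds only the leftover data.

def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the HDFS blocks for a file of the given size."""
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes

print(split_into_blocks(135))  # [128, 7] -- two blocks, the last holding 7 MB
```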
Data node and Block replication:
HDFS Architecture
MapReduce
It processes large data sets in a distributed and parallel manner
Consists of two distinct tasks/classes: Map (Mapper) and Reduce (Reducer)
The Mapper divides a problem into (key, value) pairs
The Reducer groups the matching keys and adds up their values

[Diagram: Input is split across parallel Map tasks; their output is grouped by key and passed to Reduce tasks, which produce the final Output]
Example already discussed in previous lecture
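The classic word-count example can be sketched in a single process. This is a minimal illustration of the Map and Reduce roles described above, not the Hadoop API (real MapReduce runs mappers and reducers on different nodes):

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map: emit a (key, value) pair for every word in a line of input."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(pairs):
    """Reduce: group the matching keys and add up their values."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
intermediate = chain.from_iterable(mapper(line) for line in lines)
print(reducer(intermediate))  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```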
MapReduce Story
Map Reduce trackers
Master job tracker (only one)
Resource management
Scheduling tasks
Scheduling algorithms
First-Come, First-Served (follows FIFO)
Multiple-Level Queues
Shortest Remaining Time (SRT) – nearest to completion

Monitoring tasks
Slave task tracker (more than one)
Executes the tasks
Provides task status
MapReduce works on the ‘Divide and Conquer’ algorithm

A problem-solving approach in data structures and algorithms that divides the
problem into smaller subproblems, recursively solves each subproblem, and
combines the solutions to the subproblems to get the solution of the original
problem.
Subproblems are divided until each becomes an independent unit
One example is:
Merge Sort
Merge Sort

6 4 2 1 9 8 3 5
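The array above can be sorted with a direct divide-and-conquer sketch of merge sort:

```python
# Merge sort: split the list in half, recursively sort each half,
# then merge the two sorted halves back together.

def merge_sort(items):
    if len(items) <= 1:              # base case: an independent unit
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])   # divide and conquer the left half
    right = merge_sort(items[mid:])  # divide and conquer the right half
    return merge(left, right)        # combine the solutions

def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i]); i += 1
        else:
            result.append(right[j]); j += 1
    return result + left[i:] + right[j:]

print(merge_sort([6, 4, 2, 1, 9, 8, 3, 5]))  # [1, 2, 3, 4, 5, 6, 8, 9]
```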
Terminologies
Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
Name Node − Node that manages the Hadoop Distributed File System (HDFS).
Data Node − Node where data is presented in advance before any processing takes place.
Master Node − Node where Job Tracker runs and which accepts job requests from clients.
Slave Node − Node where Map and Reduce program runs.
Job Tracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
Task Tracker − Tracks the task and reports status to Job Tracker.
Task − An execution of a Mapper or a Reducer on a slice of input data.
Activity:
Apply merge sort
38,27,43,3,9,82,10
Scaling in Cloud Computing

Scaling refers to scalability in cloud services (SaaS, PaaS, Hadoop, etc.):
increasing or decreasing IT computing resources, data storage capacity, processing
power, or networking resources as needed to meet changing demand.
Types of Scalability
Vertical scalability (Scaling up)
Horizontal scalability (Scaling out)
Vertical and Horizontal Scalability
Vertical Scalability

Imagine a 20-story hotel. There are innumerable rooms inside this hotel, from which
guests keep coming and going. Often there are spaces available, as not all rooms are filled
at once. People can move in easily as there is space for them. As long as the capacity of this
hotel is not exceeded, there is no problem. This is vertical scaling.
With computing, you can add or subtract resources, including memory or storage, within
the server, as long as the resources do not exceed the capacity of the machine.
Although it has its limitations, it is a way to improve your server and avoid latency and
extra management. Like in the hotel example, resources can come and go easily and
quickly, as long as there is room for them.
Horizontal Scalability

Imagine a two-lane highway. Cars travel smoothly in each direction without major traffic
problems. But then the area around the highway develops: new buildings go up, and
traffic increases. Very soon, this two-lane highway is filled with cars, and accidents
become common. Two lanes are no longer enough. To avoid these issues, more lanes are
added and an overpass is constructed. Although this takes a long time, it solves the problem.
Horizontal scaling refers to adding more servers to your network, rather than simply adding
resources like with vertical scaling.
This method tends to take more time and is more complex, but it allows you to connect
servers together, handle traffic efficiently, and execute concurrent workloads.
Benefits of Cloud Scalability

Convenience: With just a few clicks, IT administrators can easily
add more VMs, customized to an organization's exact needs, without
delay.
Flexibility and speed: As business needs change and grow, including
unexpected demand spikes, cloud scalability allows IT to respond
quickly.
Cost Savings: Businesses can avoid the upfront cost of purchasing
expensive equipment that can become obsolete in a few years.
Through cloud providers, they only pay for what they use and
reduce waste.
