
Hadoop Architecture

LECTURE 12

Zaeem Anwaar
Assistant Director IT
Hadoop Architecture
HDFS
HDFS (Hadoop Distributed File System) is an open-source component of the
Apache Software Foundation that stores and manages data
Name Nodes, Secondary Name Nodes, Data Nodes, and blocks make up the
architecture of HDFS
Key features of HDFS:
Scalability
Availability
Fault Tolerance
Replication
Rack awareness
Name Node
Monitors and controls all Data Node instances.
Permits the user to access (read and write) files.
Stores the block records for every Data Node instance.
Edit Logs are committed to disk after every write operation to the Name Node's data storage.
The data is then replicated across the other Data Nodes, including backup Data Nodes.
There are two kinds of files in the Name Node:
FsImage: Contains all the details about the filesystem (metadata), including all the
directories and files, in a hierarchical format. It is called a file-system image because it is a
point-in-time snapshot of the filesystem's state.
EditLogs: Keeps track of every modification made to the files of the filesystem since the
last FsImage was written.
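The relationship between the two files can be shown with a toy sketch. This is only an illustration of the idea (real FsImage and EditLogs are binary formats, and the paths, sizes, and operations below are invented):

```python
# Toy model: the FsImage is a snapshot of filesystem metadata, and the
# EditLogs record every change made after that snapshot. Replaying the
# logs over the image reconstructs the current state of the namespace.

fsimage = {"/data/file1.txt": 128, "/data/file2.txt": 64}  # path -> size in MB

edit_logs = [
    ("create", "/data/file3.txt", 32),    # a file created after the snapshot
    ("delete", "/data/file2.txt", None),  # a file deleted after the snapshot
]

def replay(image, logs):
    """Apply each logged operation to a copy of the snapshot."""
    state = dict(image)
    for op, path, size in logs:
        if op == "create":
            state[path] = size
        elif op == "delete":
            state.pop(path, None)
    return state

current = replay(fsimage, edit_logs)
print(sorted(current))  # ['/data/file1.txt', '/data/file3.txt']
```

Merging the EditLogs into the FsImage like this (checkpointing) is exactly the job of the Secondary NameNode described next.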
Secondary NameNode
Periodically merges the EditLogs into the FsImage to produce a checkpoint, so the
EditLogs do not grow without bound.
If the Name Node becomes unavailable, its state can be restored from this checkpoint;
the Secondary NameNode acts like a backup image (similar to an ISO file) for the
Name Node, not a hot standby.
Data Nodes and Blocks
Every slave machine that contains/stores data organizes the data blocks and replica
blocks on it
It handles the requested operations on files, such as reading file content, writing files,
and creating new data
Data blocks are replicated across Data Nodes to ensure data consistency and fault tolerance.
The Name Node periodically (every 3 seconds) receives a Heartbeat and a Block report
from each Data Node in the cluster, confirming that it is functioning properly
If a Node fails, the system automatically recovers the data from a replica and re-replicates
it across the remaining healthy Nodes.
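The heartbeat check can be sketched in a few lines. This is a hypothetical simplification (node names and the 10-second timeout are invented; real HDFS sends heartbeats every 3 seconds but waits much longer before declaring a node dead):

```python
import time

# The Name Node records when each Data Node last reported in, and treats
# a node as dead once no heartbeat has arrived within the timeout.

HEARTBEAT_TIMEOUT = 10  # seconds without a heartbeat before a node is considered dead

last_heartbeat = {}  # data node id -> timestamp of its last heartbeat

def receive_heartbeat(node_id, now=None):
    last_heartbeat[node_id] = now if now is not None else time.time()

def dead_nodes(now=None):
    now = now if now is not None else time.time()
    return [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

receive_heartbeat("datanode-1", now=100.0)
receive_heartbeat("datanode-2", now=108.0)
print(dead_nodes(now=112.0))  # ['datanode-1'] -- silent for 12 s, past the timeout
```

Once a node appears in the dead list, the Name Node schedules re-replication of its blocks onto the healthy nodes.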
HDFS splits a file into fixed-size blocks (128 MB by default). When the file is larger
than the block size, the data that does not fit is placed in the next block.
For example, if a file is 135 MB and the block size is 128 MB, two blocks will be created.
The first block holds 128 MB, while the second block stores the remaining 7 MB. The
second block occupies only 7 MB of underlying storage; the unused 121 MB is not wasted,
but it is not shared with other files either, since a block belongs to a single file.
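The block-splitting arithmetic above can be sketched directly (a minimal illustration, assuming sizes in whole megabytes):

```python
# Cut a file into fixed-size blocks; the last block holds only the leftover data.

def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the HDFS blocks for a file of the given size."""
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes

print(split_into_blocks(135))  # [128, 7] -- two blocks, the last holding 7 MB
```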
Data node and Block replication:
HDFS Architecture
MapReduce
It processes large data sets in a distributed and parallel manner
Consists of two distinct tasks/classes: Map (Mapper) and Reduce (Reducer)
The Mapper divides a problem into (key, value) pairs
The Reducer groups the matching keys and adds up their values

[Diagram: Input is split across parallel Map tasks; their output is grouped by key and passed to Reduce tasks, which produce the final Output]
Example already discussed in previous lecture
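The classic word-count example can be sketched in a single process. This is a minimal illustration of the Map and Reduce roles described above, not the Hadoop API (real MapReduce runs mappers and reducers on different nodes):

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map: emit a (key, value) pair for every word in a line of input."""
    return [(word.lower(), 1) for word in line.split()]

def reducer(pairs):
    """Reduce: group the matching keys and add up their values."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
intermediate = chain.from_iterable(mapper(line) for line in lines)
print(reducer(intermediate))  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```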
MapReduce Story
Map Reduce trackers
Master job tracker (only one)
Resource management
Scheduling tasks
Scheduling algorithms
First-Come, First-Served (follows FIFO)
Multiple-Level Queues
Shortest Remaining Time (SRT) – nearest to completion

Monitoring tasks
Slave task tracker (more than one)
Executes the tasks
Provides task status
MapReduce works on the ‘Divide and Conquer’ algorithm

A problem-solving approach in data structures and algorithms that divides the
problem into smaller subproblems, recursively solves each subproblem, and
combines the solutions to the subproblems to get the solution of the original
problem.
Subproblems are divided until each becomes an independent unit
One example is:
Merge Sort
Merge Sort

6 4 2 1 9 8 3 5
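The array above can be sorted with a direct divide-and-conquer sketch of merge sort:

```python
# Merge sort: split the list in half, recursively sort each half,
# then merge the two sorted halves back together.

def merge_sort(items):
    if len(items) <= 1:              # base case: an independent unit
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])   # divide and conquer the left half
    right = merge_sort(items[mid:])  # divide and conquer the right half
    return merge(left, right)        # combine the solutions

def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i]); i += 1
        else:
            result.append(right[j]); j += 1
    return result + left[i:] + right[j:]

print(merge_sort([6, 4, 2, 1, 9, 8, 3, 5]))  # [1, 2, 3, 4, 5, 6, 8, 9]
```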
Terminologies
Mapper − Maps the input key/value pairs to a set of intermediate key/value pairs.
Name Node − Node that manages the Hadoop Distributed File System (HDFS).
Data Node − Node where data is presented in advance before any processing takes place.
Master Node − Node where Job Tracker runs and which accepts job requests from clients.
Slave Node − Node where Map and Reduce program runs.
Job Tracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
Task Tracker − Tracks the task and reports status to Job Tracker.
Task − An execution of a Mapper or a Reducer on a slice of input data.
Activity:
Apply merge sort
38,27,43,3,9,82,10
Scaling in Cloud Computing

Scaling refers to scalability in cloud services (SaaS, PaaS, Hadoop, etc.):
increasing or decreasing IT computing resources, data storage capacity, processing
power, or networking resources as needed to meet changing demand.
Types of Scalability
Vertical scalability (Scaling up)
Horizontal scalability (Scaling out)
Vertical and Horizontal Scalability
Vertical Scalability

Imagine a 20-story hotel. There are innumerable rooms inside this hotel, from which
guests keep coming and going. Often there are spaces available, as not all rooms are filled
at once. People can move in easily as there is space for them. As long as the capacity of this
hotel is not exceeded, there is no problem. This is vertical scaling.
With computing, you can add or subtract resources, including memory or storage, within
the server, as long as the resources do not exceed the capacity of the machine.
Although it has its limitations, it is a way to improve your server and avoid latency and
extra management. Like in the hotel example, resources can come and go easily and
quickly, as long as there is room for them.
Horizontal Scalability

Imagine a two-lane highway. Cars travel smoothly in each direction without major traffic
problems. But then the area around the highway develops: new buildings go up, and
traffic increases. Very soon, this two-lane highway is filled with cars, and accidents
become common. Two lanes are no longer enough. To avoid these issues, more lanes are
added and an overpass is constructed. Although this takes a long time, it solves the problem.
Horizontal scaling refers to adding more servers to your network, rather than simply adding
resources like with vertical scaling.
This method tends to take more time and is more complex, but it allows you to connect
servers together, handle traffic efficiently, and execute concurrent workloads.
Benefits of Cloud Scalability

Convenience: With just a few clicks, IT administrators can easily
add more VMs, customized to an organization's exact needs, without
delay.
Flexibility and speed: As business needs change and grow, including
unexpected demand spikes, cloud scalability allows IT to respond
quickly.
Cost Savings: Businesses can avoid the upfront cost of purchasing
expensive equipment that can become obsolete in a few years.
Through cloud providers, they only pay for what they use and
reduce waste.
