UNIT 5
5.1 Hadoop:
Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware. It provides massive storage for any kind of data, enormous
processing power and the ability to handle virtually limitless concurrent tasks or jobs.
5.1.1 History of Hadoop:
In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache
Nutch. It is an open-source web crawler software project.
While working on Apache Nutch, they were dealing with big data. Storing that
data would have cost a great deal, which became a serious obstacle for the
project. This problem became one of the important reasons for the emergence
of Hadoop.
In 2003, Google introduced a file system known as GFS (Google File System). It is a
proprietary distributed file system developed to provide efficient access to data.
In 2004, Google released a white paper on MapReduce. This technique simplifies
data processing on large clusters.
In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also includes MapReduce.
In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch
project, Doug Cutting introduced a new project, Hadoop, with a file system known as
HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this
year. Doug Cutting named his project Hadoop after his son's toy elephant.
In 2007, Yahoo ran two clusters of 1000 machines.
In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900-node
cluster within 209 seconds.
[Figure: Hadoop core components - MapReduce (Distributed Computation) running on top of HDFS (Distributed Storage).]
MapReduce - MapReduce is a parallel programming model for writing distributed applications,
devised at Google for efficient processing of large amounts of data (multi-terabyte datasets)
on large clusters.
Hadoop Common - These are Java libraries and utilities required by other Hadoop
modules.
Hadoop YARN - This is a framework for job scheduling and cluster resource
management.
5.1.3 Working of Hadoop:
Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB), as illustrated in the sketch after this list.
These files are then distributed across various cluster nodes for further processing.
" HDFS, being on top of the local file system, supervises the processing.
Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java-based.
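As a rough illustration of the block division described above, the short Java sketch below computes how a single file maps onto 128 MB blocks. The file size and class name are hypothetical and only for illustration; in practice HDFS performs this splitting automatically when a file is written.

// Illustrative sketch only: how a file maps onto fixed-size HDFS blocks.
public class BlockSplitDemo {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;   // preferred block size: 128 MB

    public static void main(String[] args) {
        long fileSize   = 300L * 1024 * 1024;             // hypothetical 300 MB file
        long fullBlocks = fileSize / BLOCK_SIZE;          // completely filled blocks
        long lastBlock  = fileSize % BLOCK_SIZE;          // bytes left for the final block

        System.out.println("Full 128 MB blocks : " + fullBlocks);   // prints 2
        System.out.println("Final block (bytes): " + lastBlock);    // prints 46137344 (44 MB)
        // So a 300 MB file occupies 3 blocks (128 MB + 128 MB + 44 MB),
        // and each block is then distributed across the cluster nodes.
    }
}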
5.2 MapReduce:
[Figure: a traditional centralised system - users interact with a single relational/centralised database server.]
[Figure: the MapReduce model - input is processed by mappers (Map phase), and the intermediate keys they produce are passed to reducers (Reducer phase).]
Reducer - The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered,
and combined in a number of ways, and it requires a wide range of processing. Once
the execution is over, it gives zero or more key-value pairs to the final step.
Output Phase - In the output phase, we have an output formatter that translates the
final key-value pairs from the Reducer function and writes them onto a file using a
record writer.
5.2.3 Example: A word count example is given below:
[Figure: Word count example - Input Splitting, Mapping, Shuffling, Reducing, Final Result.
Input: "Deer Bear River", "Car Car River", "Deer Car Bear".
Input Splitting (K1, V1): each line becomes one split.
Mapping list(K2, V2): (Deer,1) (Bear,1) (River,1); (Car,1) (Car,1) (River,1); (Deer,1) (Car,1) (Bear,1).
Shuffling K2, List(V2): Bear,(1,1); Car,(1,1,1); Deer,(1,1); River,(1,1).
Reducing list(K3, V3): Bear,2; Car,3; Deer,2; River,2.
Final Result: Bear 2, Car 3, Deer 2, River 2.]
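The flow in the figure above can be written against Hadoop's Java MapReduce API. The sketch below is a minimal version of the classic WordCount program (input and output paths are supplied on the command line; names such as TokenizerMapper and IntSumReducer are illustrative): the mapper emits (word, 1) pairs, the framework shuffles them by key, and the reducer sums each group.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every word in the input split, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);          // e.g. (Deer, 1), (Bear, 1), (River, 1)
            }
        }
    }

    // Reduce phase: input is the shuffled (word, [1, 1, ...]); sum the list.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);            // e.g. (Bear, 2), (Car, 3)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar and run as, for example, hadoop jar wordcount.jar WordCount /input /output over the three sample lines, the job would write the final result shown in the figure: Bear 2, Car 3, Deer 2, River 2.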
VirtualBox is a general-purpose virtualization tool for x86 and x86-64 hardware, targeted at
server, desktop, and embedded use, that allows users and administrators to easily run
multiple guest operating systems on a single host.
To start with, we will download VirtualBox and install it. We should follow the steps given
below for the installation.