
CLOUD COMPUTING (RCS075/ROE073) 2020-21

UNIT 5

CLOUD TECHNOLOGIES AND ADVANCEMENTS

5.1 Hadoop:
Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware. It provides massive storage for any kind of data, enormous
processing power and the ability to handle virtually limitless concurrent tasks or jobs.

5.1.1 History of Hadoop:
In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache
Nutch. It is an open-source web crawler software project.

While working on Apache Nutch, they were dealing with big data. Storing that
data was very costly, and this problem became one of the important reasons for
the emergence of Hadoop.

In 2003, Google introduced a file system known as GFS (Google File System). It is a
proprietary distributed file system developed to provide efficient access to data.
In 2004, Google released a white paper on MapReduce. This technique simplifies the
data processing on large clusters.
In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also includes MapReduce.
In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug
Cutting introduced a new project, Hadoop, with a file system known as HDFS
(Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this
year. Doug Cutting named the project Hadoop after his son's toy elephant.



In 2007, Yahoo began running two clusters of 1000 machines.
In 2008, Hadoop became the fastest system to sort one terabyte of data on a
900-node cluster, within 209 seconds.
In 2013, Hadoop 2.2 was released.
In 2017, Hadoop 3.0 was released.
In 2018, Apache Hadoop 3.1 was released.

5.1.2 Hadoop Architecture:

MapReduce (Distributed Computation)
HDFS (Distributed Storage)
YARN Framework | Common Utilities

Fig 5.1 Hadoop Architecture

Hadoop architecture consists of the following modules:

MapReduce
MapReduce is a parallel programming model for writing distributed applications, devised at
Google for efficient processing of large amounts of data (multi-terabyte datasets) on large
clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The MapReduce program runs on Hadoop, which is an Apache open-source framework.
Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has
many similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be
deployed on low-cost hardware. It provides high-throughput access to application data and is
suitable for applications having large datasets.
Apart from the above-mentioned two core components, the Hadoop framework also includes
the following two modules (a minimal configuration sketch is given after this list) -

Hadoop Common - These are Java libraries and utilities required by other Hadoop
modules.

Hadoop YARN - This is a framework for job scheduling and cluster resource
management.
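As an illustrative sketch (not from these notes), the modules are wired together on each
node through XML configuration files; the file names and property keys below are standard
Hadoop conventions, but the host names are placeholders:

<!-- core-site.xml: points HDFS clients at the namenode (placeholder host/port) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>

<!-- yarn-site.xml: points NodeManagers at the YARN ResourceManager (placeholder host) -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager</value>
  </property>
</configuration>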

5.1.3 Working of Hadoop:

Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
In addition, Hadoop takes care of:
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.

Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
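To make the block size and replication settings above concrete, here is a minimal sketch of
a client writing a file to HDFS through the standard org.apache.hadoop.fs Java API. The
namenode address and file path are placeholder assumptions, not values from these notes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; "namenode:9000" is a placeholder address.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt"); // hypothetical path

        long blockSize = 128L * 1024 * 1024; // 128 MB blocks, as described above
        short replication = 3;               // three replicas per block

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replication, blockSize)) {
            out.writeUTF("Hello HDFS");
        }
        fs.close();
    }
}

HDFS then splits the file into 128 MB blocks and places three copies of each block across
the cluster, which is what allows it to tolerate node failures.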
5.1.4 Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is
efficient, and it automatically distributes the data and work across the machines and, in
turn, utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures
at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues
to operate without interruption.

Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms, since it is Java based.
5.2 MapReduce:

MapReduce is a data processing tool which is used to process data in parallel in a
distributed form. It was developed in 2004, on the basis of the paper titled "MapReduce:
Simplified Data Processing on Large Clusters," published by Google.

5.2.1 Why MapReduce?


Traditional enterprise systems normally have a centralized server to store and process data.
The following illustration depicts a schematic view of a traditional enterprise system.
The traditional model is certainly not suitable for processing huge volumes of scalable
data, which cannot be accommodated by standard database servers. Moreover, the centralized
system creates too much of a bottleneck while processing multiple files simultaneously.


Fig 5.2 Traditional System: a user interacting with a centralised relational database system.


Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce
divides a task into small parts and assigns them to many computers. Later, the results are
collected at one place and integrated to form the result dataset.

5.2.2 How MapReduce Works?


The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
Fig 5.3 Working of MapReduce: input splits pass through an input phase and a Map phase;
the intermediate keys are grouped by key (with an optional Combiner), shuffled and sorted,
and then handed to the Reducer phase, which produces the output.



Input Phase - Here we have a Record Reader that translates each record in an input
file and sends the parsed data to the mapper in the form of key-value pairs.
Map - Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
Intermediate Keys - The key-value pairs generated by the mapper are known as
intermediate keys.
Combiner - A combiner is a type of local Reducer that groups similar data from the
map phase into identifiable sets. It takes the intermediate keys from the mapper as
input and applies user-defined code to aggregate the values in the small scope of one
mapper. It is not a part of the main MapReduce algorithm; it is optional.
Shuffle and Sort - The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the Reducer is
running. The individual key-value pairs are sorted by key into a larger data list. The
data list groups the equivalent keys together so that their values can be iterated
easily in the Reducer task.
Reducer - The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered,
and combined in a number of ways, and it requires a wide range of processing. Once
the execution is over, it gives zero or more key-value pairs to the final step.
Output Phase - In the output phase, we have an output formatter that translates the
final key-value pairs from the Reducer function and writes them onto a file using a
record writer.
5.2.3 Example: A word-count example is given below:
Fig 5.4 Word-count example: the input "Deer Bear River / Car Car River / Deer Car Bear" is
split line by line; mapping emits (word, 1) pairs such as (Deer, 1) and (Bear, 1); shuffling
groups the pairs by key, e.g. Bear: (1,1) and Car: (1,1,1); reducing sums each group; and
the final result is Bear 2, Car 3, Deer 2, River 2.
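The same word-count flow can be expressed against the standard Hadoop MapReduce Java API.
The sketch below is illustrative rather than taken from these notes; the class names
(WordCount, TokenizerMapper, IntSumReducer) are our own choices, and the input/output
paths are supplied on the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts for each word, e.g. Bear: (1,1) -> 2.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local reduce
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the reducer class is also registered as the combiner: this is valid here because
summing counts is associative and commutative, so partial sums computed locally at each
mapper produce the same final result.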

5.3 Virtual Box:

VirtualBox is a general-purpose virtualization tool for x86 and x86-64 hardware, targeted at
server, desktop, and embedded use, that allows users and administrators to easily run
multiple guest operating systems on a single host.

5.3.1 Installing VirtualBox

To start with, we will download VirtualBox and install it. We should follow the steps given
below for the installation.

Step 1 - To download VirtualBox, click on the following link:
https://www.virtualbox.org/wiki/Downloads. Now, depending on your OS, select which
version to install. In our case, it will be the first one (Windows host).
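As an aside (not part of the original installation steps), once installed, VirtualBox can
also be driven from its VBoxManage command-line tool. The VM name, OS type, and resource
sizes below are placeholder choices for illustration:

VBoxManage createvm --name "demo-vm" --ostype Ubuntu_64 --register   # create and register a new VM
VBoxManage modifyvm "demo-vm" --memory 2048 --cpus 2                 # give it 2 GB RAM and 2 vCPUs
VBoxManage startvm "demo-vm" --type headless                         # boot it without a GUI window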

