UNIT 5
5.1 Hadoop:
Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware. It provides massive storage for any kind of data, enormous
processing power and the ability to handle virtually limitless concurrent tasks or jobs.
5.1.1 History of Hadoop:
In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache
Nutch. It is an open-source web crawler software project.
While working on Apache Nutch, they were dealing with big data. Storing that
data would have cost a great deal, which became a serious obstacle for the
project. This problem became one of the important reasons for the emergence
of Hadoop.
In 2003, Google introduced a file system known as GFS (Google File System). It is a
proprietary distributed file system developed to provide efficient access to data.
In 2004, Google released a white paper on MapReduce. This technique simplifies
data processing on large clusters.
In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also includes MapReduce.
In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch
project, Doug Cutting introduced a new project, Hadoop, with a file system known as
HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this
year. Doug Cutting named his project Hadoop after his son's toy elephant.
In 2007, Yahoo ran two clusters of 1000 machines.
In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900-node
cluster within 209 seconds.
[Figure: Hadoop core components - MapReduce (Distributed Computation) running on top of HDFS (Distributed Storage).]
MapReduce - MapReduce is a parallel programming model for writing distributed applications,
devised at Google for efficient processing of large amounts of data (multi-terabyte datasets)
on large clusters.
Hadoop Common - These are Java libraries and utilities required by other Hadoop
modules.
Hadoop YARN - This is a framework for job scheduling and cluster resource
management.
5.1.3 Working of Hadoop:
Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128 MB or 64 MB (preferably 128 MB), as illustrated in the sketch after this list.
These files are then distributed across various cluster nodes for further processing.
" HDFS, being on top of the local file system, supervises the processing.
Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java-based.
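As a rough illustration of the block division described above, the short Java sketch below computes how a single file maps onto 128 MB blocks. The file size and class name are hypothetical and only for illustration; in practice HDFS performs this splitting automatically when a file is written.

// Illustrative sketch only: how a file maps onto fixed-size HDFS blocks.
public class BlockSplitDemo {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;   // preferred block size: 128 MB

    public static void main(String[] args) {
        long fileSize   = 300L * 1024 * 1024;             // hypothetical 300 MB file
        long fullBlocks = fileSize / BLOCK_SIZE;          // completely filled blocks
        long lastBlock  = fileSize % BLOCK_SIZE;          // bytes left for the final block

        System.out.println("Full 128 MB blocks : " + fullBlocks);   // prints 2
        System.out.println("Final block (bytes): " + lastBlock);    // prints 46137344 (44 MB)
        // So a 300 MB file occupies 3 blocks (128 MB + 128 MB + 44 MB),
        // and each block is then distributed across the cluster nodes.
    }
}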
5.2 MapReduce:
[Figure: a traditional centralised system - users interact with a single relational/centralised database server.]
[Figure: the MapReduce model - input is processed by mappers (Map phase), and the intermediate keys they produce are passed to reducers (Reducer phase).]
Reducer - The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered,
and combined in a number of ways, and it requires a wide range of processing. Once
the execution is over, it gives zero or more key-value pairs to the final step.
Output Phase - In the output phase, we have an output formatter that translates the
final key-value pairs from the Reducer function and writes them onto a file using a
record writer.
5.2.3 Example: A word count example is given below:
[Figure: Word count example - Input Splitting, Mapping, Shuffling, Reducing, Final Result.
Input: "Deer Bear River", "Car Car River", "Deer Car Bear".
Input Splitting (K1, V1): each line becomes one split.
Mapping list(K2, V2): (Deer,1) (Bear,1) (River,1); (Car,1) (Car,1) (River,1); (Deer,1) (Car,1) (Bear,1).
Shuffling K2, List(V2): Bear,(1,1); Car,(1,1,1); Deer,(1,1); River,(1,1).
Reducing list(K3, V3): Bear,2; Car,3; Deer,2; River,2.
Final Result: Bear 2, Car 3, Deer 2, River 2.]
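The flow in the figure above can be written against Hadoop's Java MapReduce API. The sketch below is a minimal version of the classic WordCount program (input and output paths are supplied on the command line; names such as TokenizerMapper and IntSumReducer are illustrative): the mapper emits (word, 1) pairs, the framework shuffles them by key, and the reducer sums each group.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: for every word in the input split, emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);          // e.g. (Deer, 1), (Bear, 1), (River, 1)
            }
        }
    }

    // Reduce phase: input is the shuffled (word, [1, 1, ...]); sum the list.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);            // e.g. (Bear, 2), (Car, 3)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar and run as, for example, hadoop jar wordcount.jar WordCount /input /output over the three sample lines, the job would write the final result shown in the figure: Bear 2, Car 3, Deer 2, River 2.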
VirtualBox is a general-purpose virtualization tool for x86 and x86-64 hardware, targeted at
server, desktop, and embedded use, that allows users and administrators to easily run
multiple guest operating systems on a single host.
To start with, we will download VirtualBox and install it. We should follow the steps given
below for the installation.