Index
Chapter Chapter Name
No.
Module
No.
1 | Introduction to Big Data and Hadoop
2 | Hadoop HDFS and MapReduce
3 | NoSQL
4 | Mining Data Streams
5 | Finding Similar Items and Clustering
6 | Real-Time Big Data Models
MODULE 1
Introduction to
Big Data and Hadoop
Introduction to Big Data
Types of Big Data
Traditional vs. Big Data business approach
Case Study of Big Data Solutions
Concept of Hadoop
Core Hadoop Components, Hadoop Ecosystem
What is Big Data?
Characteristics of Big Data
Traditional versus Big Data approach
Hadoop Common Package
Hadoop Distributed File System
Hadoop MapReduce
Yet Another Resource Negotiator (YARN)
What is YARN?
Hadoop Ecosystem
Big data is the term for large and complex datasets, measured in terabytes (10^12 bytes) and beyond, that are generated at high speed and collected in varied formats. Businesses and governments gather an enormous amount of data through transactions, data tracking, and mobile devices. The term encompasses not only the data itself but also the processes used to extract insights from it.
What is Big Data?
Big data refers to the massive datasets that are collected from various sources for business needs, to reveal new insights for optimized decision making. According to IBM sources, business and consumer life create 2.5 exabytes of data per day. It was predicted that data on the scale of zettabytes (10^21 bytes) would be produced by 2015, with 90% of it coming from the last 5 years. These data are required for analysis to reveal hidden correlations and patterns, which is called Big Data Analytics.
* Storage: Personal computers (PCs) can hold 500 GB of data each, so storing data at this scale would require a very large number of them. Google stores data in millions of servers around the world. Every day, millions of text messages are sent, and Facebook has millions of users who share content, photos, and videos.
Ie
lrmation chat
the bases,
+ Big Data Analytics is shown in Fig. 1. Big data is the result of three major trends in computing: Mobile Computing, Social Networking, and Cloud Computing.
Fig. 1: Big Data: Result of three computing trends (Mobile Computing, Social Networking, and Cloud Computing)

Challenges and Considerations:
Working with big data presents several challenges:
Storage: Storing massive volumes of data requires scalable and cost-effective storage solutions. Traditional databases may not be sufficient, leading to the adoption of distributed file systems like Hadoop Distributed File System (HDFS) or cloud storage options.
Processing: Processing large-scale data necessitates parallel and distributed computing techniques. Technologies like Apache Hadoop, Apache Spark, and data streaming frameworks are commonly used to process big data in a distributed and scalable manner.
Analysis: Advanced analytics techniques, including machine learning, natural language processing, and data mining, are employed to extract insights from big data. These techniques help uncover patterns, trends, and anomalies that would be challenging to identify using traditional approaches.
Security: Organizations must implement robust security measures and comply with relevant data protection regulations.
10. Meteorology: Weather sensors and satellites all over the globe help collect large volumes of data to track climate conditions. Meteorologists extensively use Big Data to study the patterns of natural disasters, prepare weather forecasts, and the like.
11. Education: Many educational institutions have embraced the usage of Big Data for improving curricula, attracting the best talent, and reducing rates of dropouts by improving student outcomes, targeting global recruiting, and optimizing the overall student experience.
Opportunities and Impact
Big data has the potential to bring significant opportunities and impact:
Business Insights: Big data analytics helps organizations gain valuable insights into customer behavior, market trends, and operational efficiencies. It enables data-driven decision making and aids in optimizing processes, improving customer experience, and identifying new business opportunities.
Healthcare: Big data analytics enhances medical research and personalized medicine. Analyzing large patient datasets, genomic data, and monitoring data leads to better disease diagnosis, treatment, and prevention.
Smart Cities: Big data technologies can facilitate the development of smart cities. By leveraging data from sensors, IoT devices, and social media, cities can improve energy management, waste management, and public infrastructure.
Scientific Research: Big data plays a role in scientific research, leading to new insights.

So we can say that big data represents the vast amount of data generated in our digital world. It poses unique challenges but also presents opportunities for organizations and society as a whole. Effectively harnessing and analyzing big data can lead to valuable insights, innovation, and improved decision making across domains.

Volume: Big data involves massive volumes of data that exceed the capacity of traditional data storage and processing systems. It can range from terabytes (10^12 bytes) to petabytes (10^15 bytes) or even exabytes (10^18 bytes) and beyond. This immense volume challenges traditional data management techniques.
Velocity: Big data is generated and collected at high speeds, often in real time or near real time. Social media streams, financial markets, Internet of Things (IoT) devices, and other sources produce data at an unprecedented velocity. The velocity of big data requires efficient processing and analysis methods to derive timely insights.
Variety: Big data encompasses a wide variety of data types and formats. It includes structured data (e.g., relational databases), semi-structured data (e.g., XML, JSON), and unstructured data.
+ A Structured Query Language (SQL) is needed to bring the data together. Structured data is easy to enter, query, and analyze. All of the data follows the same format. However, forcing a consistent structure also means that any alteration of data is difficult, as each record has to be updated to adhere to the new structure.
+ Examples of structured data include numbers, dates, strings, etc.
ata ofan e
pais output for th
sep
+ A Reduce task processes the output of a map task. Similar to the map stage, all reduce tasks occur at the same time, and they work independently. The data is aggregated and combined to deliver the desired output. The final result is a reduced set of <key, value> pairs which MapReduce, by default, stores in HDFS.
Fig. 2.4.3: How Hadoop Map and Reduce Work Together
The final output we are looking for is how many times the words Apache, Hadoop, Class, and Track appear in total in all the documents.
First, in the map stage, the input data (the six documents) is split and distributed across the cluster (the three servers). In this case, each map task works on a split containing two documents. During mapping, there is no communication between the nodes. They perform independently.
Then, map tasks create a <key, value> pair for every word. These pairs show how many times a word occurs. A word is a key, and a value is its count. For example, one document contains three of the four words we are looking for: Apache 7 times, Class 8 times, and Track 6 times. The key-value pairs in one map task output look like this:
<apache, 7> <class, 8> <track, 6>
This process is done in parallel tasks on all nodes for all documents and gives a unique output.
After input splitting and mapping complete, the outputs of every map task are shuffled. This is the first step of the Reduce stage. Since we are looking for the frequency of occurrence of four words, there are four parallel Reduce tasks. The reduce tasks can run on the same nodes as the map tasks, or they can run on any other node.
The shuffle step ensures the keys Apache, Hadoop, Class, and Track are sorted for the reduce step. This process groups the values by keys in the form of <key, value-list> pairs.
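The map, shuffle, and reduce steps described above can be imitated on a single machine. The following is only a minimal sketch in plain Python, not Hadoop code: the six sample documents, the function names map_task, shuffle, and reduce_task, and the choice to count only the four tracked words are all assumptions made for illustration.

```python
from collections import defaultdict

# The four words we are counting (taken from the example in the text).
WORDS = {"apache", "hadoop", "class", "track"}


def map_task(documents):
    """Map step: emit one <word, count> pair per tracked word in this split."""
    counts = defaultdict(int)
    for doc in documents:
        for token in doc.lower().split():
            if token in WORDS:
                counts[token] += 1
    return list(counts.items())


def shuffle(map_outputs):
    """Shuffle step: group values by key and sort the keys."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_task(key, values):
    """Reduce step: turn a <key, value-list> pair into a final <key, total> pair."""
    return key, sum(values)


# Six made-up documents, split two per map task (three splits, as in the text).
documents = [
    "apache hadoop class",
    "track apache apache",
    "hadoop hadoop track",
    "class class apache",
    "track hadoop",
    "apache class track track",
]
splits = [documents[i:i + 2] for i in range(0, len(documents), 2)]

map_outputs = [map_task(split) for split in splits]   # would run in parallel
grouped = shuffle(map_outputs)                        # <key, value-list> pairs
result = dict(reduce_task(k, v) for k, v in grouped.items())
print(result)
```

In a real cluster each call to map_task and reduce_task would run as a separate task on some node; here they simply run one after another.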
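Files are read as data for some calculation, and additional data is appended to files from time to time.
Files are divided into chunks, which are typically 64 megabytes in size. Chunks are replicated at different compute nodes. Moreover, the nodes holding copies of one chunk should be placed apart so that a single failure does not destroy every copy. Normally, both the chunk size and the degree of replication can be decided by the user.
To find the chunks of a file, there is another small file called the master node or name node for that file. The master node is itself replicated, and a directory for the file system as a whole knows where to find its copies. The directory itself can be replicated, and all participants using the DFS know where the directory copies are.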
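The chunking and replication idea can be illustrated with a small calculation. This is a hedged sketch only: the replication degree of 3, the round-robin placement rule, the node names, and the helper names split_into_chunks and place_replicas are assumptions for illustration and do not describe how any particular distributed file system actually places replicas.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical chunk size mentioned above
REPLICATION = 3                # assumed degree of replication (user-configurable)


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return a list of (offset, length) tuples, one per chunk of the file."""
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


def place_replicas(num_chunks, nodes, replication=REPLICATION):
    """Assign each chunk to `replication` distinct nodes in round-robin order.

    Assumes len(nodes) >= replication so that the replicas of a chunk
    never land on the same node.
    """
    return {
        chunk: [nodes[(chunk + r) % len(nodes)] for r in range(replication)]
        for chunk in range(num_chunks)
    }


file_size = 200 * 1024 * 1024  # a hypothetical 200 MB file
chunks = split_into_chunks(file_size)
placement = place_replicas(len(chunks), ["node1", "node2", "node3", "node4"])

for chunk_id, holders in placement.items():
    print(chunk_id, chunks[chunk_id], holders)
```

The placement dictionary plays the role of the small "master" record for the file: given a chunk number, it says which nodes hold a copy.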
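As the processing component, MapReduce is the heart of Apache Hadoop. The term "MapReduce" refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into key-value pairs. This model allows processing to scale across hundreds or thousands of servers in a Hadoop cluster.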
In the early days of Hadoop (version 1), JobTracker and TaskTracker daemons ran operations in MapReduce. At the time, a Hadoop cluster could only support MapReduce applications.
* A JobTracker controlled the distribution of application requests to the compute resources in a cluster. Since it monitored the execution and the status of MapReduce, it resided on a master node.
* A TaskTracker processed the requests that came from the JobTracker.
* All task trackers were distributed across the slave nodes in a Hadoop cluster.
The tasks should be big enough to justify the task handling time. If you divide a job into unusually small segments, the total time to prepare the splits and create the tasks may outweigh the time needed to produce the actual job output.
Another way is horizontal scalability: add more machines to the cluster. With the horizontal way, one can add more nodes to the cluster on the fly, without any downtime.
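Data Integrity
Data integrity refers to the correctness of data. HDFS ensures data integrity by constantly checking the data against the checksum calculated during the write of the file.
While reading a file, if the checksum does not match the original checksum, the data is said to be corrupted. The client then opts to retrieve the data block from another DataNode that has a replica of that block. The NameNode discards the corrupted block and creates an additional new replica.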
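The checksum check described above can be sketched in a few lines. This is an illustrative model only, not HDFS code: it uses Python's standard zlib.crc32 as the checksum, represents each replica as a small dictionary, and the names write_block and read_block are hypothetical.

```python
import zlib


def write_block(data: bytes) -> dict:
    """Store a block together with the checksum computed at write time."""
    return {"data": data, "checksum": zlib.crc32(data)}


def read_block(replicas):
    """Return the data of the first healthy replica.

    Also returns the indexes of replicas whose recomputed checksum did not
    match the stored one, so that they can be discarded and re-replicated.
    """
    corrupted = []
    for index, block in enumerate(replicas):
        if zlib.crc32(block["data"]) == block["checksum"]:
            return block["data"], corrupted
        corrupted.append(index)
    raise IOError("all replicas are corrupted")


original = b"big data block"
replicas = [write_block(original) for _ in range(3)]

# Simulate corruption of the first replica after it was written.
replicas[0]["data"] = b"big dXta block"

data, corrupted = read_block(replicas)
print(data, corrupted)
```

The reader falls back to the second replica, and the list of corrupted indexes stands in for the report that lets the system create a fresh replica.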
Mh races
dnp HOPS slr uid ation
rally oo citer efsons Ringer
Cech tae totes cae To latino hie rol taken ofr
1. Files must be stored redundantly. If we did not duplicate the file at several compute nodes, then if one node failed, all its files would be unavailable until the node is replaced. If we did not back up the files at all, and the disk crashes, the files would be lost forever.
2. Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks.
MapReduce is an algorithm in which a huge program is subdivided into small tasks that run in parallel to make computation faster and save time.
MapReduce organizes data as key-value pairs. For example, in a dictionary you search for the word "Data", and its associated meaning is "facts and statistics collected together for reference or analysis". Here the key is "Data" and the value associated with it is "facts and statistics collected together for reference or analysis".
The Reducer is responsible for processing data in parallel and producing the final output.
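Algorithm 1: The map function
1 for each element m_ij of M do
2     produce (key, value) pairs as ((i, k), (M, j, m_ij)), for k = 1, 2, 3, ... up to the number of columns of N
3 for each element n_jk of N do
4     produce (key, value) pairs as ((i, k), (N, j, n_jk)), for i = 1, 2, 3, ... up to the number of rows of M
5 return the set of (key, value) pairs in which each key (i, k) has a list with values (M, j, m_ij) and (N, j, n_jk) for all possible values of j

Algorithm 2: The reduce function
1 for each key (i, k) do
2     sort values beginning with M by j in list_M
3     sort values beginning with N by j in list_N
4     multiply m_ij and n_jk for the j-th value of each list
5     sum up m_ij * n_jk
6 return (i, k), sum over j of m_ij * n_jk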
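The two algorithms above can be tried out on a single machine. The following is a minimal sketch under stated assumptions: it runs in plain Python rather than on Hadoop, the function names map_matrices, shuffle, and reduce_cell are illustrative, the two 2x2 sample matrices are example input, and the reducer assumes dense matrices (every index j is present in both lists).

```python
from collections import defaultdict


def map_matrices(M, N):
    """Algorithm 1: emit ((i, k), (name, j, value)) pairs for M and N."""
    rows_m, cols_n = len(M), len(N[0])
    pairs = []
    for i, row in enumerate(M, start=1):
        for j, m_ij in enumerate(row, start=1):
            for k in range(1, cols_n + 1):        # one copy per column of N
                pairs.append(((i, k), ("M", j, m_ij)))
    for j, row in enumerate(N, start=1):
        for k, n_jk in enumerate(row, start=1):
            for i in range(1, rows_m + 1):        # one copy per row of M
                pairs.append(((i, k), ("N", j, n_jk)))
    return pairs


def shuffle(pairs):
    """Group all values that share the same key (i, k)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_cell(values):
    """Algorithm 2: sort both lists by j, multiply pairwise, and sum."""
    list_m = sorted((j, v) for name, j, v in values if name == "M")
    list_n = sorted((j, v) for name, j, v in values if name == "N")
    return sum(m * n for (_, m), (_, n) in zip(list_m, list_n))


M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]

grouped = shuffle(map_matrices(M, N))
product = {key: reduce_cell(values) for key, values in grouped.items()}
print(sorted(product.items()))
```

Each key (i, k) corresponds to one cell of the product matrix, so the reducers for different cells are independent and could run in parallel.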
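Here matrix A is a 2x2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2. Matrix B is also a 2x2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of the matrices is labelled Aij and Bjk. For example, element 3 in matrix A is called A21, i.e. 2nd row, 1st column. One-step matrix multiplication has 1 mapper and 1 reducer. The formula is:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i
Therefore, computing the mapper for Matrix A:
+ k, i, and j give the number of times each element occurs.
+ Here all are 2; therefore when k = 1, i can have 2 values, 1 and 2, and each case can have 2 further values, j = 1 and j = 2. Substituting all values in the formula:
k=1  i=1  j=1  ((1, 1), (A, 1, 1))
          j=2  ((1, 1), (A, 2, 2))
     i=2  j=1  ((2, 1), (A, 1, 3))
          j=2  ((2, 1), (A, 2, 4))
k=2  i=1  j=1  ((1, 2), (A, 1, 1))
          j=2  ((1, 2), (A, 2, 2))
     i=2  j=1  ((2, 2), (A, 1, 3))
          j=2  ((2, 2), (A, 2, 4))
Computing the mapper for Matrix B:
i=1  j=1  k=1  ((1, 1), (B, 1, 5))
          k=2  ((1, 2), (B, 1, 6))
     j=2  k=1  ((1, 1), (B, 2, 7))
          k=2  ((1, 2), (B, 2, 8))
i=2  j=1  k=1  ((2, 1), (B, 1, 5))
          k=2  ((2, 2), (B, 1, 6))
     j=2  k=1  ((2, 1), (B, 2, 7))
          k=2  ((2, 2), (B, 2, 8))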
The formula for the Reducer is:
Reducer(k, v) = (i, k) => Make sorted Alist and Blist
(i, k) => Summation (Aij * Bjk) for j
Output => ((i, k), sum)
Therefore, computing the reducer:
From the Mapper computation we can observe that 4 keys are common: (1, 1), (1, 2), (2, 1), and (2, 2). Make a separate list for Matrix A and Matrix B, with adjoining values taken from the Mapper step above:
(1, 1) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(1*5) + (2*7)] = 19
(1, 2) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(1*6) + (2*8)] = 22
(2, 1) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(3*5) + (4*7)] = 43
(2, 2) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(3*6) + (4*8)] = 50
Therefore, the final matrix is:
19 22
43 50

Selection
Apply a condition C to each tuple in the relation and produce as output only those tuples that satisfy C.
The result of this selection is denoted by σ_C(R).
Selections really do not need the full power of MapReduce.
They can be done most conveniently in the map portion alone, though they could also be done in the reduce portion alone.
The pseudo code is as follows:
Map(key, value)
    for tuple in value:
        if tuple satisfies C:
            emit(tuple, tuple)
Reduce(key, values)
    emit(key, key)

Projection
For some subset S of the attributes of the relation, produce from each tuple only the components for the attributes in S.
The result of this projection is denoted π_S(R).
Projection is performed similarly to selection.
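Selection and projection as described above can be sketched as map and reduce functions. This is an illustrative, single-machine model and not Hadoop API code: the tiny driver function run, the sample relation R, the example condition, and the use of tuple positions to name attributes are all assumptions. The projection reducer here also collapses duplicate tuples, which follows the usual set semantics of relational algebra rather than anything stated in the text.

```python
def select_map(key, tuples, condition):
    """Selection map: emit (t, t) for every tuple t that satisfies C."""
    for t in tuples:
        if condition(t):
            yield t, t


def select_reduce(key, values):
    """Selection reduce: identity, simply pass the tuple through."""
    yield key, key


def project_map(key, tuples, attribute_indexes):
    """Projection map: keep only the components for the attributes in S."""
    for t in tuples:
        projected = tuple(t[i] for i in attribute_indexes)
        yield projected, projected


def project_reduce(key, values):
    """Projection reduce: emit each distinct projected tuple once."""
    yield key, key


def run(map_fn, reduce_fn, relation, arg):
    """Hypothetical driver: map, group by key, then reduce."""
    grouped = {}
    for k, v in map_fn("R", relation, arg):
        grouped.setdefault(k, []).append(v)
    return [out for k, vs in grouped.items() for out, _ in reduce_fn(k, vs)]


# Sample relation R(id, label, amount); the values are made up.
R = [(1, "x", 10), (2, "y", 25), (3, "x", 30)]

selected = run(select_map, select_reduce, R, lambda t: t[2] > 20)
projected = run(project_map, project_reduce, R, [1])
print(selected)
print(projected)
```

Because the selection reducer is the identity, the same result could be produced by the map step alone, which matches the remark that selections do not need the full power of MapReduce.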
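In the reduce step of the Reduce stage, each of the four tasks processes a <key, value-list> pair to provide a final key-value pair. The reduce tasks also happen at the same time, and they work independently. In our example, the reduce tasks produce one total count for each of the four words.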
+ Most of the data can be stored in a semi-structured fashion.
+ Compromises on ACID concepts in favor of scalability and throughput.
+ Mostly uses asynchronous replication between distributed nodes: Asynchronous Multi-Master Replication, peer-to-peer, HDFS Replication.
+ Only provides eventual consistency.
Other features include wide data type variety, bulk upload, lower administration, distributed storage, and real-time analysis.