Apache Hadoop
Module 4: MapReduce
Introduction
MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. This module explains the nature of this programming model and how it can be used to write programs which run in the Hadoop environment.
Goals for this Module:
- Understand functional programming as it applies to MapReduce
- Understand the MapReduce program flow
- Understand how to write programs for Hadoop MapReduce
- Learn about additional features of Hadoop designed to aid software development
Outline
1. Introduction
2. Goals for this Module
3. Outline
4. Prerequisites
5. MapReduce Basics
     1. Functional Programming Concepts
     2. List Processing
     3. Mapping Lists
     4. Reducing Lists
     5. Putting Them Together in MapReduce
     6. An Example Application: Word Count
     7. The Driver Method
6. MapReduce Data Flow
     1. A Closer Look
     2. Additional MapReduce Functionality
     3. Fault Tolerance
7. Checkpoint
8. More Tips
     1. Chaining Jobs
     2. Troubleshooting: Debugging MapReduce
     3. Listing and Killing Jobs
9. Additional Language Support
     1. Pipes
     2. Hadoop Streaming
10. Conclusions
11. Solution to Inverted Index Code
Prerequisites
This module requires that you have set up a working Hadoop environment as described in Module 3. If you have not already configured Hadoop and successfully run the example applications, go back and do so now.
MapReduce Basics
FUNCTIONAL PROGRAMMING CONCEPTS
MapReduce programs are designed to compute large volumes of data in a parallel fashion. This requires dividing the workload across a large number of machines. This model would not scale to large clusters (hundreds or thousands of nodes) if the components were allowed to share data arbitrarily. The communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale.
Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated. If in a mapping task you change an input (key, value) pair, the change does not get reflected back in the input files; communication occurs only by generating new output (key, value) pairs which are then forwarded by the Hadoop system into the next phase of execution.
LIST PROCESSING
Conceptual, MapReduce programs transform iss of input data elements no Hts of cup data elements
[A MapReduce program will doth twice, using vo diferent st processing iioms: map, and reduce. These
toms are taken from sovaral Ist processing languages such as LISP, Scheme, or ML
MAPPING LISTS
The first phase of a MapReduce program is called mapping. A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually to an output data element.
Figure 4.1: Mapping creates a new output list by applying a function to individual elements of an input list.
As an example of the utility of map: Suppose you had a function toUpper(str) which returns an uppercase version of its input string. You could use this function with map to turn a list of strings into a list of uppercase strings. Note that we are not modifying the input string here: we are returning a new string that will form part of a new output list.
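As a quick illustration outside of Hadoop, here is a minimal plain-Java sketch of the map idiom (the class and method names here are hypothetical, not part of any Hadoop API):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapIdiom {
  // Returns a new uppercase string; the input string is not modified.
  static String toUpper(String str) {
    return str.toUpperCase();
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("foo", "bar", "baz");
    List<String> output = new ArrayList<String>();
    for (String s : input) {
      output.add(toUpper(s));    // apply the mapping function to each element
    }
    System.out.println(output);  // prints [FOO, BAR, BAZ]
  }
}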
REDUCING LISTS
Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.
Figure 4.2: Reducing a list iterates over the input values to produce an aggregate value as output.
Reducing is often used to produce "summary" data, turning a large volume of data into a smaller summary of itself. For example, "+" can be used as a reducing function, to return the sum of a list of input values.
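The reduce idiom can be sketched the same way in plain Java (again, a hypothetical standalone example rather than Hadoop code):

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceIdiom {
  public static void main(String[] args) {
    List<Integer> values = Arrays.asList(4, 8, 15, 16, 23, 42);
    // "+" acts as the reducing function, folding the iterator into one sum.
    int sum = 0;
    Iterator<Integer> it = values.iterator();
    while (it.hasNext()) {
      sum += it.next();
    }
    System.out.println(sum);  // prints 108
  }
}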
PUTTING THEM TOGETHER IN MAPREDUCE
The Hadoop MapReduce framework takes these concepts and uses them to process large volumes of information. A MapReduce program has two components: one that implements the mapper, and another that implements the reducer. The Mapper and Reducer idioms described above are extended slightly to work in this environment, but the basic principles are the same.
Keys and values: In MapReduce, no value stands on its own. Every value has a key associated with it. Keys identify related values. For example, a log of time-coded speedometer readings from multiple cars could be keyed by license-plate number; it would look like:

AAA-123   65mph, 12:00pm
ZZZ-789   50mph, 12:02pm
AAA-123   40mph, 12:05pm
CCC-456   25mph, 12:15pm
...
The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value must be emitted to the next list in the data flow. MapReduce is also less strict than other languages about how the Mapper and Reducer work. In more formal functional mapping and reducing settings, a mapper must produce exactly one output element for each input element, and a reducer must produce exactly one output element for each input list. In MapReduce, an arbitrary number of values can be output from each phase; a mapper may map one input into zero, one, or one hundred outputs. A reducer may compute over an input list and emit one or a dozen different outputs.
Keys divide the reduce space: A reducing function turns a large list of values into one (or a few) output values. In MapReduce, all of the output values are not usually reduced together. All of the values with the same key are presented to a single reducer together. This is performed independently of any reduce operations occurring on other lists of values, with different keys attached.
Figure 4.3: Different colors represent different keys. All values with the same key are presented to a single reduce task.
AN EXAMPLE APPLICATION: WORD COUNT
A simple MapReduce program can be written to determine how many times different words appear in a set of files. For example, if we had the files:

foo.txt: Sweet, this is the foo file
bar.txt: This is the bar file

We would expect the output to be:

sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2

Naturally, we can write a program in MapReduce to compute this output. The high-level structure would look like this:
mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)

Listing 4.1: High-Level MapReduce Word Count
Several instances of the mapper function are created on the different machines in our cluster. Each instance receives a different input file (it is assumed that we have many such files). The mappers output (word, 1) pairs which are then forwarded to the reducers. Several instances of the reducer method are also instantiated on the different machines. Each reducer is responsible for processing the list of values associated with a different word. The list of values will be a list of 1's; the reducer sums up those ones into a final count associated with a single word. The reducer then emits the final (word, count) output which is written to an output file.
We can write a very similar program to this in Hadoop MapReduce; it is included in the Hadoop distribution in src/examples/org/apache/hadoop/examples/WordCount.java. It is partially reproduced below:
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

// A reducer class that just emits the sum of the input values
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Listing 4.2: Hadoop MapReduce Word Count Source
There are some minor differences between this actual Java implementation and the pseudo-code shown above. First, Java has no native emit keyword; the OutputCollector object you are given as an input will receive values to emit to the next stage of execution. And second, the default input format used by Hadoop presents each line of an input file as a separate input to the mapper function, not the entire file at a time. It also uses a StringTokenizer object to break up the line into words. This does not perform any normalization of the input, so "cat", "Cat" and "cat." are all regarded as different strings. Note that the class-variable word is reused each time the mapper outputs another (word, 1) pairing; this saves time by not allocating a new variable for each output. The output.collect() method will copy the values it receives as input, so you are free to overwrite the variables you reuse.
THE DRIVER METHOD
There is one final component of a Hadoop MapReduce program, called the Driver. The driver initializes the job and instructs the Hadoop platform to execute your code on a set of input files, and controls where the output files are placed. A cleaned-up version of the driver from the example Java implementation that comes with Hadoop is presented below:
public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  // the keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // the values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(Reduce.class);

  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));

  JobClient.runJob(conf);
}
Listing 4.3: Hadoop MapReduce Word Count Driver
This method sets up a job to execute the word count program across all the files in a given input directory (the inputPath argument). The output from the reducers is written into files in the directory identified by outputPath. The configuration information for the job is captured in the JobConf object. The mapping and reducing functions are identified by the setMapperClass() and setReducerClass() methods. The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these. The input types fed to the mapper are controlled by the InputFormat used. Input formats are discussed in more detail below. The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.
The call to JobClient.runJob(conf) will submit the job to MapReduce. This call will block until the job completes. If the job fails, it will throw an IOException. JobClient also provides a non-blocking version called submitJob().
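For example, a minimal sketch of non-blocking submission (reusing the conf object from Listing 4.3, inside a method that may throw Exception; the five-second polling interval is an arbitrary choice):

JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);  // returns without waiting
while (!job.isComplete()) {
  Thread.sleep(5000);                     // poll until the job finishes
}
if (!job.isSuccessful()) {
  System.err.println("Job failed!");
}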
MapReduce Data Flow
"Now that we have een the components that make up @ baste
works together at higher lave
Haoop Tutorial - YDN
MapReduce jo, we can see how everything
Figure 4.4: High-level MapReduce pipeline
MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all our nodes. Running a MapReduce program involves running mapping tasks on many or all of the nodes in our cluster. Each of these mapping tasks is equivalent: no mappers have particular "identities" associated with them. Therefore, any mapper can process any input file. Each mapper loads the set of files local to that machine and processes them.
When the mapping phase has completed, the intermediate (key, value) pairs must be exchanged between machines to send all values with the same key to a single reducer. The reduce tasks are spread across the same nodes in the cluster as the mappers. This is the only communication step in MapReduce. Individual map tasks do not exchange information with one another, nor are they aware of one another's existence. Similarly, different reduce tasks do not communicate with one another. The user never explicitly marshals information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability. If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side-effects, e.g., communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side-effects, restarts can be handled more gracefully.
A CLOSER LOOK
The previous figure described the high-level view of Hadoop MapReduce. From this diagram, you can see where the mapper and reducer components of the Word Count application fit in, and how it achieves its objective. We will now examine this system in a bit closer detail.
Figure 4.5: Detailed Hadoop MapReduce data flow
Figure 4.5 shows the pipeline with more of its mechanics exposed. While only two nodes are depicted, the same pipeline can be replicated across a very large number of nodes. The next several paragraphs describe each of the stages of a MapReduce program more precisely.
Input files: This is where the data for a MapReduce task is initially stored. While this does not need to be the case, the input files typically reside in HDFS. The format of these files is arbitrary: while line-based log files can be used, we could also use a binary format, multi-line input records, or something else entirely. It is typical for these input files to be very large -- tens of gigabytes or more.
InputFormat: How these input files are split up and read is defined by the InputFormat. An InputFormat is a class that provides the following functionality:

- Selects the files or other objects that should be used for input
- Defines the InputSplits that break a file into tasks
- Provides a factory for RecordReader objects that read the file
Several InputFormats are provided with Hadoop. An abstract type is called FileInputFormat; all InputFormats that operate on files inherit functionality and properties from this class. When starting a Hadoop job, FileInputFormat is provided with a path containing files to read. The FileInputFormat will read all files in this directory. It then divides these files into one or more InputSplits each. You can choose which InputFormat to apply to your input files for a job by calling the setInputFormat() method of the JobConf object that defines the job. A table of standard InputFormats is given below.
InputFormat: | Description: | Key: | Value:
TextInputFormat | Default format; reads lines of text files | The byte offset of the line | The line contents
KeyValueInputFormat | Parses lines into (key, val) pairs | Everything up to the first tab character | The remainder of the line
SequenceFileInputFormat | A Hadoop-specific high-performance binary format | user-defined | user-defined

Table 4.1: InputFormats provided by MapReduce
The default InputFormat is the TextInputFormat. This treats each line of each input file as a separate record, and performs no parsing. This is useful for unformatted data or line-based records like log files. A more interesting input format is the KeyValueInputFormat. This format also treats each line of input as a separate record. While the TextInputFormat treats the entire line as the value, the KeyValueInputFormat breaks the line itself into the key and value by searching for a tab character. This is particularly useful for reading the output of one MapReduce job as the input to another, as the default OutputFormat (described in more detail below) formats its results in this manner. Finally, the SequenceFileInputFormat reads special binary files that are specific to Hadoop. These files include many features designed to allow data to be rapidly read into Hadoop mappers. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.
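For example, a driver can select the key/value format with a single call (a sketch; in the classic org.apache.hadoop.mapred API the concrete class is named KeyValueTextInputFormat):

conf.setInputFormat(KeyValueTextInputFormat.class);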
InputSplits: An InputSplit describes a unit of work that comprises a single map task in a MapReduce program. A MapReduce program applied to a data set, collectively referred to as a Job, is made up of several (possibly several hundred) tasks. Map tasks may involve reading a whole file; they often involve reading only part of a file. By default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS). You can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job. By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallelism. Even more importantly, since the various blocks that make up the file may be spread across several different nodes in the cluster, it allows tasks to be scheduled on each of these different nodes; the individual blocks are thus all processed locally, instead of needing to be transferred from one node to another. Of course, while log files can be processed in this piece-wise fashion, some file formats are not amenable to chunked processing. By writing a custom InputFormat, you can control how the file is broken up (or is not broken up) into splits. Custom input formats are described in Module 5.
The InputFormat defines the list of tasks that make up the mapping phase; each task corresponds to a single input split. The tasks are then assigned to the nodes in the system based on where the input file chunks are physically resident. An individual node may have several dozen tasks assigned to it. The node will begin working on the tasks, attempting to perform as many in parallel as it can. The on-node parallelism is controlled by the mapred.tasktracker.map.tasks.maximum parameter.
RecordReader: The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat. The default InputFormat, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file. The RecordReader is invoked repeatedly on the input until the entire InputSplit has been consumed. Each invocation of the RecordReader leads to another call to the map() method of the Mapper.
Mapper: The Mapper performs the interesting user-defined work of the first phase of the MapReduce program. Given a key and a value, the map() method emits (key, value) pair(s) which are forwarded to the Reducers. A new instance of Mapper is instantiated in a separate Java process for each map task (InputSplit) that makes up part of the total job input. The individual mappers are intentionally not provided with a mechanism to communicate with one another in any way. This allows the reliability of each map task to be governed solely by the reliability of the local machine. The map() method receives two parameters in addition to the key and the value; a usage sketch follows this list:

- The OutputCollector object has a method named collect() which will forward a (key, value) pair to the reduce phase of the job.
- The Reporter object provides information about the current task; its getInputSplit() method will return an object describing the current InputSplit. It also allows the map task to provide additional information about its progress to the rest of the system. The setStatus() method allows you to emit a status message back to the user. The incrCounter() method allows you to increment shared performance counters. You may define as many arbitrary counters as you wish. Each mapper can increment the counters, and the JobTracker will collect the increments made by the different processes and aggregate them for later retrieval when the job ends.
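A minimal sketch of these parameters in use (this method drops into the MapClass of Listing 4.2; the counter group and counter names are arbitrary examples):

public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output,
    Reporter reporter) throws IOException {
  reporter.setStatus("processing byte offset " + key.get());
  reporter.incrCounter("WordCount", "input-lines", 1);
  // ... tokenize the line and output.collect() the (word, one)
  // pairs exactly as in Listing 4.2 ...
}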
Parton & Shu: Aftor tho fst map tasks have completed, the nodes may stillb pearing several
‘more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to whore
thoy are requlea by the reduces. This process of moving map outputs to the reducers fs known a8 shutting
Aaferent subset ofthe intermediate key space is essigned to each reduce node; these subsets known as
“prttons) are the inputs to the reduce tasks, Each map ask may emt key, vale) pas to any parton: all
‘valves fr the same key are always reduced together regardloss of which mapper is is origin. Therefore, the
‘map nodes mut all agree on where to send he fren pores of the intermodite data, The Patton class
determines hic partion a given (key, value) pa wil go to. The default prions computes @ hash value
forth key and assigns the partion basad onthe result, Custom pattioners are deserbed in mors deal
Module 6
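As an illustration, the default behavior can be written out as a custom Partitioner (this mirrors the logic of Hadoop's default HashPartitioner; the class name here is our own):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) { }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Hash the key, clear the sign bit, and map the result onto
    // one of the numPartitions reduce partitions.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

A custom partitioner such as this would be registered in the driver with conf.setPartitionerClass(WordPartitioner.class).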
Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
Reduce: A Reducer instance is created for each reduce task. This is an instance of user-provided code that performs the second important phase of job-specific work. For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. This receives a key as well as an iterator over all the values associated with the key. The values associated with a key are returned by the iterator in an undefined order. The Reducer also receives the OutputCollector and Reporter objects as parameters; they are used in the same manner as in the map() method.
OutputFormat: The (key, value) pairs provided to this OutputCollector are then written to output files. The way they are written is governed by the OutputFormat. The OutputFormat functions much like the InputFormat class described earlier. The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS; they all inherit from a common FileOutputFormat. Each Reducer writes a separate file in a common output directory. These files will typically be named part-nnnnn, where nnnnn is the partition id associated with the reduce task. The output directory is set by the FileOutputFormat.setOutputPath() method. You can control which particular OutputFormat is used by calling the setOutputFormat() method of the JobConf object that defines your MapReduce job. A table of provided OutputFormats is given below.
OutputFormat: | Description
TextOutputFormat | Default; writes lines in "key \t value" form
SequenceFileOutputFormat | Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat | Disregards its inputs

Table 4.2: OutputFormats provided by Hadoop
Hadoop provides some OutputFormat instances to write to files. The basic (default) instance is TextOutputFormat, which writes (key, value) pairs on individual lines of a text file. This can be easily re-read by a later MapReduce task using the KeyValueInputFormat class, and is also human-readable. A better intermediate format for use between MapReduce jobs is the SequenceFileOutputFormat, which rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and presents the data to the next Mapper in the same manner as it was emitted by the previous Reducer. The NullOutputFormat generates no output files and disregards any (key, value) pairs passed to it by the OutputCollector. This is useful if you are explicitly writing your own output files in the reduce() method, and do not want additional empty output files generated by the Hadoop framework.
RecordWriter: Much like how the InputFormat actually reads individual records through the RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects; these are used to write the individual records to the files as directed by the OutputFormat.
The output files written by the Reducers are then left in HDFS for your use, either by another MapReduce job, a separate program, or for human inspection.
ADDITIONAL MAPREDUCE FUNCTIONALITY
Figure 4.6: Combiner step inserted into the MapReduce data flow
Combiner: The pipeline shown earlier omits a processing step which can be used for optimizing bandwidth usage by your MapReduce job. Called the Combiner, this pass runs after the Mapper and before the Reducer. Usage of the Combiner is optional. If this pass is suitable for your job, instances of the Combiner class are run on every node that has run map tasks. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers. The Combiner is a "mini-reduce" process which operates only on data generated by one machine.
Word count is a prime example of where a Combiner is useful. The Word Count program in listings 4.1--4.3 emits a (word, 1) pair for every instance of every word it sees. So if the same document contains the word "cat" 3 times, the pair ("cat", 1) is emitted three times; all of these are then sent to the Reducer. By using a Combiner, these can be condensed into a single ("cat", 3) pair to be sent to the Reducer. Now each node only sends a single value to the reducer for each word -- drastically reducing the total bandwidth required for the shuffle process, and speeding up the job. The best part of all is that we do not need to write any additional code to take advantage of this! If a reduce function is both commutative and associative, then it can be used as a Combiner as well. You can enable combining in the word count program by adding the following line to the driver:
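conf.setCombinerClass(Reduce.class);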
The Combiner should be an instance of the Reducer interface. If your Reducer itself cannot be used directly as a Combiner because of commutativity or associativity, you might still be able to write a third class to use as a Combiner for your job.
FAULT TOLERANCE
One of the primary reasons to use Hadoop to run your jobs is due to its high degree of fault tolerance. Even when running jobs on a large cluster where individual nodes or network components may experience high rates of failure, Hadoop can guide jobs toward a successful completion.
The primary way that Hadoop achieves fault tolerance is through restarting tasks. Individual task nodes (TaskTrackers) are in constant communication with the head node of the system, called the JobTracker. If a TaskTracker fails to communicate with the JobTracker for a period of time (by default, 1 minute), the JobTracker will assume that the TaskTracker in question has crashed. The JobTracker knows which map and reduce tasks were assigned to each TaskTracker.
If the job is still in the mapping phase, then other TaskTrackers will be asked to re-execute all map tasks previously run by the failed TaskTracker. If the job is in the reducing phase, then other TaskTrackers will re-execute all reduce tasks that were in progress on the failed TaskTracker.
Reduce tasks, once completed, have been written back to HDFS. Thus, if a TaskTracker has already completed two out of three reduce tasks assigned to it, only the third task must be executed elsewhere. Map tasks are slightly more complicated: even if a node has completed ten map tasks, the reducers may not have all copied their inputs from the output of those map tasks. If a node has crashed, then its mapper outputs are inaccessible. So any already-completed map tasks must be re-executed to make their results available to the rest of the reducing machines. All of this is handled automatically by the Hadoop platform.
This fault tolerance underscores the need for program execution to be side-effect free. If Mappers and Reducers had individual identities and communicated with one another or the outside world, then restarting a task would require the other nodes to communicate with the new instances of the map and reduce tasks, and the re-executed tasks would need to reestablish their intermediate state. This process is notoriously complicated and error-prone in the general case. MapReduce simplifies this problem drastically by eliminating task identities and the ability of task partitions to communicate with one another. An individual task sees only its own direct inputs and knows only its own outputs, making this failure and restart process clean and dependable.
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution.
When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively.
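For example, in the driver (a sketch using the JobConf convenience setters from the classic API, which set those same two properties):

// disable speculative execution for this job's map and reduce tasks
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(false);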
Checkpoint
You now know about all of the basic operations of the Hadoop MapReduce platform. Try the following exercise, to see if you understand the MapReduce programming concepts.
Exercise: Given the code for WordCount in listings 4.2 and 4.3, modify this code to produce an inverted index of its inputs. An inverted index returns a list of documents that contain each word in those documents. Thus, if the word "cat" appears in documents A and B, but not C, then the line:

cat A, B

should appear in the output. If the word "baseball" appears in documents B and C, then the line:

baseball B, C

should appear in the output as well.
If you get stuck, read the section on troubleshooting below. The working solution is provided at the end of this module.
Hint: The default InputFormat will provide the Mapper with (key, value) pairs where the key is the byte offset into the file, and the value is a line of text. To get the filename of the current input, use the following code:
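FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String fileName = fileSplit.getPath().getName();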
More Tips
CHAINING JOBS
"Not every problem can be slved with a MapReduce program, but fewest
with a single MapReduce job. Many problems can be solved wth MapRecuce, by wing several MapReduce
stops wich rn in series fo accomplish a goal
those which can be solved
Mapt > Reducet > Map2 > Redueo2 > Map
You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job. Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, then call the next driver method, which creates a new JobConf object referring to different instances of Mapper and Reducer, etc. The first job in the chain should write its output to a path which is then used as the input path for the second job. This process can be repeated for as many jobs as are necessary to arrive at a complete solution to the problem.
Many problems which at first seem impossible in MapReduce can be accomplished by dividing one job into two or more.
Hadoop provides another mechanism for managing batches of jobs with dependencies between jobs. Rather than submitting a JobConf to the JobClient's runJob() or submitJob() methods, org.apache.hadoop.mapred.jobcontrol.Job objects can be created to represent each job; a Job takes a JobConf object as its constructor argument. Jobs can depend on one another through the use of the addDependingJob() method. The code:
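x.addDependingJob(y)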
says that Job x cannot start until y has successfully completed. Dependency information cannot be added to a job after it has already been started. Given a set of jobs, these can be passed to an instance of the JobControl class. JobControl can receive individual jobs via the addJob() method, or a collection of jobs via addJobs(). The JobControl object will spawn a thread in the client to launch the jobs. Individual jobs will be launched when their dependencies have all successfully completed and when the MapReduce system as a whole has resources to execute the jobs. The JobControl interface allows you to query and retrieve the state of individual jobs, as well as the list of jobs waiting, ready, running, and finished. The job submission process does not begin until the run() method of the JobControl object is called.
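Putting this together, a minimal sketch (conf1 and conf2 are assumed to be already-configured JobConf objects, with conf2 reading conf1's output path; error handling omitted):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public static void runChain(JobConf conf1, JobConf conf2) throws Exception {
  Job job1 = new Job(conf1);
  Job job2 = new Job(conf2);
  job2.addDependingJob(job1);  // job2 will not launch until job1 succeeds

  JobControl jc = new JobControl("chained-jobs");
  jc.addJob(job1);
  jc.addJob(job2);

  // run() blocks, so launch the JobControl in its own thread,
  // then poll until every job in the group has completed.
  new Thread(jc).start();
  while (!jc.allFinished()) {
    Thread.sleep(5000);
  }
}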
TROUBLESHOOTING: DEBUGGING MAPREDUCE
When writing MapReduce programs, you will occasionally encounter bugs in your programs, infinite loops, etc. This section describes the features of MapReduce that will help you diagnose and solve these conditions.
Log Files: Hadoop keeps logs of important events during program execution. By default, these are stored in the logs/ subdirectory of the hadoop-version/ directory where you ran Hadoop from. Log files are named hadoop-username-service-hostname.log. The most recent data is in the .log file; older logs have their date appended to them. The username in the log filename refers to the username under which Hadoop was started -- this is not necessarily the same username you are using to run programs. The service name refers to which of the several Hadoop programs is writing the log; these can be jobtracker, namenode, datanode, secondarynamenode, or tasktracker. All of these are important for debugging a whole Hadoop installation. But for individual programs, the tasktracker logs will be the most relevant. Any exceptions thrown by your program will be recorded in the tasktracker logs.
The log directory will also have a subdirectory called userlogs. Here there is another subdirectory for every task run. Each task records its stdout and stderr to two files in this directory. Note that on a multi-node Hadoop cluster, these logs are not centrally aggregated -- you should check each TaskNode's logs/userlogs/ directory for their output.
Debugging in the distributed setting is complicated and requires logging into several machines to access log data. If possible, programs should be unit tested by running Hadoop locally. The default configuration deployed by Hadoop runs in "single instance" mode, where the entire MapReduce program is run in the same instance of Java as called JobClient.runJob(). Using a debugger like Eclipse, you can then set breakpoints inside the map() or reduce() methods to discover your bugs.
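For example, a driver can force this mode explicitly (a sketch; these are the classic configuration property names, and the values shown apply to the 0.18-era releases this tutorial targets):

// run the whole job in a single local process via the LocalJobRunner
conf.set("mapred.job.tracker", "local");
// operate on the local filesystem rather than HDFS
conf.set("fs.default.name", "file:///");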
In Module 5, you will learn how to use additional features of MapReduce to distribute auxiliary code to nodes in the system. This can be used to enable debug scripts which run on machines when tasks fail.
LISTING AND KILLING JOBS
It is possible to submit jobs to a Hadoop cluster which malfunction and send themselves into infinite loops or other problematic states. In this case, you will want to manually kill the job you have started.
The following command, run in the Hadoop installation directory on a Hadoop cluster, will list all the current jobs:

$ bin/hadoop job -list

This will produce output that looks something like:

1 jobs currently running
JobId                  State  StartTime      UserName
job_200808111901_0001  1      1218506470390  aaron
You can use this job id to kill the job; the command is:

$ bin/hadoop job -kill jobid

Substitute the "job_2008..." job id from the -list command for jobid.
Additional Language Support
Hadoop itself is written in Java; it thus accepts Java code natively for Mappers and Reducers. Hadoop also comes with two adapter layers which allow code written in other languages to be used in MapReduce programs.
PIPES
Pipes is a library which allows C++ source code to be used for Mapper and Reducer code. Applications which require high numerical performance may see better throughput if written in C++ and used through Pipes. This library is supported on 32-bit Linux installations.
The include files and static libraries are present in the c++/Linux-i386-32/ directory under your Hadoop installation. Your application should include include/hadoop/Pipes.hh and TemplateFactory.hh and link against lib/libhadooppipes.a; with gcc, include the arguments -L${HADOOP_HOME}/c++/Linux-i386-32/lib -lhadooppipes to do the latter.
Both key and value inputs to pipes programs are provided as STL strings (std::string). A program must still define an instance of Mapper and Reducer; these names have not changed. (They, like all other classes defined in Pipes, are in the HadoopPipes namespace.) Unlike the classes of the same names in Hadoop itself, the map() and reduce() functions take in a single argument which is a reference to an object of type MapContext and ReduceContext respectively. The most important methods contained in each of these context objects are:
const std::string& getInputKey();
const std::string& getInputValue();
void emit(const std::string& key, const std::string& value);
The ReduceContext class also contains an additional method to advance the value iterator:

bool nextValue();
Defining a Pipes Program: A program to use with Pipes is defined by writing classes extending Mapper and Reducer. (And optionally, Partitioner; see Module 5.) Hadoop must then be informed which classes to use to run the job.
An instance of your C++ program will be started by the Pipes framework in main() on each machine. This should do any (hopefully brief) configuration required for your task. It should then define a Factory to create Mapper and Reducer instances as necessary, and then run the job by calling the runTask() method. The simplest way to define a factory is with the following code:
#include "TemplateFactory.hh"
using namespace HadoopPipes;

void main() {
  // Classes are inserted to the factory via templates.
  // TODO: Substitute your own class names in below.
  TemplateFactory2<MyMapperClass, MyReducerClass> factory();
  runTask(factory);
}
Running a Pipes Program: After a Pipes program has been written and compiled, it can be launched as a job with the following command: (Do this in your Hadoop home directory)

$ bin/hadoop pipes -input inputPath -output outputPath -program path/to/pipes/program
This will deploy your Pipes program on all nodes and run the MapReduce job through it. By running bin/hadoop pipes with no options, you can see additional usage information which describes how to set additional configuration values as necessary.
The Pipes API contains additional functionality to allow you to read settings from the JobConf, override the Partitioner class, and use RecordReaders in a more direct fashion for higher performance. See the header files in c++/Linux-i386-32/include/hadoop for more information.
HADOOP STREAMING
Whereas Pipes is an API that provides close coupling between C++ application code and Hadoop, Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.
The official Hadoop documentation contains a thorough introduction to Streaming, and briefer notes on the wiki. A brief overview is presented here.
Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job. Both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout. Input and output are always represented textually in Streaming. The input (key, value) pairs are written to stdin for a Mapper or Reducer, with a 'tab' character separating the key from the value. The Streaming programs should split the input on the first tab character on the line to recover the key and the value. Streaming programs write their output to stdout in the same format: key \t value.
The inputs to the reducer are sorted so that while each line contains only a single (key, value) pair, all the values for the same key are adjacent to one another.
Provided it can handle its input in the text format described above, any Linux program or tool can be used as the mapper or reducer in Streaming. You can also write your own scripts in bash, python, perl, or another language of your choice, provided that the necessary interpreter is present on all nodes in your cluster.
Running a Streaming Job: To run a job with Hadoop Streaming, use the following command:

$ bin/hadoop jar contrib/streaming/hadoop-version-streaming.jar

The command as shown, with no arguments, will print some usage information. An example of how to run real commands is given below:

$ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper \
    myMapProgram -reducer myReduceProgram -input /some/dfs/path \
    -output /some/other/dfs/path
This assumes that myMapProgram and myReduceProgram are present on all nodes in the system ahead of time. If this is not the case, but they are present on the node launching the job, then they can be "shipped" to the other nodes with the -file option:
$ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper \
    myMapProgram -reducer myReduceProgram -file myMapProgram -file \
    myReduceProgram -input some/dfs/path -output some/other/dfs/path
Any other support files necessary to run your program can be shipped in this manner as well.
Conclusions
This module described the MapReduce execution platform at the heart of the Hadoop system. By using MapReduce, a high degree of parallelism can be achieved by applications. The MapReduce framework provides a high degree of fault tolerance for applications running on it by limiting the communication which can occur between nodes, and requiring applications to be written in a "dataflow-centric" manner.
Solution to Inverted Index Code
The following source code implements a solution to the inverted indexer problem posed at the checkpoint. The source code is structurally very similar to the source for Word Count; only a few lines really need to be modified.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class LineIndexer {

  public static class LineIndexMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text val,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

      FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
      String fileName = fileSplit.getPath().getName();
      location.set(fileName);

      String line = val.toString();
      StringTokenizer itr = new StringTokenizer(line.toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, location);
      }
    }
  }

  public static class LineIndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

      boolean first = true;
      StringBuilder toReturn = new StringBuilder();
      while (values.hasNext()) {
        if (!first)
          toReturn.append(", ");
        first = false;
        toReturn.append(values.next().toString());
      }

      output.collect(key, new Text(toReturn.toString()));
    }
  }

  /**
   * The actual main() method for our program; this is the
   * "driver" for the MapReduce job.
   */
  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf = new JobConf(LineIndexer.class);

    conf.setJobName("LineIndexer");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    conf.setMapperClass(LineIndexMapper.class);
    conf.setReducerClass(LineIndexReducer.class);

    client.setConf(conf);

    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
"Hadoop Tutorial from Yahoo!" by Yahoo! Inc. is licensed under a Creative Commons Attribution 3.0 Unported License.