Apache Hadoop
Module 4: MapReduce
Introduction
MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. This module explains the nature of this programming model and how it can be used to write programs which run in the Hadoop environment.
Goals for this Module:
- Understand functional programming as it applies to MapReduce
- Understand the MapReduce program flow
- Understand how to write programs for Hadoop MapReduce
- Learn about additional features of Hadoop designed to aid software development
Outline
1. Introduction
2. Goals for this Module
3. Outline
4. Prerequisites
5. MapReduce Basics
     1. Functional Programming Concepts
     2. List Processing
     3. Mapping Lists
     4. Reducing Lists
     5. Putting Them Together in MapReduce
     6. An Example Application: Word Count
     7. The Driver Method
6. MapReduce Data Flow
     1. A Closer Look
     2. Additional MapReduce Functionality
     3. Fault Tolerance
7. Checkpoint
8. More Tips
     1. Chaining Jobs
     2. Troubleshooting: Debugging MapReduce
     3. Listing and Killing Jobs
9. Additional Language Support
     1. Pipes
     2. Hadoop Streaming
10. Conclusions
11. Solution to Inverted Index Code
Prerequisites
This module requires that you have set up a working Hadoop environment as described in Module 3. If you have not already configured Hadoop and successfully run the example applications, go back and do so now.
MapReduce Basics
FUNCTIONAL PROGRAMMING CONCEPTS
MapReduce programs are designed to compute large volumes of data in a parallel fashion. This requires dividing the workload across a large number of machines. This model would not scale to large clusters (hundreds or thousands of nodes) if the components were allowed to share data arbitrarily. The communication overhead required to keep the data on the nodes synchronized at all times would prevent the system from performing reliably or efficiently at large scale.
Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated. If in a mapping task you change an input (key, value) pair, the change does not get reflected back in the input files; communication occurs only by generating new output (key, value) pairs which are then forwarded by the Hadoop system into the next phase of execution.
LIST PROCESSING
Conceptual, MapReduce programs transform iss of input data elements no Hts of cup data elements
[A MapReduce program will doth twice, using vo diferent st processing iioms: map, and reduce. These
toms are taken from sovaral Ist processing languages such as LISP, Scheme, or ML
MAPPING LISTS
The first phase of a MapReduce program is called mapping. A list of data elements is provided, one at a time, to a function called the Mapper, which transforms each element individually to an output data element.
Figure 4.1: Mapping creates a new output list by applying a function to individual elements of an input list.
As an example of the utility of map: Suppose you had a function toUpper(str) which returns an uppercase version of its input string. You could use this function with map to turn a list of strings into a list of uppercase strings. Note that we are not modifying the input string here: we are returning a new string that will form part of a new output list.
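As a quick illustration outside of Hadoop, here is a minimal plain-Java sketch of the map idiom (the class and method names here are hypothetical, not part of any Hadoop API):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapIdiom {
  // Returns a new uppercase string; the input string is not modified.
  static String toUpper(String str) {
    return str.toUpperCase();
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("foo", "bar", "baz");
    List<String> output = new ArrayList<String>();
    for (String s : input) {
      output.add(toUpper(s));    // apply the mapping function to each element
    }
    System.out.println(output);  // prints [FOO, BAR, BAZ]
  }
}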
REDUCING LISTS
Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.
Figure 4.2: Reducing a list iterates over the input values to produce an aggregate value as output.
Reducing is often used to produce "summary" data, turning a large volume of data into a smaller summary of itself. For example, "+" can be used as a reducing function, to return the sum of a list of input values.
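The reduce idiom can be sketched the same way in plain Java (again, a hypothetical standalone example rather than Hadoop code):

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceIdiom {
  public static void main(String[] args) {
    List<Integer> values = Arrays.asList(4, 8, 15, 16, 23, 42);
    // "+" acts as the reducing function, folding the iterator into one sum.
    int sum = 0;
    Iterator<Integer> it = values.iterator();
    while (it.hasNext()) {
      sum += it.next();
    }
    System.out.println(sum);  // prints 108
  }
}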
PUTTING THEM TOGETHER IN MAPREDUCE
The Hadoop MapReduce framework takes these concepts and uses them to process large volumes of information. A MapReduce program has two components: one that implements the mapper, and another that implements the reducer. The Mapper and Reducer idioms described above are extended slightly to work in this environment, but the basic principles are the same.
Keys and values: In MapReduce, no value stands on its own. Every value has a key associated with it. Keys identify related values. For example, a log of time-coded speedometer readings from multiple cars could be keyed by license-plate number; it would look like:

AAA-123   65mph, 12:00pm
ZZZ-789   50mph, 12:02pm
AAA-123   40mph, 12:05pm
CCC-456   25mph, 12:15pm
...
The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of these functions is the same: both a key and a value must be emitted to the next list in the data flow. MapReduce is also less strict than other languages about how the Mapper and Reducer work. In more formal functional mapping and reducing settings, a mapper must produce exactly one output element for each input element, and a reducer must produce exactly one output element for each input list. In MapReduce, an arbitrary number of values can be output from each phase; a mapper may map one input into zero, one, or one hundred outputs. A reducer may compute over an input list and emit one or a dozen different outputs.
Keys divide the reduce space: A reducing function turns a large list of values into one (or a few) output values. In MapReduce, all of the output values are not usually reduced together. All of the values with the same key are presented to a single reducer together. This is performed independently of any reduce operations occurring on other lists of values, with different keys attached.
Figure 4.3: Different colors represent different keys. All values with the same key are presented to a single reduce task.
AN EXAMPLE APPLICATION: WORD COUNT
A simple MapReduce program can be written to determine how many times different words appear in a set of files. For example, if we had the files:

foo.txt: Sweet, this is the foo file
bar.txt: This is the bar file

We would expect the output to be:

sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2

Naturally, we can write a program in MapReduce to compute this output. The high-level structure would look like this:
mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)

Listing 4.1: High-Level MapReduce Word Count
Several instances of the mapper function are created on the different machines in our cluster. Each instance receives a different input file (it is assumed that we have many such files). The mappers output (word, 1) pairs which are then forwarded to the reducers. Several instances of the reducer method are also instantiated on the different machines. Each reducer is responsible for processing the list of values associated with a different word. The list of values will be a list of 1's; the reducer sums up those ones into a final count associated with a single word. The reducer then emits the final (word, count) output which is written to an output file.
We can write a very similar program to this in Hadoop MapReduce; it is included in the Hadoop distribution in src/examples/org/apache/hadoop/examples/WordCount.java. It is partially reproduced below:
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

// A reducer class that just emits the sum of the input values
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output,
      Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Listing 4.2: Hadoop MapReduce Word Count Source
There are some minor differences between this actual Java implementation and the pseudo-code shown above. First, Java has no native emit keyword; the OutputCollector object you are given as an input will receive values to emit to the next stage of execution. And second, the default input format used by Hadoop presents each line of an input file as a separate input to the mapper function, not the entire file at a time. It also uses a StringTokenizer object to break up the line into words. This does not perform any normalization of the input, so "cat", "Cat" and "cat." are all regarded as different strings. Note that the class-variable word is reused each time the mapper outputs another (word, 1) pairing; this saves time by not allocating a new variable for each output. The output.collect() method will copy the values it receives as input, so you are free to overwrite the variables you reuse.
THE DRIVER METHOD
There is one final component of a Hadoop MapReduce program, called the Driver. The driver initializes the job and instructs the Hadoop platform to execute your code on a set of input files, and controls where the output files are placed. A cleaned-up version of the driver from the example Java implementation that comes with Hadoop is presented below:
public void run(String inputPath, String outputPath) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  // the keys are words (strings)
  conf.setOutputKeyClass(Text.class);
  // the values are counts (ints)
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setReducerClass(Reduce.class);

  FileInputFormat.addInputPath(conf, new Path(inputPath));
  FileOutputFormat.setOutputPath(conf, new Path(outputPath));

  JobClient.runJob(conf);
}
Listing 4.3: Hadoop MapReduce Word Count Driver
This method sets up a job to execute the word count program across all the files in a given input directory (the inputPath argument). The output from the reducers is written into files in the directory identified by outputPath. The configuration information for the job is captured in the JobConf object. The mapping and reducing functions are identified by the setMapperClass() and setReducerClass() methods. The data types emitted by the reducer are identified by setOutputKeyClass() and setOutputValueClass(). By default, it is assumed that these are the output types of the mapper as well. If this is not the case, the setMapOutputKeyClass() and setMapOutputValueClass() methods of the JobConf class will override these. The input types fed to the mapper are controlled by the InputFormat used. Input formats are discussed in more detail below. The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs. The long value is the byte offset of the line in the file. The Text object holds the string contents of the line of the file.
The call to JobClient.runJob(conf) will submit the job to MapReduce. This call will block until the job completes. If the job fails, it will throw an IOException. JobClient also provides a non-blocking version called submitJob().
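For example, a minimal sketch of non-blocking submission (reusing the conf object from Listing 4.3, inside a method that may throw Exception; the five-second polling interval is an arbitrary choice):

JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);  // returns without waiting
while (!job.isComplete()) {
  Thread.sleep(5000);                     // poll until the job finishes
}
if (!job.isSuccessful()) {
  System.err.println("Job failed!");
}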
MapReduce Data Flow
"Now that we have een the components that make up @ baste
works together at higher lave
Haoop Tutorial - YDN
MapReduce jo, we can see how everything
Figure 4.4: High-level MapReduce pipeline
MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS. These files are evenly distributed across all our nodes. Running a MapReduce program involves running mapping tasks on many or all of the nodes in our cluster. Each of these mapping tasks is equivalent: no mappers have particular "identities" associated with them. Therefore, any mapper can process any input file. Each mapper loads the set of files local to that machine and processes them.
When the mapping phase has completed, the intermediate (key, value) pairs must be exchanged between machines to send all values with the same key to a single reducer. The reduce tasks are spread across the same nodes in the cluster as the mappers. This is the only communication step in MapReduce. Individual map tasks do not exchange information with one another, nor are they aware of one another's existence. Similarly, different reduce tasks do not communicate with one another. The user never explicitly marshals information from one machine to another; all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys associated with values. This is a fundamental element of Hadoop MapReduce's reliability. If nodes in the cluster fail, tasks must be able to be restarted. If they have been performing side-effects, e.g., communicating with the outside world, then the shared state must be restored in a restarted task. By eliminating communication and side-effects, restarts can be handled more gracefully.
A CLOSER LOOK
The previous figure described the high-level view of Hadoop MapReduce. From this diagram, you can see where the mapper and reducer components of the Word Count application fit in, and how it achieves its objective. We will now examine this system in a bit closer detail.
Figure 4.5: Detailed Hadoop MapReduce data flow
Figure 4.5 shows the pipeline with more of its mechanics exposed. While only two nodes are depicted, the same pipeline can be replicated across a very large number of nodes. The next several paragraphs describe each of the stages of a MapReduce program more precisely.
Input files: This is where the data for a MapReduce task is initially stored. While this does not need to be the case, the input files typically reside in HDFS. The format of these files is arbitrary: while line-based log files can be used, we could also use a binary format, multi-line input records, or something else entirely. It is typical for these input files to be very large -- tens of gigabytes or more.
InputFormat: How these input files are split up and read is defined by the InputFormat. An InputFormat is a class that provides the following functionality:

- Selects the files or other objects that should be used for input
- Defines the InputSplits that break a file into tasks
- Provides a factory for RecordReader objects that read the file
Several InputFormats are provided with Hadoop. An abstract type is called FileInputFormat; all InputFormats that operate on files inherit functionality and properties from this class. When starting a Hadoop job, FileInputFormat is provided with a path containing files to read. The FileInputFormat will read all files in this directory. It then divides these files into one or more InputSplits each. You can choose which InputFormat to apply to your input files for a job by calling the setInputFormat() method of the JobConf object that defines the job. A table of standard InputFormats is given below.
InputFormat: | Description: | Key: | Value:
TextInputFormat | Default format; reads lines of text files | The byte offset of the line | The line contents
KeyValueInputFormat | Parses lines into (key, val) pairs | Everything up to the first tab character | The remainder of the line
SequenceFileInputFormat | A Hadoop-specific high-performance binary format | user-defined | user-defined

Table 4.1: InputFormats provided by MapReduce
The default InputFormat is the TextInputFormat. This treats each line of each input file as a separate record, and performs no parsing. This is useful for unformatted data or line-based records like log files. A more interesting input format is the KeyValueInputFormat. This format also treats each line of input as a separate record. While the TextInputFormat treats the entire line as the value, the KeyValueInputFormat breaks the line itself into the key and value by searching for a tab character. This is particularly useful for reading the output of one MapReduce job as the input to another, as the default OutputFormat (described in more detail below) formats its results in this manner. Finally, the SequenceFileInputFormat reads special binary files that are specific to Hadoop. These files include many features designed to allow data to be rapidly read into Hadoop mappers. Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text). Sequence files can be generated as the output of other MapReduce tasks and are an efficient intermediate representation for data that is passing from one MapReduce job to another.
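For example, a driver can select the key/value format with a single call (a sketch; in the classic org.apache.hadoop.mapred API the concrete class is named KeyValueTextInputFormat):

conf.setInputFormat(KeyValueTextInputFormat.class);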
InputSplits: An InputSplit describes a unit of work that comprises a single map task in a MapReduce program. A MapReduce program applied to a data set, collectively referred to as a Job, is made up of several (possibly several hundred) tasks. Map tasks may involve reading a whole file; they often involve reading only part of a file. By default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same size as blocks in HDFS). You can control this value by setting the mapred.min.split.size parameter in hadoop-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job. By processing a file in chunks, we allow several map tasks to operate on a single file in parallel. If the file is very large, this can improve performance significantly through parallelism. Even more importantly, since the various blocks that make up the file may be spread across several different nodes in the cluster, it allows tasks to be scheduled on each of these different nodes; the individual blocks are thus all processed locally, instead of needing to be transferred from one node to another. Of course, while log files can be processed in this piece-wise fashion, some file formats are not amenable to chunked processing. By writing a custom InputFormat, you can control how the file is broken up (or is not broken up) into splits. Custom input formats are described in Module 5.
The InputFormat defines the list of tasks that make up the mapping phase; each task corresponds to a single input split. The tasks are then assigned to the nodes in the system based on where the input file chunks are physically resident. An individual node may have several dozen tasks assigned to it. The node will begin working on the tasks, attempting to perform as many in parallel as it can. The on-node parallelism is controlled by the mapred.tasktracker.map.tasks.maximum parameter.
RecordReader: The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat. The default InputFormat, TextInputFormat, provides a LineRecordReader, which treats each line of the input file as a new value. The key associated with each line is its byte offset in the file. The RecordReader is invoked repeatedly on the input until the entire InputSplit has been consumed. Each invocation of the RecordReader leads to another call to the map() method of the Mapper.
Mapper: The Mapper performs the interesting user-defined work of the first phase of the MapReduce program. Given a key and a value, the map() method emits (key, value) pair(s) which are forwarded to the Reducers. A new instance of Mapper is instantiated in a separate Java process for each map task (InputSplit) that makes up part of the total job input. The individual mappers are intentionally not provided with a mechanism to communicate with one another in any way. This allows the reliability of each map task to be governed solely by the reliability of the local machine. The map() method receives two parameters in addition to the key and the value; a usage sketch follows this list:

- The OutputCollector object has a method named collect() which will forward a (key, value) pair to the reduce phase of the job.
- The Reporter object provides information about the current task; its getInputSplit() method will return an object describing the current InputSplit. It also allows the map task to provide additional information about its progress to the rest of the system. The setStatus() method allows you to emit a status message back to the user. The incrCounter() method allows you to increment shared performance counters. You may define as many arbitrary counters as you wish. Each mapper can increment the counters, and the JobTracker will collect the increments made by the different processes and aggregate them for later retrieval when the job ends.
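A minimal sketch of these parameters in use (this method drops into the MapClass of Listing 4.2; the counter group and counter names are arbitrary examples):

public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output,
    Reporter reporter) throws IOException {
  reporter.setStatus("processing byte offset " + key.get());
  reporter.incrCounter("WordCount", "input-lines", 1);
  // ... tokenize the line and output.collect() the (word, one)
  // pairs exactly as in Listing 4.2 ...
}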
Parton & Shu: Aftor tho fst map tasks have completed, the nodes may stillb pearing several
‘more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to whore
thoy are requlea by the reduces. This process of moving map outputs to the reducers fs known a8 shutting
Aaferent subset ofthe intermediate key space is essigned to each reduce node; these subsets known as
“prttons) are the inputs to the reduce tasks, Each map ask may emt key, vale) pas to any parton: all
‘valves fr the same key are always reduced together regardloss of which mapper is is origin. Therefore, the
‘map nodes mut all agree on where to send he fren pores of the intermodite data, The Patton class
determines hic partion a given (key, value) pa wil go to. The default prions computes @ hash value
forth key and assigns the partion basad onthe result, Custom pattioners are deserbed in mors deal
Module 6
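As an illustration, the default behavior can be written out as a custom Partitioner (this mirrors the logic of Hadoop's default HashPartitioner; the class name here is our own):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) { }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Hash the key, clear the sign bit, and map the result onto
    // one of the numPartitions reduce partitions.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

A custom partitioner such as this would be registered in the driver with conf.setPartitionerClass(WordPartitioner.class).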
Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
Reduce: A Reducer instance is created for each reduce task. This is an instance of user-provided code that performs the second important phase of job-specific work. For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called once. This receives a key as well as an iterator over all the values associated with the key. The values associated with a key are returned by the iterator in an undefined order. The Reducer also receives the OutputCollector and Reporter objects as parameters; they are used in the same manner as in the map() method.
OutputFormat: The (key, value) pairs provided to this OutputCollector are then written to output files. The way they are written is governed by the OutputFormat. The OutputFormat functions much like the InputFormat class described earlier. The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS; they all inherit from a common FileOutputFormat. Each Reducer writes a separate file in a common output directory. These files will typically be named part-nnnnn, where nnnnn is the partition id associated with the reduce task. The output directory is set by the FileOutputFormat.setOutputPath() method. You can control which particular OutputFormat is used by calling the setOutputFormat() method of the JobConf object that defines your MapReduce job. A table of provided OutputFormats is given below.
OutputFormat: | Description
TextOutputFormat | Default; writes lines in "key \t value" form
SequenceFileOutputFormat | Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat | Disregards its inputs

Table 4.2: OutputFormats provided by Hadoop
Hadoop provides some OutputFormat instances to write to files. The basic (default) instance is TextOutputFormat, which writes (key, value) pairs on individual lines of a text file. This can be easily re-read by a later MapReduce task using the KeyValueInputFormat class, and is also human-readable. A better intermediate format for use between MapReduce jobs is the SequenceFileOutputFormat, which rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and presents the data to the next Mapper in the same manner as it was emitted by the previous Reducer. The NullOutputFormat generates no output files and disregards any (key, value) pairs passed to it by the OutputCollector. This is useful if you are explicitly writing your own output files in the reduce() method, and do not want additional empty output files generated by the Hadoop framework.
RecordWriter: Much like how the InputFormat actually reads individual records through the RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects; these are used to write the individual records to the files as directed by the OutputFormat.
The output files written by the Reducers are then left in HDFS for your use, either by another MapReduce job, a separate program, or for human inspection.
ADDITIONAL MAPREDUCE FUNCTIONALITY
Figure 4.6: Combiner step inserted into the MapReduce data flow
Combiner: The pipeline shown earlier omits a processing step which can be used for optimizing bandwidth usage by your MapReduce job. Called the Combiner, this pass runs after the Mapper and before the Reducer. Usage of the Combiner is optional. If this pass is suitable for your job, instances of the Combiner class are run on every node that has run map tasks. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers. The Combiner is a "mini-reduce" process which operates only on data generated by one machine.
Word count is a prime example of where a Combiner is useful. The Word Count program in listings 4.1--4.3 emits a (word, 1) pair for every instance of every word it sees. So if the same document contains the word "cat" 3 times, the pair ("cat", 1) is emitted three times; all of these are then sent to the Reducer. By using a Combiner, these can be condensed into a single ("cat", 3) pair to be sent to the Reducer. Now each node only sends a single value to the reducer for each word -- drastically reducing the total bandwidth required for the shuffle process, and speeding up the job. The best part of all is that we do not need to write any additional code to take advantage of this! If a reduce function is both commutative and associative, then it can be used as a Combiner as well. You can enable combining in the word count program by adding the following line to the driver:
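conf.setCombinerClass(Reduce.class);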
The Combiner should be an instance of the Reducer interface. If your Reducer itself cannot be used directly as a Combiner because of commutativity or associativity, you might still be able to write a third class to use as a Combiner for your job.
FAULT TOLERANCE
One of the primary reasons to use Hadoop to run your jobs is due to its high degree of fault tolerance. Even when running jobs on a large cluster where individual nodes or network components may experience high rates of failure, Hadoop can guide jobs toward a successful completion.
The primary way that Hadoop achieves fault tolerance is through restarting tasks. Individual task nodes (TaskTrackers) are in constant communication with the head node of the system, called the JobTracker. If a TaskTracker fails to communicate with the JobTracker for a period of time (by default, 1 minute), the JobTracker will assume that the TaskTracker in question has crashed. The JobTracker knows which map and reduce tasks were assigned to each TaskTracker.
If the job is still in the mapping phase, then other TaskTrackers will be asked to re-execute all map tasks previously run by the failed TaskTracker. If the job is in the reducing phase, then other TaskTrackers will re-execute all reduce tasks that were in progress on the failed TaskTracker.
Reduce tasks, once completed, have been written back to HDFS. Thus, if a TaskTracker has already completed two out of three reduce tasks assigned to it, only the third task must be executed elsewhere. Map tasks are slightly more complicated: even if a node has completed ten map tasks, the reducers may not have all copied their inputs from the output of those map tasks. If a node has crashed, then its mapper outputs are inaccessible. So any already-completed map tasks must be re-executed to make their results available to the rest of the reducing machines. All of this is handled automatically by the Hadoop platform.
This fault tolerance underscores the need for program execution to be side-effect free. If Mappers and Reducers had individual identities and communicated with one another or the outside world, then restarting a task would require the other nodes to communicate with the new instances of the map and reduce tasks, and the re-executed tasks would need to reestablish their intermediate state. This process is notoriously complicated and error-prone in the general case. MapReduce simplifies this problem drastically by eliminating task identities and the ability of task partitions to communicate with one another. An individual task sees only its own direct inputs and knows only its own outputs, making this failure and restart process clean and dependable.
Speculative execution: One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example, if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution.
When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively.
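For example, in the driver (a sketch using the JobConf convenience setters from the classic API, which set those same two properties):

// disable speculative execution for this job's map and reduce tasks
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(false);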
Checkpoint
You now know about all of the basic operations of the Hadoop MapReduce platform. Try the following exercise, to see if you understand the MapReduce programming concepts.
Exercise: Given the code for WordCount in listings 4.2 and 4.3, modify this code to produce an inverted index of its inputs. An inverted index returns a list of documents that contain each word in those documents. Thus, if the word "cat" appears in documents A and B, but not C, then the line:

cat A, B

should appear in the output. If the word "baseball" appears in documents B and C, then the line:

baseball B, C

should appear in the output as well.
If you get stuck, read the section on troubleshooting below. The working solution is provided at the end of this module.
Hint: The default InputFormat will provide the Mapper with (key, value) pairs where the key is the byte offset into the file, and the value is a line of text. To get the filename of the current input, use the following code:
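FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String fileName = fileSplit.getPath().getName();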
More Tips
CHAINING JOBS
"Not every problem can be slved with a MapReduce program, but fewest
with a single MapReduce job. Many problems can be solved wth MapRecuce, by wing several MapReduce
stops wich rn in series fo accomplish a goal
those which can be solved
Mapt > Reducet > Map2 > Redueo2 > Map
You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job. Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, then call the next driver method, which creates a new JobConf object referring to different instances of Mapper and Reducer, etc. The first job in the chain should write its output to a path which is then used as the input path for the second job. This process can be repeated for as many jobs as are necessary to arrive at a complete solution to the problem.
Many problems which at first seem impossible in MapReduce can be accomplished by dividing one job into two or more.
Hadoop provides another mechanism for managing batches of jobs with dependencies between jobs. Rather than submitting a JobConf to the JobClient's runJob() or submitJob() methods, org.apache.hadoop.mapred.jobcontrol.Job objects can be created to represent each job; a Job takes a JobConf object as its constructor argument. Jobs can depend on one another through the use of the addDependingJob() method. The code:
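x.addDependingJob(y)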
says that Job x cannot start until y has successfully completed. Dependency information cannot be added to a job after it has already been started. Given a set of jobs, these can be passed to an instance of the JobControl class. JobControl can receive individual jobs via the addJob() method, or a collection of jobs via addJobs(). The JobControl object will spawn a thread in the client to launch the jobs. Individual jobs will be launched when their dependencies have all successfully completed and when the MapReduce system as a whole has resources to execute the jobs. The JobControl interface allows you to query and retrieve the state of individual jobs, as well as the list of jobs waiting, ready, running, and finished. The job submission process does not begin until the run() method of the JobControl object is called.
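Putting this together, a minimal sketch (conf1 and conf2 are assumed to be already-configured JobConf objects, with conf2 reading conf1's output path; error handling omitted):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public static void runChain(JobConf conf1, JobConf conf2) throws Exception {
  Job job1 = new Job(conf1);
  Job job2 = new Job(conf2);
  job2.addDependingJob(job1);  // job2 will not launch until job1 succeeds

  JobControl jc = new JobControl("chained-jobs");
  jc.addJob(job1);
  jc.addJob(job2);

  // run() blocks, so launch the JobControl in its own thread,
  // then poll until every job in the group has completed.
  new Thread(jc).start();
  while (!jc.allFinished()) {
    Thread.sleep(5000);
  }
}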
TROUBLESHOOTING: DEBUGGING MAPREDUCE
When writing MapReduce programs, you will occasionally encounter bugs in your programs, infinite loops, etc. This section describes the features of MapReduce that will help you diagnose and solve these conditions.
Log Files: Hadoop keeps logs of important events during program execution. By default, these are stored in the logs/ subdirectory of the hadoop-version/ directory where you ran Hadoop from. Log files are named hadoop-username-service-hostname.log. The most recent data is in the .log file; older logs have their date appended to them. The username in the log filename refers to the username under which Hadoop was started -- this is not necessarily the same username you are using to run programs. The service name refers to which of the several Hadoop programs is writing the log; these can be jobtracker, namenode, datanode, secondarynamenode, or tasktracker. All of these are important for debugging a whole Hadoop installation. But for individual programs, the tasktracker logs will be the most relevant. Any exceptions thrown by your program will be recorded in the tasktracker logs.
The log directory will also have a subdirectory called userlogs. Here there is another subdirectory for every task run. Each task records its stdout and stderr to two files in this directory. Note that on a multi-node Hadoop cluster, these logs are not centrally aggregated -- you should check each TaskNode's logs/userlogs/ directory for their output.
Debugging in the distributed setting is complicated and requires logging into several machines to access log data. If possible, programs should be unit tested by running Hadoop locally. The default configuration deployed by Hadoop runs in "single instance" mode, where the entire MapReduce program is run in the same instance of Java as called JobClient.runJob(). Using a debugger like Eclipse, you can then set breakpoints inside the map() or reduce() methods to discover your bugs.
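For example, a driver can force this mode explicitly (a sketch; these are the classic configuration property names, and the values shown apply to the 0.18-era releases this tutorial targets):

// run the whole job in a single local process via the LocalJobRunner
conf.set("mapred.job.tracker", "local");
// operate on the local filesystem rather than HDFS
conf.set("fs.default.name", "file:///");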
In Module 5, you will learn how to use additional features of MapReduce to distribute auxiliary code to nodes in the system. This can be used to enable debug scripts which run on machines when tasks fail.
LISTING AND KILLING JOBS
It is possible to submit jobs to a Hadoop cluster which malfunction and send themselves into infinite loops or other problematic states. In this case, you will want to manually kill the job you have started.
The following command, run in the Hadoop installation directory on a Hadoop cluster, will list all the current jobs:

$ bin/hadoop job -list

This will produce output that looks something like:

1 jobs currently running
JobId                  State  StartTime      UserName
job_200808111901_0001  1      1218506470390  aaron
You can use this job id to kill the job; the command is:

$ bin/hadoop job -kill jobid

Substitute the "job_2008..." job id from the -list command for jobid.
Additional Language Support
Hadoop itself is written in Java; it thus accepts Java code natively for Mappers and Reducers. Hadoop also comes with two adapter layers which allow code written in other languages to be used in MapReduce programs.
PIPES
Pipes is a library which allows C++ source code to be used for Mapper and Reducer code. Applications which require high numerical performance may see better throughput if written in C++ and used through Pipes. This library is supported on 32-bit Linux installations.
The include files and static libraries are present in the c++/Linux-i386-32/ directory under your Hadoop installation. Your application should include include/hadoop/Pipes.hh and TemplateFactory.hh and link against lib/libhadooppipes.a; with gcc, include the arguments -L${HADOOP_HOME}/c++/Linux-i386-32/lib -lhadooppipes to do the latter.
Both key and value inputs to pipes programs are provided as STL strings (std::string). A program must still define an instance of Mapper and Reducer; these names have not changed. (They, like all other classes defined in Pipes, are in the HadoopPipes namespace.) Unlike the classes of the same names in Hadoop itself, the map() and reduce() functions take in a single argument which is a reference to an object of type MapContext and ReduceContext respectively. The most important methods contained in each of these context objects are:
const std::string& getInputKey();
const std::string& getInputValue();
void emit(const std::string& key, const std::string& value);
The ReduceContext class also contains an additional method to advance the value iterator:

bool nextValue();
Defining a Pipes Program: A program to use with Pipes is defined by writing classes extending Mapper and Reducer. (And optionally, Partitioner; see Module 5.) Hadoop must then be informed which classes to use to run the job.
An instance of your C++ program will be started by the Pipes framework in main() on each machine. This should do any (hopefully brief) configuration required for your task. It should then define a Factory to create Mapper and Reducer instances as necessary, and then run the job by calling the runTask() method. The simplest way to define a factory is with the following code:
#include "TemplateFactory.hh"
using namespace HadoopPipes;

void main() {
  // Classes are inserted to the factory via templates.
  // TODO: Substitute your own class names in below.
  TemplateFactory2<MyMapperClass, MyReducerClass> factory();
  runTask(factory);
}
Running a Pipes Program: After a Pipes program has been written and compiled, it can be launched as a job with the following command: (Do this in your Hadoop home directory)

$ bin/hadoop pipes -input inputPath -output outputPath -program path/to/pipes/program
This will deploy your Pipes program on all nodes and run the MapReduce job through it. By running bin/hadoop pipes with no options, you can see additional usage information which describes how to set additional configuration values as necessary.
The Pipes API contains additional functionality to allow you to read settings from the JobConf, override the Partitioner class, and use RecordReaders in a more direct fashion for higher performance. See the header files in c++/Linux-i386-32/include/hadoop for more information.
HADOOP STREAMING
Whereas Pipes is an API that provides close coupling between C++ application code and Hadoop, Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.
The official Hadoop documentation contains a thorough introduction to Streaming, and briefer notes on the wiki. A brief overview is presented here.
Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job. Both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout. Input and output are always represented textually in Streaming. The input (key, value) pairs are written to stdin for a Mapper or Reducer, with a 'tab' character separating the key from the value. The Streaming programs should split the input on the first tab character on the line to recover the key and the value. Streaming programs write their output to stdout in the same format: key \t value.
The inputs to the reducer are sorted so that while each line contains only a single (key, value) pair, all the values for the same key are adjacent to one another.
Provided it can handle its input in the text format described above, any Linux program or tool can be used as the mapper or reducer in Streaming. You can also write your own scripts in bash, python, perl, or another language of your choice, provided that the necessary interpreter is present on all nodes in your cluster.
Running a Streaming Job: To run a job with Hadoop Streaming, use the following command:

$ bin/hadoop jar contrib/streaming/hadoop-version-streaming.jar

The command as shown, with no arguments, will print some usage information. An example of how to run real commands is given below:

$ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper \
    myMapProgram -reducer myReduceProgram -input /some/dfs/path \
    -output /some/other/dfs/path
This assumes that myMapProgram and myReduceProgram are present on all nodes in the system ahead of time. If this is not the case, but they are present on the node launching the job, then they can be "shipped" to the other nodes with the -file option:
$ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper \
    myMapProgram -reducer myReduceProgram -file myMapProgram -file \
    myReduceProgram -input some/dfs/path -output some/other/dfs/path
Any other support files necessary to run your program can be shipped in this manner as well.
Conclusions
This module described the MapReduce execution platform at the heart of the Hadoop system. By using MapReduce, a high degree of parallelism can be achieved by applications. The MapReduce framework provides a high degree of fault tolerance for applications running on it by limiting the communication which can occur between nodes, and requiring applications to be written in a "dataflow-centric" manner.
Solution to Inverted Index Code
The following source code implements a solution to the inverted indexer problem posed at the checkpoint. The source code is structurally very similar to the source for Word Count; only a few lines really need to be modified.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class LineIndexer {

  public static class LineIndexMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text val,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

      FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
      String fileName = fileSplit.getPath().getName();
      location.set(fileName);

      String line = val.toString();
      StringTokenizer itr = new StringTokenizer(line.toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, location);
      }
    }
  }

  public static class LineIndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

      boolean first = true;
      StringBuilder toReturn = new StringBuilder();
      while (values.hasNext()) {
        if (!first)
          toReturn.append(", ");
        first = false;
        toReturn.append(values.next().toString());
      }

      output.collect(key, new Text(toReturn.toString()));
    }
  }

  /**
   * The actual main() method for our program; this is the
   * "driver" for the MapReduce job.
   */
  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf = new JobConf(LineIndexer.class);

    conf.setJobName("LineIndexer");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    conf.setMapperClass(LineIndexMapper.class);
    conf.setReducerClass(LineIndexReducer.class);

    client.setConf(conf);

    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
"Hadoop Tutorial from Yahoo!" by Yahoo! Inc. is licensed under a Creative Commons Attribution 3.0 Unported License.