
UNDERSTANDING INPUTS AND OUTPUTS OF MapReduce

DIVYA PANTA
21109
MapReduce Theory
• Map and Reduce functions consume and produce key-value pairs
  – Input and output types can range from Text to complex data structures
  – Specified via the Job's configuration
  – Relatively easy to implement your own
• Generally we can treat the flow as
  map: (K1, V1) → list(K2, V2)
  reduce: (K2, list(V2)) → list(K3, V3)
  – Reduce input types are the same as map output types (see the Java sketch below)
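
A sketch of how this flow lines up with Hadoop's Java API: the generic parameters of Mapper and Reducer correspond to (K1, V1), (K2, V2), and (K3, V3). The concrete types below are illustrative assumptions (a word count over text input), not fixed by the slide:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (K1, V1) -> list(K2, V2)
// here (K1, V1) = (LongWritable, Text): byte offset and line of input,
// and (K2, V2) = (Text, IntWritable): word and count
class ExampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// reduce: (K2, list(V2)) -> list(K3, V3)
// here (K3, V3) = (Text, IntWritable): word and total count
class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }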
Map Reduce Flow of Data
[Figure: on each node, a Data Split feeds a Mapper Task, which emits Map Output; the Map Outputs from all nodes feed the Reduce Task, which writes the Reduce output]
Key and Value Types
• Utilizes Hadoop's serialization mechanism for writing data in and out of the network, databases, or files
  – Optimized for network serialization
  – A set of basic types is provided
  – Easy to implement your own (see the sketch below)
• Extends the Writable interface
  – The framework's serialization mechanism
  – Defines how to read and write fields
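
A minimal sketch of a custom value type, assuming only the standard Hadoop I/O API; the class PointWritable and its fields are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
    private int x;
    private int y;

    // The framework requires a no-arg constructor to re-create instances
    public PointWritable() { }

    @Override
    public void write(DataOutput out) throws IOException {
        // Defines how fields are written out
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Defines how fields are read back, in the same order
        x = in.readInt();
        y = in.readInt();
    }
}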
Key and Value Types
• Keys must implement the WritableComparable interface
  – Extends Writable and java.lang.Comparable
  – Required because keys are sorted prior to the reduce phase
• Hadoop ships with many default implementations of WritableComparable
  – Wrappers for primitives (Text for String, IntWritable for Integer, etc.)
  – Or you can implement your own (sketched below)
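
To use a custom type as a key, it must also define a sort order. A minimal sketch, assuming the standard Hadoop I/O API; the class WordKey is hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class WordKey implements WritableComparable<WordKey> {
    private String word = "";

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
    }

    @Override
    public int compareTo(WordKey other) {
        // Defines the sort order applied to keys before the reduce phase
        return word.compareTo(other.word);
    }
}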
Inputs and Outputs
 The MapReduce framework operates on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
 The key and value classes have to be serializable by the framework and hence need to implement the Writable interface.
 Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output). A configuration sketch follows the table.

            INPUT               OUTPUT
MAP         <k1, v1>            list(<k2, v2>)
REDUCE      <k2, list(v2)>      list(<k3, v3>)
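
The key and value classes are declared on the Job configuration. A sketch assuming the standard org.apache.hadoop.mapreduce API; the driver class name and the word-count types are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        // <k2, v2>: the mapper's output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // <k3, v3>: the job's final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}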
Example
 Let us understand how MapReduce works by taking an example where we have a text file called example.txt whose contents are as follows:
Dear Bear River
Car Car River
Deer Car Bear
 Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of those unique words.
Cont...
 First, we divide the input into three splits as shown in the figure. This will distribute the work among all the map nodes.
 Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, will occur once.
 Now, a list of key-value pairs will be created where the key is nothing but the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: (Dear, 1), (Bear, 1), (River, 1). The mapping process remains the same on all the nodes. A sketch of such a mapper is shown below.
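
A minimal sketch of the tokenizing mapper just described, assuming the standard org.apache.hadoop.mapreduce API; the class name is illustrative:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the line,
        // e.g. "Dear Bear River" -> (Dear, 1), (Bear, 1), (River, 1)
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}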
Cont....
 After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer (see the sketch after this list).
 So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
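
How tuples with the same key end up at the same reducer can be sketched like this; the logic mirrors Hadoop's default hash-based partitioning, and the class name here is hypothetical:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Equal keys hash to the same partition, so all of a key's
        // tuples are sent to the same reducer
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}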
Cont....
 Now, each Reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list of values [1,1] for the key Bear. Then, it counts the number of ones in that list and gives the final output as (Bear, 2). A sketch of such a reducer follows.
 Finally, all the output key-value pairs are collected and written to the output file.
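
A minimal sketch of the counting reducer just described, assuming the standard org.apache.hadoop.mapreduce API; the class name is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Count the ones in the list, e.g. (Bear, [1, 1]) -> (Bear, 2)
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}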
Thank you
