Chapter 2: Introduction to MapReduce
Dr. Pooja K R
Traditional Way of Parallel & Distributed Processing
SOLUTION IS MAP REDUCE FRAMEWORK
1. MAPPER: Software that performs the assigned task after organizing the
imported data blocks by key.
2. REDUCER: Software that reduces the mapped data by
aggregation.
3. AGGREGATION: Groups the values of multiple rows together to
produce a single value of more significant meaning or measurement.
4. QUERYING FUNCTION: For example, finding the best student of a class.
Features of Map Reduce
1. Provides automatic parallelization and distribution of computation
across several processors.
2. Processes data stored on distributed clusters of DataNodes and
racks.
3. Scales to a large number of servers.
4. Provides the batch-oriented MapReduce programming model in
Hadoop version 1.
5. Provides additional processing modes in the Hadoop 2 YARN-based
system and enables the required parallel processing of data with
the 3V characteristics (volume, velocity, variety).
What is MapReduce?
● MapReduce is a programming framework.
● It allows us to perform parallel processing on large data sets in a
distributed environment.
Main Phases of Map reduce
● Map: each worker node applies the map function to the local data,
and writes the output to a temporary storage. A master node
ensures that only one copy of the redundant input data is
processed.
● Shuffle: worker nodes redistribute data based on the output keys
(produced by the map function), such that all data belonging to one
key is located on the same worker node.
● Reduce: worker nodes now process each group of output data, per
key, in parallel.
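The three phases can be imitated on a single machine. The sketch below is only an illustration, not Hadoop code: the function names and the three input lines are hypothetical, and word counting is used as the task.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word of every input line."""
    pairs = []
    for line in lines:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs

def shuffle_phase(pairs):
    """Shuffle: bring all values that belong to one key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: process each key's group on its own (here: sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input, one string per line of text.
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(sorted(word_counts.items()))
```

In a real cluster each phase runs on many worker nodes at once; here the phases simply run one after another.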
MapReduce Job Execution Flow
Mapper in Hadoop MapReduce
● A Hadoop Mapper task processes each input record and generates
new <key, value> pairs.
● The output <key, value> pairs can be completely different from the input
pair.
● The output of a mapper task is the full collection of all these <key,
value> pairs.
Reducer in Hadoop MapReduce
1. The input to the reducer is the <key, value> pairs output by the mapper.
2. The Hadoop Reducer takes the set of intermediate key-value pairs
produced by the mapper as input and runs a reducer function
on each of them.
3. One can aggregate, filter, and combine this (key, value) data in a
number of ways for a wide range of processing.
4. The reducer first processes the intermediate values for a particular key
generated by the map function and then generates the output (zero
or more key-value pairs).
Shuffling and Sorting
https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2017/01/hadoop-mapreduce-data-flow-execution-1.gif
Word Count using MapReduce Algorithm
MapReduce Examples
Daemons used in Map Reduce programming
JobTracker
• JobTracker process runs on a separate node and not usually on a DataNode.
• JobTracker is an essential Daemon for MapReduce execution in MRv1. It is
replaced by ResourceManager/ApplicationMaster in MRv2.
• JobTracker receives the requests for MapReduce execution from the client.
• JobTracker talks to the NameNode to determine the location of the data.
• JobTracker finds the best TaskTracker nodes to execute tasks based on the
data locality (proximity of the data) and the available slots to execute a task on
a given node.
• JobTracker monitors the individual TaskTrackers and submits the
overall status of the job back to the client.
• The JobTracker process is critical to the Hadoop cluster in terms of MapReduce
execution.
• When the JobTracker is down, HDFS will still be functional, but MapReduce
execution cannot be started and the existing MapReduce jobs will be halted.
TaskTracker
• TaskTracker runs on DataNodes, usually on all of them.
• TaskTracker is replaced by NodeManager in MRv2.
• Mapper and Reducer tasks are executed on DataNodes administered by
TaskTrackers.
• TaskTrackers will be assigned Mapper and Reducer tasks to execute by
JobTracker.
• TaskTracker will be in constant communication with the JobTracker signalling
the progress of the task in execution.
• TaskTracker failure is not considered fatal. When a TaskTracker becomes
unresponsive, JobTracker will assign the task executed by the TaskTracker to
another node.
Matrix-Vector Multiplication by MapReduce
●In the ranking of Web pages that goes on at search engines, n is in the tens of
billions.
Problem Statement
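Given,
●An n × n matrix M, whose element in row i and column j is m_ij.
●A vector v of length n.
●Assume that the row-column coordinates of each matrix element are discoverable,
either from its position in the file, or because it is stored with explicit
coordinates.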
Algorithm for Reduce Function:
Computing the mapper for Matrix A
# The ranges of k, i and j determine how many key-value pairs are produced.
# Here all of them are 2; therefore when k=1, i can have 2 values, 1 and 2,
# and each case can have 2 further values, j=1 and j=2.
# Substitute all values in the formula.
Computing the mapper for Matrix
The formula for Reducer is:
Computing the reducer:
(2, 1) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(3*5) + (4*7)] = 43 ------- (iii)
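The reducer computation for key (2, 1) can be checked with a few lines of Python. This is a minimal sketch, assuming each value arrives as a tagged triple (matrix name, j, number); the function name `reduce_matrix_cell` is hypothetical.

```python
def reduce_matrix_cell(key, values):
    """Reduce step for one output cell of the product matrix.

    Each value is (matrix_name, j, number), where j is the shared index.
    """
    a_list = {j: x for name, j, x in values if name == "A"}
    b_list = {j: x for name, j, x in values if name == "B"}
    # Multiply the entries that share the same j and add up the products.
    total = sum(a_list[j] * b_list[j] for j in a_list if j in b_list)
    return key, total

# Values for key (2, 1), taken from Alist and Blist in the example above.
values_for_2_1 = [("A", 1, 3), ("A", 2, 4), ("B", 1, 5), ("B", 2, 7)]
cell = reduce_matrix_cell((2, 1), values_for_2_1)
print(cell)  # ((2, 1), 43)
```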
●Case I
n is large, but not so large that vector v cannot fit in main memory.
●Case II
n is so large that vector v cannot fit in main memory.
The MapReduce Phase
Case I - n is large, but not so large that vector v cannot fit in
main memory
●The Map Function:
○ The Map function is written to apply to one element of M.
○ v is first read into the compute node executing a Map task and is
available for all applications of the Map function at that compute
node.
○ Each Map task operates on a chunk of the matrix M.
○ From each matrix element m_ij it produces the key-value pair
(i, m_ij * v_j).
The Reduce Function:
• The Reduce function simply sums all the values
associated with a given key i.
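The Map and Reduce functions for Case I can be sketched in plain Python. This is an illustration only: the function names are hypothetical, the shuffle is imitated with a dictionary, and the 3 × 3 matrix and vector are the ones used in the worked example in these slides (0-based indices).

```python
from collections import defaultdict

M = [[1, 4, 7],
     [2, 5, 8],
     [3, 6, 9]]
v = [1, 2, 3]  # assumed small enough to sit in memory at every Map task

def map_element(i, j, m_ij, v):
    """Map: one matrix element produces the pair (i, m_ij * v_j)."""
    return i, m_ij * v[j]

def reduce_row(i, values):
    """Reduce: sum all values associated with key i to obtain x_i."""
    return i, sum(values)

# Imitate the shuffle by grouping the mapped values by key i.
groups = defaultdict(list)
for i, row in enumerate(M):
    for j, m_ij in enumerate(row):
        key, value = map_element(i, j, m_ij, v)
        groups[key].append(value)

x = [reduce_row(i, groups[i])[1] for i in sorted(groups)]
print(x)  # [30, 36, 42]
```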
Map(i, j, m_ij)
  M (m_ij)      v (v_j)     x (x_i)
| 1 4 7 |       | 1 |       | x_0 |
| 2 5 8 |   x   | 2 |   =   | x_1 |
| 3 6 9 |       | 3 |       | x_2 |
Map Function
#   (i, j, m_ij)   v_j   (i, m_ij * v_j)
1   (0,0,1)        1     (0,1)
2   (0,1,4)        2     (0,8)
3   (0,2,7)        3     (0,21)
4   (1,0,2)        1     (1,2)
5   (1,1,5)        2     (1,10)
6   (1,2,8)        3     (1,24)
7   (2,0,3)        1     (2,3)
8   (2,1,6)        2     (2,12)
9   (2,2,9)        3     (2,27)
Shuffle and Reduce
30 36 42
Output:
30
36
42
Case 2: n is too large for vector v to fit into main memory
● v should be stored in computing nodes used for the Map task.
● Divide the matrix into vertical stripes of equal width and divide the vector
into an equal number of horizontal stripes, of the same height.
● Our goal is to use enough stripes so that the portion of the vector in one
stripe can fit conveniently into main memory at a compute node.
Division of a matrix and vector into five stripes
Case 2 (continued)
● The ith stripe of the matrix multiplies only components from the ith stripe of
the vector.
● Divide the matrix into one file for each stripe, and do the same for the
vector.
● Each Map task is assigned a chunk from one of the stripes of the matrix
and gets the entire corresponding stripe of the vector.
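The striping idea can be tried out on a small example. The sketch below is a single-machine illustration under stated assumptions: `striped_multiply` and `stripe_width` are hypothetical names, and the loop over stripes stands in for separate Map tasks that each hold only one stripe of v.

```python
from collections import defaultdict

def striped_multiply(M, v, stripe_width):
    """Multiply M by v while holding only one stripe of v at a time."""
    n = len(v)
    groups = defaultdict(list)  # key i -> partial products for row i
    for start in range(0, n, stripe_width):
        stop = min(start + stripe_width, n)
        v_stripe = v[start:stop]  # the only part of v this "Map task" sees
        for i, row in enumerate(M):
            # Columns start..stop-1 form the matching vertical stripe of M.
            for j in range(start, stop):
                groups[i].append(row[j] * v_stripe[j - start])
    # Reduce: sum the partial products for each row index i.
    return [sum(groups[i]) for i in range(len(M))]

M = [[1, 4, 7],
     [2, 5, 8],
     [3, 6, 9]]
v = [1, 2, 3]
striped = striped_multiply(M, v, stripe_width=1)
print(striped)  # [30, 36, 42]
```

The Reduce step is the same as in Case I; only the way v is supplied to the Map tasks changes.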
The reduce( ) step in the MapReduce Algorithm for matrix multiplication
•The inputs to the reduce( ) step
(function) of the MapReduce algorithm are:
•One row vector from matrix A.
•One column vector from matrix B.
The reduce( ) function will compute the inner product of the row vector from matrix A and the column vector from matrix B.
Preprocessing for the map( ) function
•The map( ) function (really) only has one input stream, of the format (key_i, value_i).
Pre-processing used for matrix multiplication:
Overview of the MapReduce Algorithm for Matrix Multiplication
•The map( ) function will duplicate each input element N times,
where N = # rows in matrix A (= # columns in matrix B)
Relational-Algebra Operations
●There are a number of operations on large-scale data that are used in database queries.
●Many traditional database applications involve retrieval of small amounts of data, even though the
database itself may be large.
●For example, a query may ask for the bank balance of one particular account. Such queries are not
useful applications of MapReduce.
●There are several standard operations on relations, often referred to as relational algebra, that are used
to implement queries.
Relational-Algebra Operations
●Selections
●Projections
●Union
●Intersection
●Difference
Representation of a table in HDFS
Selection in MapReduce
●Selections can be done most conveniently in the map portion alone, although they could also be done
in the reduce portion alone.
●The Map Function: For each tuple t in R, test if it satisfies the selection condition C.
■If so, produce the key-value pair (t,t). That is, both the key and value are t.
●The Reduce Function: The Reduce function is the identity. It simply passes each key-value pair to
the output.
Selection in Map Reduce
● Selection: σC(R)
○ Apply condition C to each tuple of relation R.
○ Produce in output a relation containing only tuples that satisfy C.
For our example we will do Selection(B <= 3). Select all the rows where value of B is less than or
equal to 3.
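Selection(B <= 3) can be sketched as follows. The table contents are hypothetical (rows are (A, B) tuples invented for illustration), and the function names are assumptions, not part of any Hadoop API.

```python
def select_map(row, condition):
    """Map: emit the pair (t, t) only if tuple t satisfies the condition."""
    return [(row, row)] if condition(row) else []

def select_reduce(key, values):
    """Reduce: the identity; pass every key-value pair to the output."""
    return [(key, value) for value in values]

# Hypothetical table with columns (A, B).
table = [(1, 2), (2, 4), (3, 3), (4, 7)]

pairs = []
for row in table:
    pairs.extend(select_map(row, lambda t: t[1] <= 3))

selected = [out_key
            for key, value in pairs
            for out_key, _ in select_reduce(key, [value])]
print(selected)  # [(1, 2), (3, 3)]
```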
Based on the number of reduce workers (2 in our case), the files for the reduce workers on the map workers
will look like:
Output of Selection
6 3
Projection Using Map Reduce
• Map Function: For each row r in the table, produce a key-value pair (r', r'), where r'
contains only the columns wanted in the projection.
• Reduce Function: The reduce function receives input of the form r': [r', r', r', r',
...]. Because removing some columns may leave duplicate rows, it
simply takes the value at index 0, which gets rid of the duplicates.
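A minimal sketch of projection onto columns (A, B), assuming a hypothetical three-column table (A, B, C) and hypothetical function names; the dictionary stands in for the shuffle.

```python
from collections import defaultdict

def project_map(row, wanted):
    """Map: keep only the wanted column positions and emit (r', r')."""
    r_prime = tuple(row[c] for c in wanted)
    return r_prime, r_prime

def project_reduce(key, values):
    """Reduce: the list holds copies of r'; take the one at index 0."""
    return values[0]

# Hypothetical table with columns (A, B, C).
table = [(1, 2, 9), (1, 2, 5), (3, 4, 9)]

groups = defaultdict(list)
for row in table:
    key, value = project_map(row, wanted=(0, 1))
    groups[key].append(value)

projected = sorted(project_reduce(k, vals) for k, vals in groups.items())
print(projected)  # [(1, 2), (3, 4)]
```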
Projection in MapReduce
●Projection is performed similarly to selection, but because projection may cause
the same tuple to appear several times, the Reduce function must eliminate
duplicates.
computing projection(A, B)
Union Using Map Reduce
• Map Function: For each row r generate key-value pair (r, r).
• Reduce Function: With each key there can be one or two values (as we
don't have duplicate rows); in either case just output the first value.
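Union can be sketched in a few lines, assuming the map step emits (r, r) for every row of either table. The two tables and the function names are hypothetical.

```python
from collections import defaultdict

def union_map(row):
    """Map: emit (r, r) for a row from either table."""
    return row, row

def union_reduce(key, values):
    """Reduce: one or two copies arrive; output the first in either case."""
    return values[0]

# Hypothetical tables with the same schema and no internal duplicates.
table1 = [(1, 2), (3, 4)]
table2 = [(3, 4), (5, 6)]

groups = defaultdict(list)
for row in table1 + table2:
    key, value = union_map(row)
    groups[key].append(value)

union_rows = sorted(union_reduce(k, vals) for k, vals in groups.items())
print(union_rows)  # [(1, 2), (3, 4), (5, 6)]
```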
Intersection Using Map Reduce
• Map Function: For each row r generate key-value pair (r, r) (same as
union).
• Reduce Function: With each key there can be one or two values (as we
don't have duplicate rows). If the list has length 2, we output the
first value; otherwise we output nothing.
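A short sketch of intersection, using hypothetical tables and function names. The only difference from union is the length test in the reduce step.

```python
from collections import defaultdict

def intersection_map(row):
    """Map: emit (r, r), exactly as for union."""
    return row, row

def intersection_reduce(key, values):
    """Reduce: output the row only if it arrived from both tables."""
    return [values[0]] if len(values) == 2 else []

# Hypothetical tables with the same schema and no internal duplicates.
table1 = [(1, 2), (3, 4)]
table2 = [(3, 4), (5, 6)]

groups = defaultdict(list)
for row in table1 + table2:
    key, value = intersection_map(row)
    groups[key].append(value)

common_rows = sorted(r for k, vals in groups.items()
                     for r in intersection_reduce(k, vals))
print(common_rows)  # [(3, 4)]
```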
Difference Using Map Reduce
• Map Function: For each row r create a key-value pair (r, T1) if the row is from
table 1; otherwise produce the key-value pair (r, T2).
• Reduce Function: Output the row if and only if the only value in the list is T1;
otherwise output nothing.
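Difference (table 1 minus table 2) can be sketched as follows. The tables and function names are hypothetical; the tags "T1" and "T2" are the ones used in the description above.

```python
from collections import defaultdict

def difference_map(row, table_name):
    """Map: tag each row with the name of the table it came from."""
    return row, table_name

def difference_reduce(key, values):
    """Reduce: keep the row only if T1 is the sole tag in its list."""
    return [key] if values == ["T1"] else []

# Hypothetical tables with the same schema and no internal duplicates.
table1 = [(1, 2), (3, 4)]
table2 = [(3, 4), (5, 6)]

groups = defaultdict(list)
for name, table in (("T1", table1), ("T2", table2)):
    for row in table:
        key, value = difference_map(row, name)
        groups[key].append(value)

only_in_table1 = sorted(r for k, vals in groups.items()
                        for r in difference_reduce(k, vals))
print(only_in_table1)  # [(1, 2)]
```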
Grouping and Aggregation Using Map Reduce
• Map Function: For each row in the table, take the attributes by which the grouping is
to be done as the key; the value is the attribute on which the aggregation is to be
performed.
• Reduce Function: Apply the aggregation operation (sum, max, min, avg, …) to the
list of values and output the result.
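A minimal sketch of group by (A, B) with sum(C), assuming a hypothetical three-column table (A, B, C). The function names are illustrative, and `sum` can be swapped for `max` or `min`.

```python
from collections import defaultdict

def group_map(row):
    """Map: the grouping attributes (A, B) are the key, C is the value."""
    a, b, c = row
    return (a, b), c

def group_reduce(key, values, aggregate=sum):
    """Reduce: apply the aggregation to the list of values for one key."""
    return key, aggregate(values)

# Hypothetical table with columns (A, B, C).
table = [(1, 2, 10), (1, 2, 5), (3, 4, 7)]

groups = defaultdict(list)
for row in table:
    key, value = group_map(row)
    groups[key].append(value)

totals = dict(group_reduce(k, vals) for k, vals in groups.items())
print(totals)  # {(1, 2): 15, (3, 4): 7}
```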
Output of group by (A, B) sum(C)
Natural Join Using Map Reduce
• Map Function: For two relations Table 1(A, B) and Table 2(B, C), the map
function creates key-value pairs of the form b: [(T1, a)] for table 1, where T1
records that the value a came from table 1; for table 2, the key-value
pairs are of the form b: [(T2, c)].
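The map step above can be sketched together with a reduce step. The reduce step shown here is an assumption, since only the map function is described: for each key b it pairs every a tagged T1 with every c tagged T2 and emits (a, b, c). Table contents and function names are hypothetical.

```python
from collections import defaultdict

def join_map(row, table_name):
    """Map: key on the shared attribute b and tag the other value."""
    if table_name == "T1":
        a, b = row
        return b, ("T1", a)
    b, c = row
    return b, ("T2", c)

def join_reduce(b, values):
    """Reduce (assumed): combine every T1 value with every T2 value."""
    a_values = [x for tag, x in values if tag == "T1"]
    c_values = [x for tag, x in values if tag == "T2"]
    return [(a, b, c) for a in a_values for c in c_values]

table1 = [(1, 2), (3, 2), (5, 6)]  # hypothetical Table 1(A, B)
table2 = [(2, 7), (6, 8), (9, 9)]  # hypothetical Table 2(B, C)

groups = defaultdict(list)
for name, table in (("T1", table1), ("T2", table2)):
    for row in table:
        key, value = join_map(row, name)
        groups[key].append(value)

joined = sorted(t for b, vals in groups.items() for t in join_reduce(b, vals))
print(joined)  # [(1, 2, 7), (3, 2, 7), (5, 6, 8)]
```

Rows whose b value appears in only one table, such as (9, 9), produce no output.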
Projection in MapReduce
●The Map Function: For each tuple t in R, construct a tuple t′ by eliminating
from t those components whose attributes are not in S. Output the key-value
pair (t′, t′).
●The Reduce Function: For each key t′ produced by any of the Map tasks,
there will be one or more key-value pairs (t′, t′). The Reduce function turns (t′,
[t′, t′, . . . , t′]) into (t′, t′), so it produces exactly one pair (t′, t′) for this key.
Projection in MapReduce
● Projection: πS(R)
○ Given a subset S of the attributes of relation R.
○ Produce as output a relation containing, for each tuple, only the components for the attributes in S.
Projection in MapReduce
● Similar process to selection.
○ But projection may cause the same tuple to appear several times!
● A MapReduce implementation of πS(R)
○ Map: For each tuple t in R, construct a tuple t′ by eliminating those
components whose attributes are not in S. Emit a key/value pair (t′, t′).
○ Reduce: For each key t′ produced by any of the Map tasks, fetch (t′, [t′, ···
, t′]). Emit a key/value pair (t′, t′).
Union
●Suppose relations R and S have the same schema.
●Map tasks will be assigned chunks from either R or S; it doesn't matter which.
●The Map tasks don't really do anything except pass their input tuples as key-value
pairs to the Reduce tasks.
○The Map Function: Turn each input tuple t into a key-value pair (t, t).
○The Reduce Function: Associated with each key t there will be either one or
two values. Produce output (t, t) in either case.
Intersection
●To compute the intersection, the same Map function can be used.
●However, the Reduce function must produce a tuple only if both relations have the
tuple. If the key t has a list of two values [t, t] associated with it, then the Reduce
task for t should produce (t, t).
●However, if the value-list associated with key t is just [t], then one of R and S is
missing t, so no tuple is produced for the intersection.
●The Map Function: Turn each tuple t into a key-value pair (t, t).
●The Reduce Function: If key t has value list [t, t], then produce (t, t). Otherwise,
produce nothing.
Difference
●The Map Function: For a tuple t in R, produce key-value pair (t,R), and for a
tuple t in S, produce key-value pair (t,S).
Note that the intent is that the value is the name of R or S (or better, a single
bit indicating whether the relation is R or S), not the entire relation.
●The Reduce Function: For each key t, if the associated value list is [R], then
produce (t, t). Otherwise, produce nothing.
https://medium.com/swlh/relational-operations-using-mapreduce-f49e8bd14e31
Thank You!!!