Introduction To MapReduce

AIDS – B.E – BDA


Dr. Pooja K Revankar
Assistant Professor,
Dept. of Computer Science and Engg.,
SIES Graduate School of Technology

Dr. Pooja K R
Traditional Way of Parallel & Distributed Processing

SOLUTION IS MAP REDUCE FRAMEWORK
1. MAPPER: Software that performs the assigned task after organizing the
imported data blocks by key.
2. REDUCER: Software that reduces the mapped data by aggregation.
3. AGGREGATION: Groups the values of multiple rows together to produce a
single value of more significant meaning or measurement.
4. QUERYING FUNCTION: For example, finding the best student of a class.

Features of Map Reduce
1. Provides automatic parallelization and distribution of computation
across several processors.
2. Processes data stored on distributed clusters of DataNodes and
racks.
3. Scales to a large number of servers.
4. Provides the batch-oriented MapReduce programming model in
Hadoop version 1.
5. Provides additional processing modes in the YARN-based Hadoop 2
system and enables the parallel processing required for data with the
3V characteristics (volume, velocity, variety).

What is MapReduce?
● MapReduce is a programming framework.
● It allows us to perform distributed and parallel processing on large
data sets in a distributed environment.

Main Phases of MapReduce
● Map: each worker node applies the map function to the local data,
and writes the output to a temporary storage. A master node
ensures that only one copy of the redundant input data is
processed.
● Shuffle: worker nodes redistribute data based on the output keys
(produced by the map function), such that all data belonging to one
key is located on the same worker node.
● Reduce: worker nodes now process each group of output data, per
key, in parallel.
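
The three phases above can be sketched in plain Python for the classic word-count task. This is a single-process illustration only, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_mapper, count_reducer) and the sample sentences are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Map: apply the mapper to every input record and collect the emitted pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Shuffle: bring all values that belong to one key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Reduce: process each group of values, one key at a time."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(word, counts):
    # Sum the 1s emitted for this word.
    return sum(counts)


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines, word_mapper)), count_reducer)
```

Calling word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"]) returns a dictionary with one count per distinct word. In a real cluster the three functions would run on different worker nodes; here they only show the data flow.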

MapReduce Job Execution Flow

Mapper in Hadoop MapReduce
● A Hadoop Mapper task processes each input record and generates
new <key, value> pairs.
● The <key, value> pairs can be completely different from the input
pair.
● The output of a mapper task is the full collection of all these <key,
value> pairs.

Reducer in Hadoop MapReduce
1. The input to the reducer is the <key, value> pairs output by the mapper.
2. The Hadoop Reducer takes the set of intermediate key-value pairs
produced by the mapper as input and runs a reducer function
on each of them.
3. One can aggregate, filter, and combine this (key, value) data in a
number of ways for a wide range of processing.
4. The reducer first processes the intermediate values for a particular key
generated by the map function and then generates the output (zero
or more key-value pairs).

Shuffling and Sorting

https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2017/01/hadoop-mapreduce-data-flow-execution-1.gif

Word Count using MapReduce Algorithm

MapReduce Examples

Daemons used in Map Reduce programming

JobTracker
• The JobTracker process runs on a separate node, not usually on a DataNode.
• JobTracker is an essential daemon for MapReduce execution in MRv1. It is
replaced by ResourceManager/ApplicationMaster in MRv2.
• JobTracker receives the requests for MapReduce execution from the client.
• JobTracker talks to the NameNode to determine the location of the data.
• JobTracker finds the best TaskTracker nodes to execute tasks based on
data locality (proximity of the data) and the available slots to execute a task on
a given node.
• JobTracker monitors the individual TaskTrackers and submits the
overall status of the job back to the client.
• The JobTracker process is critical to the Hadoop cluster in terms of MapReduce
execution.
• When the JobTracker is down, HDFS is still functional, but MapReduce
execution cannot be started and the existing MapReduce jobs are halted.

TaskTracker
• TaskTracker runs on DataNodes, usually on all of them.
• TaskTracker is replaced by NodeManager in MRv2.
• Mapper and Reducer tasks are executed on DataNodes administered by
TaskTrackers.
• TaskTrackers are assigned Mapper and Reducer tasks to execute by the
JobTracker.
• TaskTracker is in constant communication with the JobTracker, signalling
the progress of the task in execution.
• TaskTracker failure is not considered fatal. When a TaskTracker becomes
unresponsive, the JobTracker assigns the task executed by that TaskTracker to
another node.

Matrix-Vector Multiplication by MapReduce

● MapReduce was created to execute very large matrix-vector multiplications.

● In the ranking of Web pages that goes on at search engines, n is in the tens of billions.

● PageRank is an iterative algorithm.

● Also useful for simple (memory-based) recommender systems.

Problem Statement
Given:

● An n × n matrix M, whose element in row i and column j is denoted m_ij.

● A vector v of length n.

● Assume that:

○ The row-column coordinates of each matrix element are discoverable, either from its position in the file or because it is stored with explicit coordinates, as a triple (i, j, m_ij).

○ The position of element v_j in the vector v is discoverable in the analogous way.

● The matrix-vector product is the vector x of length n whose ith element is x_i = sum over j of (m_ij × v_j).

Algorithm for Map Function

Algorithm for Reduce Function:

Computing the mapper for Matrix A
# i, j and k each range over the matrix dimensions.
# Here all dimensions are 2, so for k = 1, i takes the values 1 and 2,
# and for each i, j takes the values 1 and 2.
# Each element a_ij of A is emitted once per k as ((i, k), (A, j, a_ij)).
# Substituting all values into the formula:

k  i  j  Mapper output
1  1  1  ((1, 1), (A, 1, 1))
1  1  2  ((1, 1), (A, 2, 2))
1  2  1  ((2, 1), (A, 1, 3))
1  2  2  ((2, 1), (A, 2, 4))
2  1  1  ((1, 2), (A, 1, 1))
2  1  2  ((1, 2), (A, 2, 2))
2  2  1  ((2, 2), (A, 1, 3))
2  2  2  ((2, 2), (A, 2, 4))

Computing the mapper for Matrix B
# Each element b_jk of B is emitted once per i as ((i, k), (B, j, b_jk)).

i  j  k  Mapper output
1  1  1  ((1, 1), (B, 1, 5))
1  1  2  ((1, 2), (B, 1, 6))
1  2  1  ((1, 1), (B, 2, 7))
1  2  2  ((1, 2), (B, 2, 8))
2  1  1  ((2, 1), (B, 1, 5))
2  1  2  ((2, 2), (B, 1, 6))
2  2  1  ((2, 1), (B, 2, 7))
2  2  2  ((2, 2), (B, 2, 8))

The formula for Reducer is:

Reducer(key, values), for each key (i, k):
  1. Make sorted lists Alist and Blist from the values tagged A and B.
  2. Compute sum = Summation over j of (Aij * Bjk).
  3. Output => ((i, k), sum)

Computing the reducer:

# We can observe from the Mapper computation
# that 4 keys are common: (1, 1), (1, 2), (2, 1) and (2, 2).
# Make a separate list for Matrix A and Matrix B for each key, with the
# values taken from the Mapper step.

(1, 1) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(1*5) + (2*7)] = 19 -------(i)

(1, 2) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(1*6) + (2*8)] = 22 -------(ii)

(2, 1) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(3*5) + (4*7)] = 43 -------(iii)

(2, 2) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(3*6) + (4*8)] = 50 -------(iv)

From (i), (ii), (iii) and (iv) we conclude that the reducer output is:
((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)
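
The mapper and reducer steps worked through above can be checked with a short Python sketch. It assumes square N × N matrices held as lists of rows and simulates the shuffle with a dictionary. The function names (matmul_map, matmul_reduce, mapreduce_matmul) are illustrative, not part of Hadoop.

```python
from collections import defaultdict


def matmul_map(name, matrix, n):
    """Emit ((i, k), (name, j, value)) pairs, replicating each element n times."""
    pairs = []
    for row_idx, row in enumerate(matrix, start=1):
        for col_idx, value in enumerate(row, start=1):
            for other in range(1, n + 1):
                if name == "A":
                    # a_ij contributes to every output cell (i, k), k = 1..n
                    pairs.append(((row_idx, other), ("A", col_idx, value)))
                else:
                    # b_jk contributes to every output cell (i, k), i = 1..n
                    pairs.append(((other, col_idx), ("B", row_idx, value)))
    return pairs


def matmul_reduce(key, values):
    """Pair up A and B entries that share j and sum their products."""
    a_list = {j: x for tag, j, x in values if tag == "A"}
    b_list = {j: x for tag, j, x in values if tag == "B"}
    return key, sum(a_list[j] * b_list[j] for j in a_list)


def mapreduce_matmul(a, b):
    n = len(a)  # assumption: both matrices are n x n
    groups = defaultdict(list)
    for key, value in matmul_map("A", a, n) + matmul_map("B", b, n):
        groups[key].append(value)  # shuffle: group by output cell (i, k)
    return dict(matmul_reduce(key, values) for key, values in sorted(groups.items()))
```

With A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], the matrices used in the slides, mapreduce_matmul(A, B) gives one entry per key (i, k), matching the four reducer results listed above.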
Solution

● Case I

n is large, but not so large that vector v cannot fit in main memory.

● Case II

n is so large that vector v cannot fit in main memory.

The MapReduce Phase

Case I - n is large, but not so large that vector v cannot fit in
main memory
● The Map Function:
○ The Map function is written to apply to one element of M.
○ v is first read into the compute node executing a Map task and is
available for all applications of the Map function at this compute
node.
○ Each Map task will operate on a chunk of the matrix M.
○ From each matrix element m_ij it produces the key-value pair
(i, m_ij × v_j).

The Reduce Function:
• The Reduce function simply sums all the values
associated with a given key i.

• The result will be a pair (i, x_i).

Map(i, j, m_ij)

| 1 4 7 |   | 1 |   | x_0 |
| 2 5 8 | × | 2 | = | x_1 |
| 3 6 9 |   | 3 |   | x_2 |
   m_ij      v_j      x_i

Map Function

No.  (i, j, m_ij)  v_j  Output (i, m_ij × v_j)
1    (0, 0, 1)     1    (0, 1)
2    (0, 1, 4)     2    (0, 8)
3    (0, 2, 7)     3    (0, 21)
4    (1, 0, 2)     1    (1, 2)
5    (1, 1, 5)     2    (1, 10)
6    (1, 2, 8)     3    (1, 24)
7    (2, 0, 3)     1    (2, 3)
8    (2, 1, 6)     2    (2, 12)
9    (2, 2, 9)     3    (2, 27)

Shuffle and Reduce

Key 0: [1, 8, 21] => 30
Key 1: [2, 10, 24] => 36
Key 2: [3, 12, 27] => 42

Output:

30

36

42
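
The worked example above can be reproduced with a small Python sketch of Case I, where the whole vector v is available to every Map call. The names mv_map, mv_reduce and mapreduce_matvec are illustrative assumptions, and indices are zero-based as in the Map Function table.

```python
from collections import defaultdict


def mv_map(element, v):
    """Map one matrix element (i, j, m_ij) to the pair (i, m_ij * v_j)."""
    i, j, m_ij = element
    return i, m_ij * v[j]


def mv_reduce(i, values):
    """Sum all products that share the row key i, giving (i, x_i)."""
    return i, sum(values)


def mapreduce_matvec(triples, v):
    groups = defaultdict(list)
    for element in triples:
        key, value = mv_map(element, v)
        groups[key].append(value)  # shuffle: collect products per row
    return [mv_reduce(i, groups[i])[1] for i in sorted(groups)]


# The matrix and vector from the slides, stored as explicit (i, j, m_ij) triples.
M = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
TRIPLES = [(i, j, M[i][j]) for i in range(3) for j in range(3)]
V = [1, 2, 3]
```

mapreduce_matvec(TRIPLES, V) returns the result vector as a list ordered by row index.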

Case II: n is so large that vector v cannot fit in main memory
● v should be stored at the compute nodes used for the Map tasks, but it no
longer fits there in its entirety.

● Divide the matrix into vertical stripes of equal width and divide the vector
into an equal number of horizontal stripes, of the same height.

● Our goal is to use enough stripes so that the portion of the vector in one
stripe can fit conveniently into main memory at a compute node.

Division of a matrix and vector into five stripes

Continued
● The ith stripe of the matrix multiplies only components from the ith stripe of
the vector.
● Divide the matrix into one file for each stripe, and do the same for the
vector.
● Each Map task is assigned a chunk from one of the stripes of the matrix
and gets the entire corresponding stripe of the vector.
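
A minimal sketch of the striping idea, assuming the matrix is a list of rows held in one process: each "Map task" sees one vertical stripe of the matrix together with the matching stripe of the vector, and the Reduce step sums the partial products per row as before. The function name striped_matvec and the num_stripes parameter are assumptions for this illustration.

```python
from collections import defaultdict


def striped_matvec(matrix, v, num_stripes):
    """Multiply matrix by v using vertical matrix stripes and vector stripes."""
    n = len(v)
    width = -(-n // num_stripes)  # ceiling division: columns per stripe
    groups = defaultdict(list)
    for s in range(num_stripes):
        lo, hi = s * width, min((s + 1) * width, n)
        v_stripe = v[lo:hi]  # only this portion of v is needed for stripe s
        # Map: every element of the stripe is multiplied by the matching
        # component of the vector stripe and emitted under its row index i.
        for i, row in enumerate(matrix):
            for j in range(lo, hi):
                groups[i].append(row[j] * v_stripe[j - lo])
    # Reduce: sum the partial products for each row key i.
    return [sum(groups[i]) for i in range(len(matrix))]
```

Changing num_stripes changes only how much of v each task needs at a time; the result is the same as in the unstriped case.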

The reduce( ) step in the MapReduce Algorithm for matrix multiplication

• The inputs to the reduce( ) step (function) of the MapReduce algorithm are:
  • One row vector from matrix A.
  • One column vector from matrix B.

The reduce( ) function will compute the inner product of the input row vector and column vector, which is one element of the product matrix.

Preprocessing for the map( ) function
• The map( ) function (really) only has one input stream, of the format ( key_i , value_i ).

Pre-processing used for matrix multiplication:

Overview of the MapReduce Algorithm for Matrix Multiplication

• The map( ) function will duplicate each matrix element N times,
where N = # rows in matrix A (= # columns in matrix B).

Relational-Algebra Operations
● There are a number of operations on large-scale data that are used in database queries.

● Many traditional database applications involve retrieval of small amounts of data, even though the database itself may be large.

● For example, a query may ask for the bank balance of one particular account. Such queries are not useful applications of MapReduce.

● There are several standard operations on relations, often referred to as relational algebra, that are used to implement queries.

● The queries themselves usually are written in SQL.

Relational-Algebra Operations

●Selections

●Projections

●Union

●Intersection

●Difference

Representation of a table in HDFS

Selection in MapReduce
● Selections can be done most conveniently in the map portion alone, although they could also be done in the reduce portion alone.

● Here is a MapReduce implementation of selection σC(R).

● The Map Function:

○ For each tuple t in R, test if it satisfies C.

■ If so, produce the key-value pair (t, t). That is, both the key and value are t.

● The Reduce Function: The Reduce function is the identity. It simply passes each key-value pair to the output.
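
The selection described above can be sketched in Python as follows. The relation is assumed to be a list of tuples, and the condition C is passed in as an ordinary function. The sample rows and the function names are hypothetical, chosen only for illustration.

```python
def selection_map(t, condition):
    """Emit (t, t) only when tuple t satisfies the condition C."""
    return [(t, t)] if condition(t) else []


def selection_reduce(key, values):
    """Identity reducer: pass every key-value pair through unchanged."""
    return [(key, value) for value in values]


def mapreduce_select(relation, condition):
    groups = {}
    for t in relation:
        for key, value in selection_map(t, condition):
            groups.setdefault(key, []).append(value)  # shuffle by key t
    output = []
    for key, values in groups.items():
        output.extend(k for k, _ in selection_reduce(key, values))
    return output


# Hypothetical relation R(A, B); select the rows where B <= 3.
R = [(1, 2), (6, 3), (4, 7)]
SELECTED = mapreduce_select(R, lambda t: t[1] <= 3)
```

Because the reducer is the identity, the same result could be produced by the Map tasks alone, which is why selection is usually described as a map-only job.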

Selection in Map Reduce
● Selection: σC(R)
○ Apply condition C to each tuple of relation R.
○ Produce in output a relation containing only tuples that satisfy C.

Selection in Map Reduce

For our example we will do Selection(B <= 3). Select all the rows where value of B is less than or
equal to 3.

Selection in Map Reduce

Based on the number of reduce workers (2 in our case), the files for the reduce workers on the map workers
will look like:

Output of Selection

6 3

Projection Using Map Reduce

• Map Function: For each row r in the table, produce a key-value pair (r', r'), where r'
contains only the columns wanted in the projection.

• Reduce Function: The reduce function receives inputs of the form r': [r', r', r', r',
...]. Because removing some columns may leave duplicate rows, it
just takes the value at index 0, getting rid of the duplicates.
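
A short Python sketch of this projection, assuming rows are tuples and the wanted columns are given as positions. The names projection_map, projection_reduce and mapreduce_project, and the sample rows, are illustrative assumptions.

```python
from collections import defaultdict


def projection_map(row, wanted):
    """Keep only the wanted column positions and emit (r', r')."""
    reduced = tuple(row[c] for c in wanted)
    return reduced, reduced


def projection_reduce(key, values):
    """All values for a key are identical, so the first one is enough."""
    return values[0]


def mapreduce_project(rows, wanted):
    groups = defaultdict(list)
    for row in rows:
        key, value = projection_map(row, wanted)
        groups[key].append(value)  # duplicates land in the same group
    return [projection_reduce(key, values) for key, values in groups.items()]


# Hypothetical table with columns (A, B, C); project onto (A, B).
ROWS = [(1, 2, 3), (1, 2, 4), (5, 6, 7)]
PROJECTED = mapreduce_project(ROWS, wanted=(0, 1))
```

The first two rows differ only in column C, so after projection they collapse into a single output row.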

Projection in MapReduce

● Projection is performed similarly to selection, but because projection may cause the same tuple to appear several times, the Reduce function must eliminate duplicates.

● We may compute the projection as follows.

computing projection(A, B)

Union Using Map Reduce

• Map Function: For each row r, generate the key-value pair (r, r).

• Reduce Function: Each key has one or two values (as we
don’t have duplicate rows within a table); in either case, just output the first value.
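
As a sketch, the union can be simulated in Python with two lists of tuples that share a schema. The function names and sample tables are assumptions made for the example.

```python
from collections import defaultdict


def union_map(row):
    """Emit (r, r) for every row, whichever table it came from."""
    return row, row


def union_reduce(key, values):
    """One or two identical values arrive per key; output the first."""
    return values[0]


def mapreduce_union(table1, table2):
    groups = defaultdict(list)
    for row in table1 + table2:
        key, value = union_map(row)
        groups[key].append(value)
    return [union_reduce(key, values) for key, values in groups.items()]


# Hypothetical tables with the same two-column schema.
T1 = [(1, 2), (3, 4)]
T2 = [(3, 4), (5, 6)]
UNION = mapreduce_union(T1, T2)
```

The row (3, 4) appears in both tables but only once in the result, because both copies reach the same reducer key.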

Intersection Using Map Reduce

• Map Function: For each row r, generate the key-value pair (r, r) (same as
union).

• Reduce Function: Each key has one or two values (as we
don’t have duplicate rows within a table). If the list has length 2, output the
first value; otherwise output nothing.
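
The intersection differs from the union only in the reducer, as this Python sketch shows. It assumes neither table contains duplicate rows. The names and sample tables are illustrative.

```python
from collections import defaultdict


def intersection_map(row):
    """Same mapper as for union: emit (r, r)."""
    return row, row


def intersection_reduce(key, values):
    """Output the row only when it arrived from both tables."""
    return values[0] if len(values) == 2 else None


def mapreduce_intersection(table1, table2):
    groups = defaultdict(list)
    for row in table1 + table2:
        key, value = intersection_map(row)
        groups[key].append(value)
    results = (intersection_reduce(key, values) for key, values in groups.items())
    return [row for row in results if row is not None]


# Hypothetical tables with the same schema and no duplicate rows.
T1 = [(1, 2), (3, 4)]
T2 = [(3, 4), (5, 6)]
COMMON = mapreduce_intersection(T1, T2)
```

If a table could contain duplicate rows, the length test would no longer be reliable and the values would need a table tag, as in the difference operation.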

Difference Using Map Reduce

• Map Function: For each row r, produce the key-value pair (r, T1) if the row is from
table 1; otherwise produce the key-value pair (r, T2).

• Reduce Function: Output the row if and only if the list of values is exactly [T1];
otherwise output nothing.
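
A Python sketch of the difference (table 1 minus table 2), with the strings "T1" and "T2" used as the table tags. The function names and sample tables are assumptions for the example.

```python
from collections import defaultdict


def difference_map(row, table_name):
    """Tag each row with the name of the table it came from."""
    return row, table_name


def difference_reduce(key, values):
    """Keep the row only if it was seen in T1 and never in T2."""
    return key if values == ["T1"] else None


def mapreduce_difference(table1, table2):
    groups = defaultdict(list)
    tagged = [(row, "T1") for row in table1] + [(row, "T2") for row in table2]
    for row, name in tagged:
        key, value = difference_map(row, name)
        groups[key].append(value)
    results = (difference_reduce(key, values) for key, values in groups.items())
    return [row for row in results if row is not None]


# Hypothetical tables with the same schema and no duplicate rows.
T1 = [(1, 2), (3, 4)]
T2 = [(3, 4), (5, 6)]
ONLY_IN_T1 = mapreduce_difference(T1, T2)
```

Only the tag travels as the value, so the shuffle moves much less data than it would if whole rows were sent twice.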

Grouping and Aggregation Using Map Reduce

• Map Function: For each row in the table, take the attributes by which the grouping is
to be done as the key, and the attributes on which the aggregation is to be
performed as the value.

• For example, if a relation has 4 columns A, B, C, D and we want to group by A, B
and do an aggregation on C, we make (A, B) the key and C the value.

• Reduce Function: Apply the aggregation operation (sum, max, min, avg, …) to the
list of values and output the result.
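
The example above (group by A, B and aggregate C) can be sketched in Python. Rows are assumed to be 4-tuples (A, B, C, D), and the aggregation is passed in as a function. The names and sample rows are illustrative.

```python
from collections import defaultdict


def group_map(row):
    """Use (A, B) as the key and C as the value; D is ignored."""
    a, b, c, _d = row
    return (a, b), c


def group_reduce(key, values, aggregate):
    """Apply the aggregation (sum, max, min, ...) to the grouped values."""
    return key, aggregate(values)


def mapreduce_group_by(rows, aggregate=sum):
    groups = defaultdict(list)
    for row in rows:
        key, value = group_map(row)
        groups[key].append(value)
    return dict(group_reduce(key, values, aggregate) for key, values in groups.items())


# Hypothetical relation with columns (A, B, C, D).
ROWS = [(1, 2, 10, 0), (1, 2, 5, 0), (3, 4, 7, 0)]
TOTALS = mapreduce_group_by(ROWS)          # group by (A, B), sum(C)
MAXIMA = mapreduce_group_by(ROWS, max)     # group by (A, B), max(C)
```

Swapping the aggregate argument is enough to change the query, since the map and shuffle steps are the same for every aggregation.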

Output of group by (A, B) sum(C)

Natural Join Using Map Reduce

• Map Function: For two relations Table 1(A, B) and Table 2(B, C), the map
function creates key-value pairs of the form b: [(T1, a)] for table 1, where T1
records the fact that the value a came from table 1. For table 2, the key-value
pairs are of the form b: [(T2, c)].

• Reduce Function: For a given key b, construct all possible combinations
of the values where one value is from table T1 and the other value is from
table T2. The output consists of key-value pairs of the form b: [(a, c)], each of which
represents one row a, b, c of the output table.
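
A Python sketch of this natural join, assuming Table 1 holds (a, b) tuples and Table 2 holds (b, c) tuples. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def join_map_t1(row):
    """Table 1(A, B): key on b, tag the value with T1."""
    a, b = row
    return b, ("T1", a)


def join_map_t2(row):
    """Table 2(B, C): key on b, tag the value with T2."""
    b, c = row
    return b, ("T2", c)


def join_reduce(b, values):
    """Combine every T1 value with every T2 value that shares the key b."""
    left = [x for tag, x in values if tag == "T1"]
    right = [x for tag, x in values if tag == "T2"]
    return [(a, b, c) for a in left for c in right]


def mapreduce_natural_join(table1, table2):
    groups = defaultdict(list)
    pairs = [join_map_t1(r) for r in table1] + [join_map_t2(r) for r in table2]
    for key, value in pairs:
        groups[key].append(value)
    output = []
    for b, values in groups.items():
        output.extend(join_reduce(b, values))
    return output


# Hypothetical tables: T1(A, B) and T2(B, C).
T1 = [(1, "x"), (2, "y")]
T2 = [("x", 10), ("x", 20), ("z", 30)]
JOINED = mapreduce_natural_join(T1, T2)
```

Keys that appear in only one table ("y" and "z" here) produce an empty combination list, so they drop out of the result without any special handling.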

Projection in MapReduce

● The Map Function: For each tuple t in R, construct a tuple t′ by eliminating from t those components whose attributes are not in S. Output the key-value pair (t′, t′).

● The Reduce Function: For each key t′ produced by any of the Map tasks, there will be one or more key-value pairs (t′, t′). The Reduce function turns (t′, [t′, t′, . . . , t′]) into (t′, t′), so it produces exactly one pair (t′, t′) for this key.

Projection in MapReduce
● Projection: πS(R)
○ Given a subset S of the attributes of relation R,
○ produce a relation containing only the components of each tuple for the attributes in S.

Projection in MapReduce
● Similar process to selection.
○ But projection may cause the same tuple to appear several times!
● A MapReduce implementation of πS(R)
○ Map: For each tuple t in R, construct a tuple t′ by eliminating those
components whose attributes are not in S. Emit a key/value pair (t′, t′).

○ Reduce: For each key t′ produced by any of the Map tasks, fetch (t′, [t′, ···, t′]).
Emit a key/value pair (t′, t′).

Union
● Suppose relations R and S have the same schema.

● Map tasks will be assigned chunks from either R or S; it doesn’t matter which.

● The Map tasks don’t really do anything except pass their input tuples as key-value pairs to the Reduce tasks.

● The latter need only eliminate duplicates as for projection.

○ The Map Function: Turn each input tuple t into a key-value pair (t, t).

○ The Reduce Function: Associated with each key t there will be either one or two values. Produce output (t, t) in either case.


Intersection
● To compute the intersection, we can use the same Map function.

● However, the Reduce function must produce a tuple only if both relations have the tuple. If the key t has a list of two values [t, t] associated with it, then the Reduce task for t should produce (t, t).

● However, if the value-list associated with key t is just [t], then one of R and S is missing t, so we don’t want to produce a tuple for the intersection.

● The Map Function: Turn each tuple t into a key-value pair (t, t).

● The Reduce Function: If key t has value list [t, t], then produce (t, t). Otherwise, produce nothing.

Difference

● The Map Function: For a tuple t in R, produce key-value pair (t, R), and for a tuple t in S, produce key-value pair (t, S).

Note that the intent is that the value is the name of R or S (or better, a single bit indicating whether the relation is R or S), not the entire relation.

● The Reduce Function: For each key t, if the associated value list is [R], then produce (t, t). Otherwise, produce nothing.

https://medium.com/swlh/relational-operations-using-mapreduce-f49e8bd14e31

Thank You!!!