MapReduce Examples
Here are a few simple examples of interesting programs that can be easily
expressed as MapReduce computations.
Distributed Grep: The map function emits a line if it matches a supplied pattern.
The reduce function is an identity function that just copies the supplied intermediate
data to the output.
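The distributed grep description can be sketched in a few lines of Python. This is a minimal single-process illustration, not a distributed implementation; the function names `grep_map` and `grep_reduce`, the use of the line number as the key, and the sample lines are all assumptions made for the example.

```python
import re

def grep_map(line_no, line, pattern):
    """Emit the line (keyed by its line number) only if it matches the pattern."""
    if re.search(pattern, line):
        yield line_no, line

def grep_reduce(key, values):
    """Identity reduce: copy every intermediate value to the output unchanged."""
    for value in values:
        yield key, value

# Hypothetical input: four log lines, searched for lines starting with "error".
lines = ["error: disk full", "ok", "warning: low memory", "error: timeout"]

intermediate = [
    pair
    for line_no, line in enumerate(lines)
    for pair in grep_map(line_no, line, r"^error")
]
output = [
    result
    for key, value in intermediate
    for result in grep_reduce(key, [value])
]
```

Because the reduce step is an identity function, `output` holds exactly the pairs the map step emitted.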
Count of URL Access Frequency: The map function processes logs of web page
requests and outputs (URL, 1). The reduce function adds together all values for the
same URL and emits a (URL, total count) pair.
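As a rough sketch of the URL-count example, the following Python runs the map, grouping, and reduce steps in memory. The log format ("client URL"), the function names, and the sample data are assumptions for illustration; real web logs would need a proper parser.

```python
from collections import defaultdict

def url_map(log_line):
    """Emit (URL, 1) for one request. Assumed log format: '<client> <URL>'."""
    _client, url = log_line.split()
    yield url, 1

def url_reduce(url, counts):
    """Add together all values for the same URL."""
    yield url, sum(counts)

# Hypothetical request log.
logs = [
    "10.0.0.1 /index.html",
    "10.0.0.2 /about.html",
    "10.0.0.3 /index.html",
]

# Group the intermediate values by URL (the framework normally does this).
grouped = defaultdict(list)
for line in logs:
    for url, one in url_map(line):
        grouped[url].append(one)

totals = dict(
    pair for url, counts in grouped.items() for pair in url_reduce(url, counts)
)
```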
Reverse Web-Link Graph: The map function outputs (target, source) pairs for each
link to a target URL found in a page named source. The reduce function
concatenates the list of all source URLs associated with a given target URL and
emits the pair (target, list(source)).
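The reverse web-link graph can be illustrated with a small in-memory sketch. It assumes the links of each page have already been extracted into a list; the names `links_map`, `links_reduce`, and the sample pages are hypothetical.

```python
from collections import defaultdict

def links_map(source, targets):
    """Emit (target, source) for each link found in the page named source."""
    for target in targets:
        yield target, source

def links_reduce(target, sources):
    """Concatenate all source URLs that link to the given target."""
    yield target, list(sources)

# Hypothetical input: page name -> outgoing links already extracted from it.
pages = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}

grouped = defaultdict(list)
for source, targets in pages.items():
    for target, src in links_map(source, targets):
        grouped[target].append(src)

reverse_graph = dict(
    pair
    for target, sources in grouped.items()
    for pair in links_reduce(target, sources)
)
```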
Term-Vector per Host: A term vector summarizes the most important words that
occur in a document or a set of documents as a list of (word, frequency) pairs. The
map function emits a (hostname, term vector) pair for each input document (where
the hostname is extracted from the URL of the document). The reduce function is
passed all per-document term vectors for a given host. It adds these term vectors
together, throwing away infrequent terms, and then emits a final (hostname, term
vector) pair.
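A possible sketch of the term-vector example follows. The text above does not say how "infrequent" is defined, so the `min_count` threshold is purely an assumption, as are the whitespace tokenizer, the function names, and the sample documents.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def tv_map(url, text):
    """Emit (hostname, term vector) for one document."""
    yield urlparse(url).hostname, Counter(text.lower().split())

def tv_reduce(host, vectors, min_count=2):
    """Add the per-document term vectors and drop infrequent terms.

    min_count is an assumed cut-off; the source does not specify one.
    """
    total = Counter()
    for vector in vectors:
        total.update(vector)
    yield host, {word: n for word, n in total.items() if n >= min_count}

# Hypothetical documents keyed by URL.
docs = {
    "http://example.com/a": "map reduce map",
    "http://example.com/b": "reduce shuffle",
}

grouped = defaultdict(list)
for url, text in docs.items():
    for host, vector in tv_map(url, text):
        grouped[host].append(vector)

host_vectors = dict(
    pair
    for host, vectors in grouped.items()
    for pair in tv_reduce(host, vectors)
)
```

Here "shuffle" occurs only once across the host's documents, so it falls below the assumed threshold and is dropped.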
Inverted Index: The map function parses each document, and emits a sequence of (word,
document ID) pairs. The reduce function accepts all pairs for a given word, sorts the
corresponding document IDs and emits a (word, list(document ID)) pair. The set of all
output pairs forms a simple inverted index. It is easy to augment this computation to keep
track of word positions.
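The inverted index can be sketched as below. The whitespace tokenizer, the integer document IDs, and the de-duplication of IDs in the reduce step are assumptions added for the example; the description above only requires sorting.

```python
from collections import defaultdict

def index_map(doc_id, text):
    """Parse a document and emit a (word, document ID) pair per word."""
    for word in text.lower().split():
        yield word, doc_id

def index_reduce(word, doc_ids):
    """Sort the document IDs for a word (duplicates removed as an assumption)."""
    yield word, sorted(set(doc_ids))

# Hypothetical documents keyed by document ID.
docs = {
    2: "map reduce",
    1: "reduce sort reduce",
}

grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, emitted_id in index_map(doc_id, text):
        grouped[word].append(emitted_id)

inverted_index = dict(
    pair for word, ids in grouped.items() for pair in index_reduce(word, ids)
)
```

To keep track of word positions, the map function could emit (word, (document ID, position)) pairs instead.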
Distributed Sort: The map function extracts the key from each record, and emits a (key,
record) pair. The reduce function emits all pairs unchanged. This computation depends on
the partitioning facilities described in Section 4.1 and the ordering properties described in
Section 4.2 of the original MapReduce paper.
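One way to picture why the sort depends on partitioning and ordering is the single-process sketch below. It assumes a range partitioner (every key in partition 0 sorts before every key in partition 1) and simulates the within-partition key ordering with an explicit sort. The boundary value, the function names, and the records are hypothetical.

```python
import bisect

def sort_map(record):
    """Extract the key and emit (key, record). Assumes the key is field 0."""
    yield record[0], record

def range_partition(key, boundaries):
    """Assumed range partitioner: index of the key range the key falls into."""
    return bisect.bisect_right(boundaries, key)

def sort_reduce(key, records):
    """Emit all pairs unchanged."""
    for record in records:
        yield key, record

records = [("pear", 3), ("apple", 1), ("zebra", 9), ("mango", 5)]
boundaries = ["n"]  # keys below "n" go to partition 0, the rest to partition 1

partitions = [[] for _ in range(len(boundaries) + 1)]
for record in records:
    for key, rec in sort_map(record):
        partitions[range_partition(key, boundaries)].append((key, rec))

output = []
for partition in partitions:
    # Stand-in for the framework processing each partition in key order.
    partition.sort(key=lambda pair: pair[0])
    for key, rec in partition:
        output.extend(sort_reduce(key, [rec]))

sorted_keys = [key for key, _ in output]
```

Concatenating the partitions in order yields a globally sorted result only because the partitioner keeps key ranges disjoint and ordered; a hash partitioner would not give this property.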
In terms of types, the input keys and values are drawn from a different domain than
the output keys and values. Furthermore, the intermediate keys and values are from
the same domain as the output keys and values.
Word Count: We have a large collection of text documents in a folder, and we want to
count how often each distinct word occurs across them.
Map function
The Map function operates on every key/value pair of the input data and
transforms it according to the logic provided in the map function.
The Map function always emits intermediate key/value pairs as output:
Map(Key1, Value1) -> List(Key2, Value2)
For each file:
    Read each line from the input file
    Locate each word
    Emit (word, 1) for every word found
// The emitted (word, 1) pairs form the list that is output from the Map function
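The map steps above might look like this in Python. This is only a sketch: the name `wc_map`, the regular expression used to locate words, and the sample file contents are assumptions, and the file is passed in as a string rather than read from a folder.

```python
import re

def wc_map(filename, contents):
    """Map(Key1, Value1) -> List(Key2, Value2) for word count.

    Key1 is the file name, Value1 its contents; emits (word, 1) per word.
    """
    for line in contents.splitlines():            # read each line
        for word in re.findall(r"[a-z0-9']+", line.lower()):  # locate each word
            yield word, 1                          # emit (word, 1)

# Hypothetical two-line file.
pairs = list(wc_map("doc1.txt", "The cat\nthe hat"))
```

Note that the mapper does no counting of its own: a word that appears twice is simply emitted twice, and the totals are left to the reduce step.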
Shuffle and sort
Take the List(key, value) output from the mappers, then shuffle and sort the data
by key.
Group by key to create the list of values for each key.
Reduce function
The Reduce function takes each key together with its list of values and transforms
the data based on the (aggregation) logic provided in the reduce function. It is
similar to the aggregate functions in standard SQL.
Reduce(Key2, List(Value2)) -> List(Key3, Value3)
For each key (word), read the list of values (1, 1, 1, ...) associated with it.
Add up the list of values to calculate the sum.
Emit (word, sum) for every word found.
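The shuffle, group, and reduce steps for word count can be sketched as follows. The list `mapped` stands for what the mappers might have emitted and is made up for the example; `shuffle_and_sort` and `wc_reduce` are hypothetical names, and a real framework would perform the shuffle across machines rather than with an in-memory sort.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(pairs):
    """Sort (key, value) pairs by key, then group the values for each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def wc_reduce(word, ones):
    """Reduce(Key2, List(Value2)) -> List(Key3, Value3): sum the 1s per word."""
    yield word, sum(ones)

# Hypothetical mapper output for two small documents.
mapped = [("the", 1), ("cat", 1), ("sat", 1), ("the", 1), ("cat", 1), ("ran", 1)]

grouped = list(shuffle_and_sort(mapped))
counts = dict(
    pair for word, ones in grouped for pair in wc_reduce(word, ones)
)
```

The SQL analogy carries over directly: the grouping step plays the role of GROUP BY word, and the reduce function plays the role of SUM.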