Map Reduce Programming
Contents
• Map-Reduce Programming
• Exercises
• Mappers & Reducers
• Hadoop combiners
• Hadoop partitioners
Overview
• Hadoop MapReduce is a software framework for easily writing applications
which process vast amounts of data (multi-terabyte data-sets) in-parallel on
large clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
• A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner.
• The framework sorts the outputs of the maps, which are then input to the
reduce tasks.
• Typically both the input and the output of the job are stored in a file-
system.
• The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
• Typically the compute nodes and the storage nodes are the same, that
is, the MapReduce framework and the Hadoop Distributed File System
are running on the same set of nodes.
• This configuration allows the framework to effectively schedule tasks on
the nodes where data is already present, resulting in very high aggregate
bandwidth across the cluster.
• The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster node, and one MRAppMaster per application.
What is Map Reduce?
Word count Job
Input : Text file
Output : count of words
File.txt (Size: 500 MB):
Hi how are you
how is your job
how is your family
how is your sister
how is your brother
what is the time now
what is the strength of the Hadoop
Input formats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat.
(Diagram: the input file is divided into splits; each split is read as (byte offset, record) key-value pairs and fed to its own Mapper.)
Recall HashTable
(Diagram: data is distributed across Node-1 … Node-n; Map() runs on each node's local data and Reduce() aggregates the results by key.)
The MapReduce programming model
• MapReduce is a distributed programming model
• In many circles, considered the key building block for much of Google’s data analysis
• A programming language built on it: Sawzall,
http://labs.google.com/papers/sawzall.html
• … Sawzall has become one of the most widely used programming languages at Google. … [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes of data (2.8 PB) and wrote 9.9×10^12 bytes (9.3 TB).
• Other similar languages: Yahoo’s Pig Latin and Pig; Microsoft’s Dryad
• Cloned in open source: Hadoop,
http://hadoop.apache.org/
The MapReduce programming model
• Simple distributed functional programming primitives
• Modeled after Lisp primitives:
• map (apply function to all items in a collection) and
• reduce (apply function to set of items with a common key)
• We start with:
• A user-defined function to be applied to all data,
map: (key, value) → (key, value)
• Another user-specified operation
reduce: (key, {set of values}) → result
• A set of n nodes, each with data
• All nodes run map on all of their data, producing new data with keys
• This data is collected by key, then shuffled, and finally reduced
• Dataflow is through temp files on GFS
Simple example: Word count
map(String key, String value) {
    // key: document name, line no
    // value: contents of line
    for each word w in value:
        emit(w, "1")
}

reduce(String key, Iterator values) {
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    emit(key, result)
}
Simple example: Word count
(Diagram: each reducer is responsible for a key range.)
Input lines: (1, the apple), (2, is an apple), (3, not an orange), (4, because the), (5, orange), (6, unlike the apple), (7, is orange), (8, not green)
Mappers process line ranges (1-2), (3-4), (5-6), (7-8); reducers handle key ranges (A-G), (H-N), (O-U), (V-Z).
Reducer (A-G): (apple, {1,1,1}) → (apple, 3); (an, {1,1}) → (an, 2); (because, {1}) → (because, 1); (green, {1}) → (green, 1)
Reducer (H-N): (is, {1,1}) → (is, 2); (not, {1,1}) → (not, 2)
Reducer (O-U): (orange, {1,1,1}) → (orange, 3); (the, {1,1,1}) → (the, 3); (unlike, {1}) → (unlike, 1)
1. Each mapper receives some of the KV-pairs as input.
2. The mappers process the KV-pairs one by one.
3. Each KV-pair output by a mapper is sent to the reducer that is responsible for it.
4. The reducers sort their input by key and group it.
5. The reducers process their input one group at a time.
MapReduce: A Diagram
MapReduce: In Parallel
Steps of MapReduce
3 steps of MapReduce
• Sequentially read a lot of data
• Map: Extract something you care about
• Group by key: Sort and shuffle
• Reduce: Aggregate, summarize, filter or transform
• Output the result
MapReduce Examples #example1
• word count using MapReduce
• map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

    // e.g. map output: (hi,1) (how,1) (hi,1) (you,1)

• reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)

    // e.g. reduce input: (hi, (1,1)) gives output (hi, 2)
Counting words of different lengths #example2
• Input file:
hi how are you?
Welcome to Nirma University.
• Output file:
2:2, 3:3, 5:1, 7:1, 10:1
How?
hi:2, how:3, are:3, you:3, welcome:7, to:2, Nirma:5, University:10
• Mapper task:
• Emit (2,hi), (2,to), (3,how), ...
• Reducer task:
• (2: [hi, to])
• (3: [how, are, you])
• ...
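A possible mapper/reducer pair for this exercise, sketched against the Hadoop Java API; the class names are illustrative, and the input is assumed to arrive one line of text at a time (as with TextInputFormat).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: strip punctuation and emit (word length, word), e.g. (2, hi), (2, to), (3, how), ...
public class LengthMapper extends Mapper<Object, Text, IntWritable, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            String word = token.replaceAll("[^A-Za-z]", "");   // drop '?', '.', etc.
            if (!word.isEmpty())
                context.write(new IntWritable(word.length()), new Text(word));
        }
    }
}

// Reduce: count the words of each length, e.g. (3, [how, are, you]) gives (3, 3)
public class LengthCountReducer extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
    @Override
    public void reduce(IntWritable length, Iterable<Text> words, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (Text w : words) count++;
        context.write(length, new IntWritable(count));
    }
}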
Find out the word length histogram #example3
• A particular document is given, and we have to find out how many big, medium, small, and tiny words appear in it; this is the word length histogram.
Big: Yellow: 10+ letters
Medium: Red: 5 to 9 letters
Small: Blue: 2 to 4 letters
Tiny: Pink: 1 letter
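A minimal sketch of the mapper for this histogram, assuming the length thresholds listed above; the reducer can be the same IntSumReducer used for word count, summing the 1s per bucket. The class and bucket names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map: classify each word into a length bucket and emit (bucket, 1)
public class HistogramMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String w : value.toString().split("\\s+")) {
            if (w.isEmpty()) continue;
            int len = w.length();
            String bucket = (len >= 10) ? "big"
                          : (len >= 5)  ? "medium"
                          : (len >= 2)  ? "small"
                          : "tiny";
            context.write(new Text(bucket), one);
        }
    }
}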
Inverted Index #example4
• Finding which documents contain a given word, as a search engine does
• Input:
Tweet1, “I love pancakes for breakfast”
Tweet2, “I dislike pancakes”
Tweet3, “What should I eat for breakfast?”
Tweet4, “I love to eat”
• Output:
Pancakes(tweet1,tweet2)
Breakfast(tweet1,tweet3)
eat(tweet3,tweet4)
love(tweet1,tweet4)
• Find out Mapper and Reducer
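One possible mapper/reducer pair, sketched under the assumption that KeyValueTextInputFormat is used so the tweet id arrives as the key and the tweet text as the value; the class names are illustrative.

import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, tweetId) for every word in the tweet text
public class IndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text tweetId, Text text, Context context)
            throws IOException, InterruptedException {
        for (String w : text.toString().toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) context.write(new Text(w), tweetId);
        }
    }
}

// Reduce: collect the distinct tweet ids that mention each word
public class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text word, Iterable<Text> ids, Context context)
            throws IOException, InterruptedException {
        Set<String> docs = new TreeSet<>();
        for (Text id : ids) docs.add(id.toString());
        context.write(word, new Text(String.join(",", docs)));
    }
}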
Matrix Multiplication #example5
• Input format
• Map function
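The slide's original input format and map function are not reproduced here; the following is a minimal sketch of the standard one-pass MapReduce matrix multiplication C = A × B, assuming input lines of the form "A,i,j,value" or "B,j,k,value", with m (rows of A) and p (columns of B) stored in the job Configuration under illustrative property names.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: replicate every A(i,j) to all output cells (i,k) and every B(j,k) to all output cells (i,k)
public class MatrixMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = conf.getInt("matrix.m", 0);   // rows of A (illustrative property name)
        int p = conf.getInt("matrix.p", 0);   // columns of B (illustrative property name)
        String[] t = value.toString().split(",");
        if (t[0].equals("A")) {
            int i = Integer.parseInt(t[1]);
            int j = Integer.parseInt(t[2]);
            for (int k = 0; k < p; k++)
                context.write(new Text(i + "," + k), new Text("A," + j + "," + t[3]));
        } else {
            int j = Integer.parseInt(t[1]);
            int k = Integer.parseInt(t[2]);
            for (int i = 0; i < m; i++)
                context.write(new Text(i + "," + k), new Text("B," + j + "," + t[3]));
        }
    }
}

// Reduce: for each cell (i,k), pair A(i,j) with B(j,k) on the shared index j, multiply and sum
public class MatrixReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    public void reduce(Text cell, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<Integer, Double> a = new HashMap<>();
        Map<Integer, Double> b = new HashMap<>();
        for (Text v : values) {
            String[] t = v.toString().split(",");
            (t[0].equals("A") ? a : b).put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
        }
        double sum = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet())
            sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        context.write(cell, new DoubleWritable(sum));
    }
}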
Sum of even numbers and sum of odd
numbers of squares of a given number list
Sum of even numbers , sum of odd numbers and sum
of prime numbers of squares of a given number list
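A sketch under one reading of these exercises: square each number, then total the squares of the even numbers, of the odd numbers, and (for the second variant) of the primes. The tag names and the one-number-per-line input format are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit the square of each number under an "even"/"odd" tag, and also under "prime" when it applies
public class SquareSumMapper extends Mapper<Object, Text, Text, LongWritable> {
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        long n = Long.parseLong(value.toString().trim());
        LongWritable square = new LongWritable(n * n);
        context.write(new Text(n % 2 == 0 ? "even" : "odd"), square);
        if (isPrime(n)) context.write(new Text("prime"), square);
    }

    private static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) if (n % d == 0) return false;
        return true;
    }
}

// Reduce: total the squares received under each tag
public class SquareSumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text tag, Iterable<LongWritable> squares, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable s : squares) sum += s.get();
        context.write(tag, new LongWritable(sum));
    }
}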
In Detail
• Hadoop Mapper
• Hadoop Reducer
• Key-Value Pairs
• Input Format
• Record Reader
• Partitioner
Mapper in Hadoop Map-Reduce
• How is key value pair generated in Hadoop?
1. Input Split
2. Record Reader
InputSplit
• InputSplit in Hadoop MapReduce is the logical representation of data. It
describes a unit of work that contains a single map task in a MapReduce
program.
• As a user, we don’t need to deal with InputSplit directly, because they are
created by an InputFormat
• We can control the split size via the mapred.min.split.size parameter in mapred-site.xml, or by overriding the parameter in the Job object used to submit a particular MapReduce job (a configuration sketch follows the InputFormat listing below).
• The client (running the job) calculates the splits for a job by calling getSplits(); the splits are then sent to the application master, which uses their storage locations to schedule map tasks that will process them on the cluster.
• The map task then passes the split to the createRecordReader() method on the InputFormat to get a RecordReader for the split; the RecordReader generates records (key-value pairs), which it passes to the map function.
public abstract class InputFormat<K, V> {

    public abstract List<InputSplit> getSplits(JobContext context)
        throws IOException, InterruptedException;

    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
        TaskAttemptContext context) throws IOException, InterruptedException;
}
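The split size mentioned above can also be controlled per job. A minimal driver-side fragment, assuming the Hadoop 2.x property name mapreduce.input.fileinputformat.split.minsize (the newer name for mapred.min.split.size) and the FileInputFormat helper method:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Illustrative only: ask for splits of at least 128 MB for this job
Job job = Job.getInstance(new Configuration(), "split size demo");
FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
// Equivalent property-based form:
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);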
What is Hadoop InputFormat?
• The InputFormat class is one of the fundamental classes in the Hadoop
MapReduce framework which provides the following functionality:
1. The files or other objects that should be used for input are selected by the InputFormat.
2. InputFormat defines the data splits, which determine both the size of individual map tasks and their potential execution servers.
3. InputFormat defines the RecordReader, which is responsible for reading
actual records from the input files.
Types of InputFormat in MapReduce
1. FileInputFormat
• It is another form of TextInputFormat where the keys are the byte offsets of the lines and the values are the contents of the lines. So, each mapper receives a variable number of lines of input with TextInputFormat and KeyValueTextInputFormat.
• The number depends on the size of the split and on the length of the lines. So, if we want our mapper to receive a fixed number of lines of input, we use NLineInputFormat.
• N is the number of lines of input that each mapper receives.
• By default (N=1), each mapper receives exactly one line of input.
• Suppose N=2: then each split contains two lines, so one mapper receives the first two key-value pairs and another mapper receives the next two key-value pairs.
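A brief driver-side fragment illustrating how NLineInputFormat can be selected and N set for a job; the job name and N value are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Illustrative only: give every mapper exactly 2 lines of input
Job job = Job.getInstance(new Configuration(), "nline demo");
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 2);   // N = 2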
8. DBInputFormat
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Tokenize each input line and emit (word, 1) for every token
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    // Sum all the counts received for a word and emit (word, total)
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
In Driver class
• job.setCombinerClass(ReduceClass.class); here ReduceClass is whichever reducer class the job uses (IntSumReducer above). The combiner can also be written as a separate Java class in its own file.
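For reference, a minimal driver that wires the mapper and reducer above together and registers the reducer class as the combiner; the class name WordCountDriver is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);      // mapper shown above
        job.setCombinerClass(IntSumReducer.class);      // combiner reuses the reducer class
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}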
• A MapReduce job takes an input data set and produces a list of key-value pairs as the result of the map phase: the input data is split, each map task processes one split, and each map outputs a list of key-value pairs. The output of the map phase is then sent to the reduce tasks, which apply the user-defined reduce function to the map outputs.
• Before the reduce phase, the map output is partitioned on the basis of the key and sorted.
• Partitioning ensures that all the values for a single key go to the same reducer, so that each key's values are grouped together, and it allows the map output to be distributed evenly over the reducers.
• The Partitioner in Hadoop MapReduce redirects the mapper output to the reducers by determining which reducer is responsible for a particular key.
• The default partitioner in Hadoop MapReduce is HashPartitioner, which computes a hash value for the key and assigns the partition based on this result.
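The default HashPartitioner effectively computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A custom partitioner can override this routing; the following is an illustrative sketch, where the class name and routing rule are assumptions rather than anything from the slides.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route words beginning with a-m to the first reducer and all other keys to the last one
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) return 0;     // defensive: a single partition when no reducers run
        String word = key.toString();
        if (word.isEmpty()) return 0;
        char first = Character.toLowerCase(word.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : numReduceTasks - 1;
    }
}

It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class) together with job.setNumReduceTasks(2).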
How many Partitioners?