
Map Reduce Programming

Contents
• Map-Reduce Programming
• Exercises
• Mappers & Reducers
• Hadoop combiners
• Hadoop partitioners
Overview
• Hadoop MapReduce is a software framework for easily writing applications
which process vast amounts of data (multi-terabyte data-sets) in-parallel on
large clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
• A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner.
• The framework sorts the outputs of the maps, which are then input to the
reduce tasks.
• Typically both the input and the output of the job are stored in a file system.
• The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
• Typically the compute nodes and the storage nodes are the same, that
is, the MapReduce framework and the Hadoop Distributed File System
are running on the same set of nodes.
• This configuration allows the framework to effectively schedule tasks on
the nodes where data is already present, resulting in very high aggregate
bandwidth across the cluster.
• The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster node, and one MRAppMaster per application.
What is Map Reduce?
Word count job
• Input: a text file (File.txt, size 500 MB)
• Output: the count of each word

Contents of File.txt:
Hi how are you
how is your job
how is your family
how is your sister
how is your brother
what is the time now
what is the strength of the Hadoop
Input formats:
• Text Input Format
• Key Value Text Input Format
• Sequence File Input Format
• SequenceFileAsTextInput Format

Pipeline: Input file → Input splits → Record readers → Mappers
• The input file is divided into input splits; a record reader turns each split into (byte offset, record) pairs, which are fed to the mappers.
• The Java collection framework does not work on primitive types, so wrapper classes exist and objects of the wrapper classes are created; collections work with objects.
• Just as Java introduced a wrapper class for each primitive type, Hadoop has introduced Box classes.
• In Java the conversion from primitive to wrapper is done automatically (autoboxing), but in Hadoop we need to perform that conversion explicitly, e.g. new IntWritable(int) to box a value and get() to get it back.

Primitive type   Wrapper class   Box class
int              Integer         IntWritable
long             Long            LongWritable
float            Float           FloatWritable
double           Double          DoubleWritable
String           String          Text
char             Character       CharWritable
etc.             etc.            etc.
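A minimal sketch of explicit boxing and unboxing with the Hadoop box classes (a standalone illustration; the class name BoxClassDemo is made up for this example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class BoxClassDemo {
    public static void main(String[] args) {
        // Box a Java primitive into a Hadoop Writable explicitly
        IntWritable count = new IntWritable(42);
        int primitive = count.get();          // unbox back to an int

        // Strings are wrapped by Text
        Text word = new Text("hadoop");
        String plain = word.toString();       // back to a java.lang.String

        System.out.println(primitive + " " + plain);
    }
}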
Basic Idea
• Issue: copying data over a network takes time
• Idea:
  • Bring computation to the data
  • Store files multiple times for reliability
• MapReduce addresses these problems
  • Storage infrastructure – file system
    • Google: GFS
    • Hadoop: HDFS
  • Programming model
    • MapReduce
Recall HashTable
• A hash function maps input keys to buckets.
From HashTable to Distributed Hash Table (DHT)
Node-1, Node-2, ..., Node-n
• A distributed hash function maps input keys to physical nodes.
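A minimal sketch of the idea, assuming a simple modulo-based placement (real DHTs typically use consistent hashing; the class and method names here are illustrative):

public class SimpleHashing {
    // Local hash table: map a key to one of numBuckets buckets
    static int bucketFor(String key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // Distributed hash table: map a key to one of numNodes physical nodes
    static int nodeFor(String key, int numNodes) {
        return (key.hashCode() & Integer.MAX_VALUE) % numNodes;
    }

    public static void main(String[] args) {
        System.out.println("apple -> bucket " + bucketFor("apple", 16));
        System.out.println("apple -> node "   + nodeFor("apple", 4));
    }
}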
From DHT to MapReduce
Node-1, Node-2, ..., Node-n
• Map() and Reduce() run on the nodes that already hold the data.
The MapReduce programming model
• MapReduce is a distributed programming model
• In many circles, considered the key building block for much of Google's data analysis
• A programming language built on it: Sawzall,
  http://labs.google.com/papers/sawzall.html
  • "… Sawzall has become one of the most widely used programming languages at Google. … [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes of data (2.8 PB) and wrote 9.9×10^12 bytes (9.3 TB)."
• Other similar languages: Yahoo's Pig Latin and Pig; Microsoft's Dryad
• Cloned in open source: Hadoop,
  http://hadoop.apache.org/
The MapReduce programming model
• Simple distributed functional programming primitives
• Modeled after Lisp primitives:
  • map (apply a function to all items in a collection) and
  • reduce (apply a function to a set of items with a common key)
• We start with:
  • A user-defined function to be applied to all data:
    map: (key, value) → (key, value)
  • Another user-specified operation:
    reduce: (key, {set of values}) → result
  • A set of n nodes, each with data
• All nodes run map on all of their data, producing new data with keys
• This data is collected by key, then shuffled, and finally reduced
• Dataflow is through temp files on GFS
Simple example: Word count

map(String key, String value) {
  // key: document name, line no
  // value: contents of line
  for each word w in value:
    emit(w, "1")
}

reduce(String key, Iterator values) {
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  emit(key, result)
}

• Goal: Given a set of documents, count how often each word occurs
• Input: key-value pairs (document: line number, text of line)
• Output: key-value pairs (word, #occurrences)
• What should be the intermediate key-value pairs?
Simple example: Word count (data flow)
Each reducer is responsible for a key range: (A-G), (H-N), (O-U), (V-Z).

Input key-value pairs (line number, text):
(1, the apple) (2, is an apple) (3, not an orange) (4, because the)
(5, orange) (6, unlike the apple) (7, is orange) (8, not green)

Mapper outputs, e.g. from Mapper (1-2): (the, 1) (apple, 1) (is, 1) (an, 1) (apple, 1)
Reducer inputs after shuffle and grouping, e.g. for reducer (A-G): (apple, {1, 1, 1}) (an, {1, 1}) (because, {1}) (green, {1})
Reducer outputs: (apple, 3) (an, 2) (because, 1) (green, 1) (is, 2) (not, 2) (orange, 3) (the, 3) (unlike, 1)

1. Each mapper receives some of the KV-pairs as input.
2. The mappers process the KV-pairs one by one.
3. Each KV-pair output by the mapper is sent to the reducer that is responsible for it.
4. The reducers sort their input by key and group it.
5. The reducers process their input one group at a time.
MapReduce: A Diagram

MapReduce: In Parallel
Steps of MapReduce
The three core steps of MapReduce (plus reading the input and writing the output):
• Sequentially read a lot of data
• Map: extract something you care about
• Group by key: sort and shuffle
• Reduce: aggregate, summarize, filter, or transform
• Output the result
MapReduce Examples #example1
• Word count using MapReduce

map(keyin, valuein):
  // key: document name; value: text of document
  for each word w in value:
    emit(w, 1)
  // intermediate output, e.g. (hi,1) (how,1) (hi,1) (you,1)

reduce(keyin, values):
  // key: a word; values: an iterator over counts
  // input, e.g. (hi, (1,1))
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
Counting words of different lengths #example2
• Input file:
  hi how are you?
  Welcome to Nirma University.
• Output file:
  2: 2, 3: 3, 5: 1, 7: 1, 10: 1
• How? The word lengths are hi:2, how:3, are:3, you:3, Welcome:7, to:2, Nirma:5, University:10
• Mapper task:
  • Emit (length, word), e.g. (2, hi), (2, to), (3, how), ...
• Reducer task:
  • Receives (2: [hi, to])
  • (3: [how, are, you])
  • ...
  • and outputs the count of words for each length (see the sketch below)
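A minimal sketch of the mapper and reducer for this exercise, assuming the standard Hadoop MapReduce API (class names such as WordLengthCount are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordLengthCount {

  public static class LengthMapper extends Mapper<Object, Text, IntWritable, Text> {
    private final IntWritable length = new IntWritable();
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString(), " \t.,?!");
      while (itr.hasMoreTokens()) {
        String w = itr.nextToken();
        length.set(w.length());   // key: word length
        word.set(w);              // value: the word itself
        context.write(length, word);
      }
    }
  }

  public static class CountReducer extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (Text ignored : values) {
        count++;                  // count how many words have this length
      }
      context.write(key, new IntWritable(count));
    }
  }
}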
Find out the word length histogram #example3
• A particular document is given, and we have to find out how many big, medium, small, and tiny words appear in that document; this is the word length histogram (see the mapper sketch below).
  Big: yellow: 10+ letters
  Medium: red: 5 to 9 letters
  Small: blue: 2 to 4 letters
  Tiny: pink: 1 letter
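A minimal mapper sketch for the histogram, using the category boundaries above (the class name is illustrative); the reducer simply sums the 1s per category, exactly as in word count:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HistogramMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text category = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      int len = itr.nextToken().length();
      if (len >= 10)      category.set("big");     // 10+ letters
      else if (len >= 5)  category.set("medium");  // 5 to 9 letters
      else if (len >= 2)  category.set("small");   // 2 to 4 letters
      else                category.set("tiny");    // 1 letter
      context.write(category, ONE);
    }
  }
}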
Inverted Index #example4
• Finding which tweets/documents contain a given word, as in a search engine
• Input:
Tweet1, “I love pancakes for breakfast”
Tweet2, “I dislike pancakes”
Tweet3, “What should I eat for breakfast?”
Tweet4, “I love to eat”
• Output:
Pancakes(tweet1,tweet2)
Breakfast(tweet1,tweet3)
eat(tweet3,tweet4)
love(tweet1,tweet4)
• Find out the Mapper and Reducer (a sketch follows below)
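A minimal sketch, assuming each input record arrives as (tweetId, tweetText), for example via KeyValueTextInputFormat, and that duplicate tweet ids for a word are removed in the reducer (class names are illustrative):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    private final Text word = new Text();

    public void map(Text tweetId, Text tweetText, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(tweetText.toString().toLowerCase(), " ,.?!\"");
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, tweetId);   // emit (word, tweetId)
      }
    }
  }

  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterable<Text> tweetIds, Context context)
        throws IOException, InterruptedException {
      Set<String> ids = new HashSet<>();
      for (Text id : tweetIds) {
        ids.add(id.toString());         // drop duplicate tweet ids
      }
      context.write(word, new Text(String.join(",", ids)));
    }
  }
}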
Matrix Multiplication #example5
• Input format
• Map function

More exercises:
• Sum of even numbers and sum of odd numbers from the squares of a given number list
• Sum of even numbers, sum of odd numbers, and sum of prime numbers from the squares of a given number list
  (a sketch for the even/odd variant follows below)
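One possible sketch for the even/odd exercise, assuming the intended interpretation is: square each number in the list, then sum the squares of the even numbers and of the odd numbers separately (class names are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EvenOddSquareSum {

  public static class ParityMapper extends Mapper<Object, Text, Text, LongWritable> {
    private final Text parity = new Text();
    private final LongWritable square = new LongWritable();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().trim().split("\\s+")) {
        if (token.isEmpty()) continue;
        long n = Long.parseLong(token);
        parity.set(n % 2 == 0 ? "even" : "odd");  // group by parity of the number
        square.set(n * n);                        // emit its square as the value
        context.write(parity, square);
      }
    }
  }

  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }
}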
In Detail
• Hadoop Mapper
• Hadoop Reducer
• Key-Value Pairs
• Input Format
• Record Reader
• Partitioner
Mapper in Hadoop Map-Reduce
• How are key-value pairs generated in Hadoop?
1. Input Split
2. Record Reader
InputSplit
• InputSplit in Hadoop MapReduce is the logical representation of data. It
describes a unit of work that contains a single map task in a MapReduce
program.
• As a user, we don’t need to deal with InputSplit directly, because they are
created by an InputFormat
• We can control this value with the mapred.min.split.size parameter in mapred-site.xml, or by overriding the parameter in the Job object used to submit a particular MapReduce job.
• The client running the job can calculate the splits for a job by calling getSplits(); the splits are then sent to the application master, which uses their storage locations to schedule map tasks that will process them on the cluster.
• The map task then passes the split to the createRecordReader() method on the InputFormat to get a RecordReader for the split; the RecordReader generates records (key-value pairs), which it passes to the map function.
public abstract class InputFormat<K, V> {

  // Logically split the set of input files for the job into InputSplits.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Create a record reader that turns a given split into (key, value) records.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}
What is Hadoop InputFormat?
• The InputFormat class is one of the fundamental classes in the Hadoop
MapReduce framework which provides the following functionality:
1. The files or other objects that should be used for input are selected by the InputFormat.
2. InputFormat defines the data splits, which define both the size of the individual map tasks and their potential execution servers.
3. InputFormat defines the RecordReader, which is responsible for reading
actual records from the input files.
Types of InputFormat in MapReduce
1. FileInputFormat
• It is the base class for all file-based InputFormats.
• FileInputFormat also specifies the input directory where the data files are located.
• When we start a MapReduce job execution, FileInputFormat provides a path containing the files to read.
• This InputFormat reads all those files and then divides them into one or more InputSplits.
2. TextInputFormat
• It is the default InputFormat. It treats each line of each input file as a separate record and performs no parsing. TextInputFormat is useful for unformatted data or line-based records like log files. Hence,
• Key – the byte offset of the beginning of the line within the file (not within a single split), so it is unique if combined with the file name.
• Value – the contents of the line, excluding line terminators.
3. KeyValueTextInputFormat
• It is similar to TextInputFormat. This InputFormat also treats each line of input as a separate record.
• The difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line itself into key and value at a tab character ('\t'). Hence,
• Key – everything up to the tab character.
• Value – the remaining part of the line after the tab character. A driver sketch follows below.
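A minimal driver-side sketch of using KeyValueTextInputFormat (the separator property name below is from the newer MapReduce API; treat it as an assumption to verify against your Hadoop version, and the class name KeyValueInputDemo is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Split each line on a comma instead of the default tab ('\t')
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    Job job = Job.getInstance(conf, "kv input demo");
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // ... set mapper/reducer, paths, etc., as in the WordCount driver later in this deck
  }
}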
4. SequenceFileInputFormat
• It is an InputFormat which reads sequence files.
• Sequence files are binary files.
• These files store sequences of binary key-value pairs. They are block-compressed and provide direct serialization and deserialization of arbitrary data types.
• Key & Value are both user-defined.
5. SequenceFileAsTextInputFormat
• It is a variant of SequenceFileInputFormat. This format converts the sequence file's keys and values to Text objects, performing the conversion by calling toString() on the keys and values.
• Hence, SequenceFileAsTextInputFormat makes sequence files suitable input for streaming.
6. SequenceFileAsBinaryInputFormat
• By using SequenceFileAsBinaryInputFormat we can extract the sequence file's keys and values as opaque binary objects.
7. NLineInputFormat
• It is another form of TextInputFormat where the keys are the byte offsets of the lines and the values are the contents of the lines.
• With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; the number depends on the size of the split and on the length of the lines. If we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat.
• N is the number of lines of input that each mapper receives.
• By default (N=1), each mapper receives exactly one line of input.
• If N=2, then each split contains two lines: one mapper receives the first two key-value pairs, another mapper receives the next two key-value pairs, and so on. A driver sketch follows below.
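A minimal driver-side sketch, assuming the mapreduce-era NLineInputFormat with its setNumLinesPerSplit() helper (the class name NLineInputDemo is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineInputDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "nline demo");
    job.setInputFormatClass(NLineInputFormat.class);
    // Each mapper receives exactly 2 lines of input (N = 2)
    NLineInputFormat.setNumLinesPerSplit(job, 2);
    // ... remaining job setup as in the WordCount driver
  }
}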
8. DBInputFormat
• This InputFormat reads data from a relational database using JDBC. It is suited to loading small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs. Hence,
• Key – LongWritables
• Value – DBWritables
Where to mention?
• Driver code
• job.setInputFormatClass(DBInputFormat.class);
• job.setOutputFormatClass(DBOutputFormat.class);
Record Reader
• The MapReduce RecordReader in Hadoop takes the byte-oriented view of the input provided by the InputSplit and presents a record-oriented view to the Mapper.
• The map task passes the split to the createRecordReader() method on the InputFormat to obtain a RecordReader for that split. The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper.
Types of Hadoop Record Reader in MapReduce
i. LineRecordReader
ii. SequenceFileRecordReader

Maximum size for a single record:
• conf.setInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
• A line with a size greater than this maximum value (default is 2,147,483,647) will be ignored.
Hadoop Record Writer
• The RecordWriter writes the output key-value pairs from the Reducer phase to output files.
• TextOutputFormat
• SequenceFileOutputFormat
• SequenceFileAsBinaryOutputFormat
• MapFileOutputFormat
• MultipleOutputs
• LazyOutputFormat
• DBOutputFormat
Java Programs for Word count Problem
• Driver Code
• Mapper Code
• Reducer Code
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for every word in the input line, emit (word, 1)
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word and emit (word, total)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure and submit the job
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    //job.setCombinerClass(IntSumReducer.class);   // optional combiner (see the Combiner section)
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
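A typical way to run this job from the command line (the jar name and HDFS paths below are illustrative, not from the original slides):

hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output

The output directory must not exist beforehand; the job creates it and writes one part-r-NNNNN file per reducer.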
NCDC data
• Raw format of weather data
• Mapper input data
• Mapper output data
• Reducer input data
• Reducer output data
(sample records omitted)
Hadoop/MapReduce Combiners
• When we run a MapReduce job on a large dataset, the Mapper generates large chunks of intermediate data that are passed on to the Reducer for further processing, which leads to enormous network congestion. The MapReduce framework provides a function known as the Hadoop Combiner that plays a key role in reducing this network congestion.
• The combiner in MapReduce is also known as a 'mini-reducer'. The primary job of the Combiner is to process the output data from the Mapper before passing it to the Reducer. It runs after the Mapper and before the Reducer, and its use is optional.
MapReduce program without Combiner
MapReduce program with Combiner
Advantages of MapReduce Combiner
• The Hadoop Combiner reduces the time taken for data transfer between the mapper and the reducer.
• It decreases the amount of data that needs to be processed by the reducer.
• The Combiner improves the overall performance of the reducer.

Disadvantages of MapReduce Combiner
• MapReduce jobs cannot depend on the Hadoop Combiner's execution, because there is no guarantee that it will run.
• Hadoop may store the intermediate key-value pairs on the local filesystem and run the combiner later, which causes expensive disk I/O.
Where to define it?
In the Driver class:
• job.setCombinerClass(ReduceClass.class);
Or as a separate Java file:
• public class CombinersHadoop {
Hadoop Partitioner / MapReduce Partitioner
• Partitioning of the keys of the intermediate map output is controlled
by the Partitioner.
Need of Hadoop MapReduce Partitioner
• A MapReduce job takes an input data set and produces a list of key-value pairs as the result of the map phase: the input data is split, each map task processes its split, and each map outputs a list of key-value pairs. The output from the map phase is then sent to the reduce tasks, which run the user-defined reduce function on the map outputs.
• Before the reduce phase, the map output is partitioned on the basis of the key and sorted.
• This partitioning specifies that all the values for each key are grouped together and ensures that all the values of a single key go to the same reducer, thus allowing even distribution of the map output over the reducers.
• The Partitioner in Hadoop MapReduce redirects the mapper output to the reducers by determining which reducer is responsible for a particular key.
• The default partitioner in Hadoop MapReduce is HashPartitioner, which computes a hash value for the key and assigns the partition based on this result.
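For reference, the default hash-based partitioning essentially works as sketched below (a sketch matching the behaviour described above; the class name is illustrative):

import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default hash-based partitioning logic
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask the sign bit so the result is non-negative, then take the modulo
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}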
How many Partitioners?
• The total number of partitions in Hadoop is equal to the number of reducers, i.e. the Partitioner divides the data according to the number of reducers, which is set by the JobConf.setNumReduceTasks() method.
• Thus, the data from a single partition is processed by a single reducer, and a Partitioner is used only when there are multiple reducers.
Poor Partitioning in Hadoop MapReduce
• If in the input data one key appears far more often than any other key, then:
1. The frequently appearing key will be sent to one partition.
2. All the other keys will be sent to partitions according to their hashCode().
• If the hashCode() method does not uniformly distribute the other keys over the partition range, the data will not be evenly sent to the reducers.
• Poor partitioning of the data means that some reducers have more input data than others, i.e. they have more work to do than other reducers. So the entire job waits for one reducer to finish its extra-large share of the load.
• We can create a custom Partitioner, which allows the workload to be shared uniformly across the reducers.
• Partitioner provides the getPartition() method, which you can override if you want to define custom partitioning for your job, as in the example below.
• public static class MyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
      if (numReduceTasks == 0)
        return 0;
      if (key.equals(new Text("Male")))
        return 0;
      if (key.equals(new Text("Female")))
        return 1;
      return 0;   // fallback partition for any other key (all code paths must return)
    }
  }
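To wire this in, the driver would register the partitioner and a matching number of reducers (a short sketch in the same style as the other driver snippets in this deck; the class name follows the example above):

job.setPartitionerClass(MyPartitioner.class);
job.setNumReduceTasks(2);   // one reducer for "Male", one for "Female"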
How to set the number of reducers?
• By default, the number of reducers is 1.
• If you call JobConf.setNumReduceTasks(0), the number of reducers is 0 and the job is executed using mappers only; no sorting and shuffling is applied.
• Methods to set the number of reducers:
1. Command line (bin/hadoop jar -Dmapreduce.job.maps=5 yourapp.jar ...)
   mapred.map.tasks --> mapreduce.job.maps
   mapred.reduce.tasks --> mapreduce.job.reduces
2. In the code, one can configure the JobConf variables:
   job.setNumMapTasks(5); // 5 mappers
   job.setNumReduceTasks(2); // 2 reducers