Hadoop: Developing Applications
Word Count Example
The goal of this example is to count the number of occurrences of each word in a given
text.
class MAPPER
  method MAP(docID a, doc d)
    for all term t in doc d do
      EMIT(term t, count 1)
The MAP method takes an input pair and produces a set of intermediate
<key,value> pairs. Then all the intermediate values associated with the same
intermediate key are grouped by the MapReduce library (shuffle phase).
Word Count Example
class REDUCER
  method REDUCE(term t, counts [c1,c2,...])
    sum = 0
    for all count c in counts [c1,c2,...] do
      sum = sum + c
    EMIT(term t, count sum)
The REDUCE method receives an intermediate key and the set of values for
that key, and merges these values together to form a smaller set of values.
Word Count Example
Suppose we are given the following input file:
We are not what
we want to be,
but at least
we are not what
we used to be.
In the map phase the text is tokenized into words. Then a <word,1> pair is emitted
for each word.
Remember that <key, value> pairs are generated in parallel on many machines.
Each map task handles a small part of the overall input.
Word Count Example
Considering our input text, in preparation for the reduce phase all the “we” pairs are
grouped together, all the “what” pairs are grouped together, etc.
<we, 1> <we, 1> <we, 1> <we, 1> --> <we, [1,1,1,1]>
<are, 1> <are, 1> --> <are, [1,1]>
<not, 1> <not, 1> --> <not, [1,1]>
...
In the reduce phase the reduce function is called once for each key. Keys are processed
in sorted order, so the output is in increasing order by key:
<are, 2>; <at, 1>; <be, 2>; <but, 1>; <least, 1>; <not, 2>; <to, 2>;
<used, 1>; <want, 1>; <we, 4>; <what, 2>
As in the map phase, the reduce phase also runs in parallel. Each machine is
assigned a subset of the keys to work on, and each writes its results to a separate output file.
Word Count::The Map source code
LongWritable, Text and IntWritable are Hadoop-specific data types designed for
efficient serialization. They correspond to Java data types: LongWritable
is the equivalent of long, IntWritable of int, and Text of String.
Mapper<LongWritable, Text, Text, IntWritable> gives the data types of the input and
output key/value pairs. With TextInputFormat, the input key (LongWritable) is the byte
offset of the line in the file and the input value (Text) is the line itself. The output is of
the form <word,1>, hence the output data types are Text and IntWritable.
Word Count::The Map source code
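A minimal sketch of the Map class along those lines, assuming the classic
org.apache.hadoop.mapred API used by the driver later in these slides:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // key: byte offset of the line in the input split, value: the line itself
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, ONE);   // emit <word, 1>
        }
    }
}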
Word Count::The Reduce source code
Considering Text, IntWritable, Text, IntWritable, the first two refer to the data types of
the reducer's input (<we,1>). The last two refer to the data types of its output
(<we,#occurrences>).
The reduce method has the signature reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter).
After the sort and shuffle phase, the input it receives from the mappers is of the
form <we,[1,1,1,1]>.
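A minimal sketch of the Reduce class with that signature, again assuming the classic
mapred API; it sums the counts for each word:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();             // add up the 1s for this word
        }
        output.collect(key, new IntWritable(sum));  // emit <word, #occurrences>
    }
}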
Word Count::The driver
public static void main(String[] args) throws Exception {
    // Note: the JobConf creation and the input/output path lines follow the
    // standard org.myorg.WordCount driver.
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // the reducer also serves as a combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
Compilation and run
$ mkdir classes
$ javac -classpath /usr/share/hadoop/hadoop-core-0.20.204.0.jar -d classes/ *.java
$ jar -cvf wordcount.jar -C classes/ .
$ hadoop dfs -ls input/
$ hadoop jar wordcount.jar org.myorg.WordCount input/ output/
$ hadoop dfs -cat output/part-00000
Exercise::Inverted index
Pseudocode
The exercise is to build an inverted index: for each term, the list of documents in
which it appears, together with its frequency in each document. The MAP and REDUCE
bodies below are completed along the lines suggested by the posting type described on
the next slides (per-document term frequencies collected into a posting list per term):

class MAPPER
  method MAP(docID n, doc d)
    H = new AssociativeArray            // term -> frequency of term in d
    for all term t in doc d do
      H{t} = H{t} + 1
    for all term t in H do
      EMIT(term t, posting <n, H{t}>)

class REDUCER
  method REDUCE(term t, postings [<n1,f1>,<n2,f2>,...])
    P = new List
    for all posting <n,f> in postings do
      APPEND(P, <n,f>)
    EMIT(term t, postings P)
Custom Data Types
In Hadoop we are free to define our own data types. For the above
pseudocode we must implement an object that represents a posting,
composed of a document identifier and a term frequency.
Objects marshaled to or from files and across the network must implement
the Writable interface, which allows Hadoop to read and write the data in
a serialized form for transmission.
The Writable interface requires two methods:
public interface Writable {
void readFields(DataInput in) throws IOException;
void write(DataOutput out) throws IOException;
}
The readFields() method initializes all of the fields of the object from the data
contained in the binary stream in. The write() method serializes the
object to the binary stream out.
Custom Data Types
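A minimal sketch of a posting type implementing Writable; the class name and field
names are illustrative, not taken from the slides:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class Posting implements Writable {

    private int docId;   // document identifier
    private int freq;    // frequency of the term in that document

    public Posting() {}  // no-arg constructor: Hadoop creates instances via reflection

    public Posting(int docId, int freq) {
        this.docId = docId;
        this.freq = freq;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(docId);     // serialize the fields to the binary stream
        out.writeInt(freq);
    }

    public void readFields(DataInput in) throws IOException {
        docId = in.readInt();    // initialize the fields from the binary stream
        freq = in.readInt();
    }

    public String toString() {
        return docId + ":" + freq;
    }
}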
Custom Key Types
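Values only need to be Writable, but keys are also sorted and compared during the
shuffle, so a custom key type must implement WritableComparable. A sketch, assuming
an illustrative key made of a term and a document identifier:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class TermDocKey implements WritableComparable<TermDocKey> {

    private String term;
    private int docId;

    public TermDocKey() {}

    public void write(DataOutput out) throws IOException {
        out.writeUTF(term);
        out.writeInt(docId);
    }

    public void readFields(DataInput in) throws IOException {
        term = in.readUTF();
        docId = in.readInt();
    }

    // keys are sorted by term first, then by document identifier
    public int compareTo(TermDocKey other) {
        int cmp = term.compareTo(other.term);
        if (cmp != 0) return cmp;
        return docId < other.docId ? -1 : (docId == other.docId ? 0 : 1);
    }

    // hashCode() is used by the default HashPartitioner to assign keys to reducers
    public int hashCode() {
        return term.hashCode() * 163 + docId;
    }

    public boolean equals(Object o) {
        if (!(o instanceof TermDocKey)) return false;
        TermDocKey k = (TermDocKey) o;
        return term.equals(k.term) && docId == k.docId;
    }
}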
Using Custom Types
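When the mapper and reducer use custom types, the driver has to declare them through
the JobConf setters; a sketch assuming the Posting type above (the type choices are
illustrative):

// Intermediate (map output) types differ from the final output types here,
// so both pairs of setters are needed.
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Posting.class);   // mapper emits <term, posting>
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);         // reducer emits the posting list as text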
Partitioning Data
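The partitioner decides which reducer receives each intermediate <key,value> pair; by
default HashPartitioner assigns keys by hash code. A sketch of a custom partitioner for
the classic mapred API (the class and type names are illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class TermPartitioner implements Partitioner<Text, Posting> {

    public void configure(JobConf job) {}   // no configuration needed

    // route each term to a reducer; all postings of a term reach the same reducer
    public int getPartition(Text key, Posting value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// registered in the driver with:
// conf.setPartitionerClass(TermPartitioner.class);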