Hadoop: Developing Applications
Word Count Example
The goal of this example is to count the number of occurrences of each word in a given
text.
class MAPPER
  method MAP(docID a, doc d)
    for all term t in doc d do
      EMIT(term t, count 1)
The MAP method takes an input pair and produces a set of intermediate
<key,value> pairs. Then all the intermediate values associated with the same
intermediate key are grouped by the MapReduce library (shuffle phase).
Word Count Example
class REDUCER
  method REDUCE(term t, counts [c1,c2,...])
    sum = 0
    for all count c in counts [c1,c2,...] do
      sum = sum + c
    EMIT(term t, count sum)
The REDUCE method receives an intermediate key and the set of values for
that key, and merges these values together to form a smaller set of values.
Word Count Example
Suppose we are given the following input file:
We are not what
we want to be,
but at least
we are not what
we used to be.
In the map phase the text is tokenized into words. Then a <word,1> pair is emitted
for each word.
Remember that <key, value> pairs are generated in parallel on many machines.
Each map task handles a small part of the overall input.
Word Count Example
Considering our input text, in preparation for the reduce phase all the “we” pairs are
grouped together, all the “what” pairs are grouped together, etc.
<we, 1> <we, 1> <we, 1> <we, 1> --> <we, [1,1,1,1]>
<are, 1> <are, 1> --> <are, [1,1]>
<not, 1> <not, 1> --> <not, [1,1]>
...
In the reduce phase the reduce function is called once for each key. Keys are processed
in sorted order, so the output is in increasing order by key:
<are, 2>; <at, 1>; <be, 2>; <but, 1>; <least, 1>; <not, 2>; <to, 2>;
<used, 1>; <want, 1>; <we, 4>; <what, 2>
As in the map phase, the reduce phase also runs in parallel. Each machine is
assigned a subset of the keys to work on, and each writes its results to a separate output file.
Word Count::The Map source code
LongWritable, Text and IntWritable are Hadoop-specific data types designed for
efficient serialization. They correspond to Java data types: LongWritable
is the equivalent of long, IntWritable of int, and Text of String.
Mapper<LongWritable, Text, Text, IntWritable> gives the data types of the input and
output key/value pairs. With TextInputFormat, the input key (LongWritable) is the byte
offset of the line in the file and the input value (Text) is the line itself. The output is of
the form <word,1>, hence the output data types are Text and IntWritable.
Word Count::The Map source code
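A minimal sketch of the Map class along those lines, assuming the classic
org.apache.hadoop.mapred API used by the driver later in these slides:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // key: byte offset of the line in the input split, value: the line itself
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, ONE);   // emit <word, 1>
        }
    }
}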
Word Count::The Reduce source code
Considering Text, IntWritable, Text, IntWritable, the first two refer to the data types of
the reducer's input (<we,1>). The last two refer to the data types of its output
(<we,#occurrences>).
The reduce method has the signature reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter).
After the sort and shuffle phase, the input it receives from the mappers is of the
form <we,[1,1,1,1]>.
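A minimal sketch of the Reduce class with that signature, again assuming the classic
mapred API; it sums the counts for each word:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();             // add up the 1s for this word
        }
        output.collect(key, new IntWritable(sum));  // emit <word, #occurrences>
    }
}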
Word Count::The driver
public static void main(String[] args) throws Exception {
    // Note: the JobConf creation and the input/output path lines follow the
    // standard org.myorg.WordCount driver.
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // the reducer also serves as a combiner
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}
Compilation and run
$ mkdir classes
$ javac -classpath /usr/share/hadoop/hadoop-core-0.20.204.0.jar -d classes/ *.java
$ jar -cvf wordcount.jar -C classes/ .
$ hadoop dfs -ls input/
$ hadoop jar wordcount.jar org.myorg.WordCount input/ output/
$ hadoop dfs -cat output/part-00000
Exercise::Inverted index
Pseudocode
The exercise is to build an inverted index: for each term, the list of documents in
which it appears, together with its frequency in each document. The MAP and REDUCE
bodies below are completed along the lines suggested by the posting type described on
the next slides (per-document term frequencies collected into a posting list per term):

class MAPPER
  method MAP(docID n, doc d)
    H = new AssociativeArray            // term -> frequency of term in d
    for all term t in doc d do
      H{t} = H{t} + 1
    for all term t in H do
      EMIT(term t, posting <n, H{t}>)

class REDUCER
  method REDUCE(term t, postings [<n1,f1>,<n2,f2>,...])
    P = new List
    for all posting <n,f> in postings do
      APPEND(P, <n,f>)
    EMIT(term t, postings P)
Custom Data Types
In Hadoop we are free to define our own data types. For the above
pseudocode we must implement an object that represents a posting,
composed of a document identifier and a term frequency.
Objects marshaled to or from files and across the network must implement
the Writable interface, which allows Hadoop to read and write the data in
a serialized form for transmission.
The Writable interface requires two methods:
public interface Writable {
void readFields(DataInput in) throws IOException;
void write(DataOutput out) throws IOException;
}
The readFields() method initializes all of the fields of the object from the data
contained in the binary stream in. The write() method serializes the
object to the binary stream out.
Custom Data Types
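A minimal sketch of a posting type implementing Writable; the class name and field
names are illustrative, not taken from the slides:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class Posting implements Writable {

    private int docId;   // document identifier
    private int freq;    // frequency of the term in that document

    public Posting() {}  // no-arg constructor: Hadoop creates instances via reflection

    public Posting(int docId, int freq) {
        this.docId = docId;
        this.freq = freq;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(docId);     // serialize the fields to the binary stream
        out.writeInt(freq);
    }

    public void readFields(DataInput in) throws IOException {
        docId = in.readInt();    // initialize the fields from the binary stream
        freq = in.readInt();
    }

    public String toString() {
        return docId + ":" + freq;
    }
}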
Custom Key Types
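Values only need to be Writable, but keys are also sorted and compared during the
shuffle, so a custom key type must implement WritableComparable. A sketch, assuming
an illustrative key made of a term and a document identifier:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class TermDocKey implements WritableComparable<TermDocKey> {

    private String term;
    private int docId;

    public TermDocKey() {}

    public void write(DataOutput out) throws IOException {
        out.writeUTF(term);
        out.writeInt(docId);
    }

    public void readFields(DataInput in) throws IOException {
        term = in.readUTF();
        docId = in.readInt();
    }

    // keys are sorted by term first, then by document identifier
    public int compareTo(TermDocKey other) {
        int cmp = term.compareTo(other.term);
        if (cmp != 0) return cmp;
        return docId < other.docId ? -1 : (docId == other.docId ? 0 : 1);
    }

    // hashCode() is used by the default HashPartitioner to assign keys to reducers
    public int hashCode() {
        return term.hashCode() * 163 + docId;
    }

    public boolean equals(Object o) {
        if (!(o instanceof TermDocKey)) return false;
        TermDocKey k = (TermDocKey) o;
        return term.equals(k.term) && docId == k.docId;
    }
}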
Using Custom Types
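When the mapper and reducer use custom types, the driver has to declare them through
the JobConf setters; a sketch assuming the Posting type above (the type choices are
illustrative):

// Intermediate (map output) types differ from the final output types here,
// so both pairs of setters are needed.
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Posting.class);   // mapper emits <term, posting>
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);         // reducer emits the posting list as text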
Partitioning Data
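The partitioner decides which reducer receives each intermediate <key,value> pair; by
default HashPartitioner assigns keys by hash code. A sketch of a custom partitioner for
the classic mapred API (the class and type names are illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class TermPartitioner implements Partitioner<Text, Posting> {

    public void configure(JobConf job) {}   // no configuration needed

    // route each term to a reducer; all postings of a term reach the same reducer
    public int getPartition(Text key, Posting value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// registered in the driver with:
// conf.setPartitionerClass(TermPartitioner.class);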