Homework_Labs_Lecture2
Files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
wc.jar: The compiled, assembled WordCount program.
In this lab you will compile Java files, create a JAR, and run MapReduce jobs.
In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.
One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
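As a rough sketch of the idea (not the lab's actual stub code, which lives in the stubs directory), a word-count job pairs a mapper that emits (word, 1) for every word with a reducer that sums those counts; the class names below are hypothetical:

// Hypothetical sketch only -- examine WordMapper.java and SumReducer.java for the real stubs.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (word, 1) for every word in each input line.
class SketchWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\W+")) {
            if (!token.isEmpty()) {
                word.set(token.toLowerCase());
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sum all the counts emitted for each word.
class SketchSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}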
$ cd ~/workspace/wordcount/src
$ ls
$ ls stubs
Examine these files if you wish, but do not change them. Remain in this directory while you execute the following commands.
2. Before compiling, examine the classpath Hadoop is configured to use:
$ hadoop classpath
This lists the locations where the Hadoop core API classes are installed.
Note: in the command above, the quotes around hadoop classpath are backquotes. This runs the hadoop classpath command and uses its output as part of the javac command.
The compiled (.class) files are placed in the stubs directory.
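A typical form of these compile and packaging commands, assuming the JAR is to be named wc.jar as in step 5:

$ javac -classpath `hadoop classpath` stubs/*.java
$ jar cvf wc.jar stubs/*.class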
5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences of each word in Shakespeare:
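A typical invocation, assuming the Shakespeare text has already been uploaded to an HDFS directory named shakespeare (an assumed name) and that the output should go to a new wordcounts directory:

$ hadoop jar wc.jar stubs.WordCount shakespeare wordcounts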
This hadoop jar command names the JAR file to use (wc.jar), the class whose main method should be invoked (stubs.WordCount), and the HDFS input and output directories to use for the MapReduce job.
Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory. This is by design; since the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you from accidentally overwriting previously existing files.
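A typical form of the listing command, assuming the job's output directory is named wordcounts:

$ hadoop fs -ls wordcounts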
This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)
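A likely form of the viewing command, again assuming the wordcounts output directory:

$ hadoop fs -cat wordcounts/part-r-00000 | less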
You can page through a few screens to see words and their frequencies in the works of Shakespeare. (The spacebar will scroll the output by one screen; the letter 'q' will quit the less utility.)
Note that you could have specified wordcounts/* just as well in this command.
Take care when using wildcards (e.g. *) to specify HDFS filenames; because of how Linux works, the shell will attempt to expand the wildcard before invoking hadoop, and will then pass references to local files instead of the intended HDFS files. You can prevent this by enclosing the wildcarded HDFS filenames in single quotes, e.g. hadoop fs -cat 'wordcounts/*'
When the job completes, inspect the contents of the pwords HDFS directory.
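One way to inspect it, using the quoting precaution described above:

$ hadoop fs -ls pwords
$ hadoop fs -cat 'pwords/*' | less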
10. Clean up the output files produced by your job runs:
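A typical cleanup command, assuming the output directories used in this lab were wordcounts and pwords (on older Hadoop releases, -rm -r may instead be -rmr):

$ hadoop fs -rm -r wordcounts pwords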
A MapReduce job, once submitted to Hadoop, runs independently of the initiating process, so losing the connection to the initiating process does not kill the job. Instead, you need to tell the Hadoop JobTracker to stop the job.
2. While this job is running, open another terminal window and enter:
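On a JobTracker-era (MRv1) cluster such as this lab assumes, the listing command is typically:

$ hadoop job -list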
This lists the job ids of all running jobs. A job id looks something like:
job_200902131742_0002
3. Copy the job id, and then kill the running job by entering:
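Again assuming an MRv1 cluster, and using the example job id above as a placeholder:

$ hadoop job -kill job_200902131742_0002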
The JobTracker kills the job, and the program running in the original terminal completes.