MapReduce
FUNDAMENTAL CONCEPTS
Why MapReduce?
■ How many movies did each user rate in the MovieLens data set?
How MapReduce Works: Mapping
[Diagram labels: INPUT DATA, Mapper, len(movies)]
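The question above (how many movies did each user rate?) can be sketched in plain Python without Hadoop. This is a minimal simulation, not the course script: the names `mapper`, `shuffle`, `reducer`, and `run` are hypothetical, the sample rows are made up, and the only assumption about the data is the MovieLens 100k `u.data` layout of tab-separated user ID, movie ID, rating, and timestamp.

```python
from collections import defaultdict

# Hypothetical sample rows in the u.data layout:
# user_id <TAB> movie_id <TAB> rating <TAB> timestamp
SAMPLE_LINES = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "196\t377\t1\t878887116",
    "186\t51\t2\t879606923",
    "196\t51\t5\t881251274",
]


def mapper(line):
    """Emit one (user_id, movie_id) pair per rating line."""
    user_id, movie_id, _rating, _timestamp = line.split("\t")
    yield user_id, movie_id


def shuffle(pairs):
    """Group values by key and sort by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(user_id, movies):
    """Count how many movies this user rated."""
    yield user_id, len(movies)


def run(lines):
    """Chain map, shuffle, and reduce over an iterable of input lines."""
    mapped = [pair for line in lines for pair in mapper(line)]
    return [out for key, values in shuffle(mapped) for out in reducer(key, values)]


if __name__ == "__main__":
    for user_id, count in run(SAMPLE_LINES):
        print(user_id, count)
```

In a real job the mappers and reducers run on different nodes and the framework performs the grouping; the sketch only shows the shape of the data at each stage.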
[Diagram labels: MAPPER, REDUCER]
[Diagram labels: MapTask / ReduceTask, key/values, stdin, stdout, streaming process]
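The streaming idea (a separate process that receives key/values on stdin and writes them to stdout) can be sketched as two small functions. This is an illustration under stated assumptions, not the code mrjob generates: the function names are hypothetical, the key and value are written as plain tab-separated text (the Hadoop Streaming default), and the `sorted` call in `simulate` stands in for the framework's shuffle and sort.

```python
import io


def stream_mapper(stdin, stdout):
    """Read raw rating lines and write 'user_id<TAB>1' for each one."""
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        stdout.write("{}\t1\n".format(fields[0]))


def stream_reducer(stdin, stdout):
    """Sum the counts for each run of identical keys (input must be sorted by key)."""
    current_key, total = None, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                stdout.write("{}\t{}\n".format(current_key, total))
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        stdout.write("{}\t{}\n".format(current_key, total))


def simulate(raw_text):
    """Run mapper, sort, and reducer in memory, with StringIO in place of the pipes."""
    mapped = io.StringIO()
    stream_mapper(io.StringIO(raw_text), mapped)
    sorted_lines = sorted(mapped.getvalue().splitlines(keepends=True))
    reduced = io.StringIO()
    stream_reducer(io.StringIO("".join(sorted_lines)), reduced)
    return reduced.getvalue()
```

With real pipes, `stream_mapper(sys.stdin, sys.stdout)` would be the whole mapper script. Because everything crosses the pipe as text, numeric values have to be parsed back with `int()` on the receiving side.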
Handling Failure
■ PIP
– Utility for installing Python packages
– yum install python-pip
■ MRJob
– pip install mrjob==0.5.11
■ Nano
– yum install nano
■ Data files and the script
– wget http://media.sundog-soft.com/hadoop/ml-100k/u.data
– wget http://media.sundog-soft.com/hadoop/RatingsBreakdown.py
Running with mrjob
■ Run locally
– python RatingsBreakdown.py u.data
■ Run with Hadoop
– python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar u.data
YOUR CHALLENGE
Sort movies by popularity with Hadoop
Challenge exercise
■ By default, streaming treats all input and output as strings, so keys get sorted as strings, not numerically (for example, "10" sorts before "9").
■ There are different formats you can specify, but for now let’s just zero-pad our numbers so they’ll sort properly.
■ The second reducer will look like this:
def reducer_count_ratings(self, key, values):
    yield str(sum(values)).zfill(5), key
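A quick way to see why the zero-padding matters is to sort the same counts both ways. The counts below are made up for illustration; `str.zfill` is the same call the reducer uses.

```python
# Hypothetical rating counts for four movies.
counts = [5, 40, 300, 7]

# Sorted as plain strings: compared character by character, so "300" comes first.
as_strings = sorted(str(c) for c in counts)

# Sorted as zero-padded strings: string order now matches numeric order.
as_padded = sorted(str(c).zfill(5) for c in counts)

print(as_strings)  # ['300', '40', '5', '7']
print(as_padded)   # ['00005', '00007', '00040', '00300']

# The padding is easy to undo when a number is needed again.
recovered = [int(s) for s in as_padded]
```

A width of 5 works as long as no count reaches 100,000; a larger data set would need a wider pad.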
Iterating through the results
■ Spoiler alert!
def reducer_sorted_output(self, count, movies):
    for movie in movies:
        yield movie, count
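To see how the two reducers chain together, the whole two-step job can be simulated in plain Python. This is a sketch, not the course solution: the reducer bodies mirror the slides with `self` dropped, while `mapper_get_ratings`, `group_by_key`, `run_two_steps`, and the sample rows are hypothetical, and the sorted grouping stands in for Hadoop's shuffle and sort between steps.

```python
from collections import defaultdict

# Hypothetical rows in the u.data layout: user_id, movie_id, rating, timestamp.
SAMPLE_LINES = [
    "196\t242\t3\t881250949",
    "186\t50\t5\t891717742",
    "22\t50\t4\t878887116",
    "244\t51\t2\t880606923",
    "166\t50\t3\t886397596",
    "298\t51\t4\t884182806",
]


def group_by_key(pairs):
    """Group values by key and return the groups sorted by key (all keys are strings)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def mapper_get_ratings(line):
    """Step 1 mapper (hypothetical name): emit (movie_id, 1) for each rating."""
    _user_id, movie_id, _rating, _timestamp = line.split("\t")
    yield movie_id, 1


def reducer_count_ratings(key, values):
    """Step 1 reducer: the zero-padded count becomes the key for step 2."""
    yield str(sum(values)).zfill(5), key


def reducer_sorted_output(count, movies):
    """Step 2 reducer: keys arrive sorted, so emitting in order gives sorted output."""
    for movie in movies:
        yield movie, count


def run_two_steps(lines):
    step1_in = [p for line in lines for p in mapper_get_ratings(line)]
    step1_out = [p for k, v in group_by_key(step1_in) for p in reducer_count_ratings(k, v)]
    return [p for k, v in group_by_key(step1_out) for p in reducer_sorted_output(k, v)]


if __name__ == "__main__":
    for movie, count in run_two_steps(SAMPLE_LINES):
        print(movie, count)
```

The second step needs no mapper of its own here: the first reducer already emits the count as the key, so the sort between steps does the ordering and the final reducer only flips each pair back to (movie, count).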
CHECK YOUR RESULTS
Did it work?
My solution