Lecture 3 - MapReduce
Hadoop provides an open-source implementation of MapReduce and the supporting infrastructure for distributed computing
Working with MapReduce
Conceptual Framework
The programmer only has to think about the logic of their program (expressed in the f_map and f_reduce functions)
The runtime (e.g., Hadoop) automatically takes care of scheduling, concurrency, and fault tolerance
Map phase
Shuffle phase
Reduce phase
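The three phases can be illustrated with the classic word-count example. Below is a minimal single-process Python sketch; a real runtime like Hadoop distributes these steps across nodes, but the per-phase logic is the same:

```python
from collections import defaultdict

# f_map: emit one intermediate (key, value) pair per word occurrence
def f_map(document):
    for word in document.split():
        yield (word, 1)

# f_reduce: aggregate all values for one intermediate key
def f_reduce(key, values):
    return (key, sum(values))

def mapreduce(documents):
    # Map phase
    intermediate = [pair for doc in documents for pair in f_map(doc)]
    # Shuffle phase: group all values by their intermediate key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase
    return dict(f_reduce(k, vs) for k, vs in groups.items())

result = mapreduce(["to be or not to be"])
# result == {"to": 2, "be": 2, "or": 1, "not": 1}
```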
Example — Map-Phase
Example — Shuffle-Phase
Example — Reduce-Phase
MapReduce in Practice
Shuffling (and Sorting)
Say the map-phase produces K total intermediate keys and we have R reducer nodes
How to efficiently assign the work for the K keys to our R reducer nodes?
Hash-partitioning: determine reducer node r for a key k as follows:
r = hash(k) mod R
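A sketch of hash-partitioning in Python (crc32 stands in here for the framework's hash function, since Python's built-in `hash` is salted per process; Hadoop's default partitioner hashes the key similarly):

```python
import zlib

def assign_reducer(key, R):
    # r = hash(k) mod R: deterministic, so every occurrence of the
    # same key is routed to the same reducer node
    return zlib.crc32(key.encode()) % R

R = 4
partitions = {r: [] for r in range(R)}
for key in ["apple", "banana", "apple", "cherry"]:
    partitions[assign_reducer(key, R)].append(key)
# both "apple" pairs land in the same partition
```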
Shuffle-phase in MapReduce implementations like Hadoop:
Use distributed sorting to form the groups of keys and values required for the reduce-phase
Key Assignment
All values for a given key k need to go to exactly one reducer
Conversely: a reducer applying f_reduce to an intermediate key k needs to see all associated values
Key Skew
What happens when the intermediate key distribution is unbalanced?
All values for the same key must go to the same reducer
Different reducers will have different workloads
This is called key skew (or data skew), and it can have a negative performance impact!
In the worst case, we have to wait for one reducer to finish the work for one large key group!
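Key skew can be made concrete with a quick simulation (hypothetical key frequencies; each reducer's work is approximated by the number of values routed to it):

```python
import zlib

R = 4
# hypothetical skewed distribution: one "hot" key dominates
key_counts = {"hot_key": 1_000_000, "k1": 100, "k2": 100, "k3": 100}

load = [0] * R
for key, count in key_counts.items():
    # hash-partitioning sends all values for a key to one reducer
    load[zlib.crc32(key.encode()) % R] += count

# the reducer that receives the hot key must process ~1M values;
# the whole job waits for that straggler to finish
```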
Combiners
Key skew leads to high latency: reducer time typically scales with the number of values per key
Combiners pre-aggregate the map output locally, before it is shuffled, shrinking the value lists sent to the reducers
This is safe when f_reduce is associative (and commutative), e.g., for summation:
A + B + C = (A + B) + C
When that holds, you can reuse f_reduce as f_combine!
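For word count, summation is associative and commutative, so f_reduce can double as the combiner. A minimal single-process Python sketch of a map task that applies f_combine locally (an illustration, not the Hadoop API):

```python
from collections import defaultdict

def f_reduce(key, values):
    return sum(values)  # associative: A + B + C = (A + B) + C

f_combine = f_reduce  # reuse is safe because sum is associative/commutative

def map_task(document):
    # raw map output: one (word, 1) pair per occurrence
    buffered = defaultdict(list)
    for word in document.split():
        buffered[word].append(1)
    # combiner: pre-aggregate locally, so only one pair per distinct
    # word crosses the network instead of one pair per occurrence
    return {word: f_combine(word, ones) for word, ones in buffered.items()}

out = map_task("to be or not to be")
# out == {"to": 2, "be": 2, "or": 1, "not": 1}
```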
Tips for MapReduce in Practice
Have fewer reducer nodes than intermediate keys to keep nodes busy!
Combiners can help, but sometimes a custom pre-aggregation during the map-phase is even better
Very advanced MapReduce programs exploit the sortedness of the reduce inputs
In a join implementation, we can leverage this to see one join input before the other
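One common way to get this effect (a sketch with hypothetical data, not the Hadoop API): tag each tuple so that, within a join key, tuples from one input sort before tuples from the other. The reducer can then buffer the first input and stream the second:

```python
from itertools import groupby

# hypothetical join inputs: (user_id, name) and (user_id, purchase)
users = [(1, "ada"), (2, "bob")]
purchases = [(1, "book"), (1, "pen"), (2, "mug")]

# tag each tuple so that, per join key, users sort before purchases
tagged = [((uid, 0), name) for uid, name in users] + \
         [((uid, 1), item) for uid, item in purchases]
tagged.sort()  # stands in for the shuffle-phase sort

result = []
for uid, group in groupby(tagged, key=lambda kv: kv[0][0]):
    name = None
    for (_, tag), value in group:
        if tag == 0:
            name = value          # seen first thanks to the sort order
        elif name is not None:
            result.append((uid, name, value))  # stream the other side
# result == [(1, "ada", "book"), (1, "ada", "pen"), (2, "bob", "mug")]
```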
Task: OPT
The following MapReduce program operates on log data from a video streaming platform. Its input data consists of key-value pairs in the format (video, (calendar_week, daily_views)), where daily_views is the list of views per day in the given calendar_week for the given video.
The MapReduce program computes the minimum number of views per day for a video after the tenth
calendar week. Can you rewrite the program to make it more efficient (e.g., to have it send less data from
the map-phase to the reduce-phase)?
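The program listing itself is not reproduced in these notes. As an illustration only (hypothetical function shapes), an optimized version could filter already in the map phase and pre-aggregate; since min is associative and commutative, f_reduce can also be reused as a combiner:

```python
def f_map(video, week_views):
    calendar_week, daily_views = week_views
    # map-side filter: weeks <= 10 are dropped and never shuffled
    if calendar_week > 10:
        # pre-aggregate the per-week list to a single value
        yield (video, min(daily_views))

def f_reduce(video, values):
    # min is associative and commutative, so this function can also
    # serve as f_combine on the map side
    yield (video, min(values))
```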
Criticisms of MapReduce
Criticism 1: Too low-level
No schema for processed data
Lack of a high-level access language like SQL
Lack of support for important relational operations like joins
“MapReduce has learned none of these lessons and represents a throw back to the 1960s, before modern DBMSs were invented.” (DeWitt and Stonebraker, 2008)
Drawbacks often addressed with layers on top of MapReduce like Apache Pig or Apache Hive
For example, if we only need to access a given subset of the data, MapReduce still has to scan the whole input data!
Also incompatible with standard DBMS tooling, e.g.:
Visualization
Data migration
Database design
Search engine companies found database technology neither well suited nor cost-effective
Relational data management mismatch for web search:
New types of queries very different from traditional SQL-based data analysis, e.g.,
Ranking of search results based on link structure of the web (graph processing)
Apache Spark
Apache Flink
Apache Beam