
Lecture 3 - MapReduce

What should you be able to do after this week?


Describe the anatomy of a MapReduce job
Analyse the suitability of the MapReduce approach for a given problem
Design implementations for MapReduce programs

Introduction & Motivation


Motivation — Text Indexing
Say you have a dataset of N documents (with N very large, e.g. the web), and you want to construct an
index: words → documents
On a single machine, this process takes O(N) time

Observation: this problem is (almost) embarrassingly parallel

Whether any word appears in a document is independent of other documents

We should be able to process documents independently and combine the results

We could have multiple computers write to a shared database


With M machines, can we lower the time to O(N/M)?

How to distribute work (and data) and collect results?

MapReduce provides a framework for this

Hadoop provides an open-source implementation of MapReduce and supporting infrastructure for distributed computing

Power Through Restrictions


RDBMS/SQL empowers us by restricting how we store and query data
MapReduce empowers us by restricting how we implement algorithms

Why “map” and “reduce”?


Map and reduce are common higher-order functions in functional programming languages, e.g., Haskell or Scala

A higher-order function takes another function as an argument

Example — Sum of Squares
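A minimal sketch of the idea in Python, where map and reduce take the squaring and addition functions as arguments (illustrative, reconstructed from the example's title):

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map applies a function to each element independently
squares = map(lambda x: x * x, numbers)          # 1, 4, 9, 16

# reduce folds the elements pairwise into a single result
sum_of_squares = reduce(lambda acc, x: acc + x, squares, 0)
print(sum_of_squares)                            # 30
```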

Working with MapReduce
Conceptual Framework

Why does this help with Distributed Data Processing?


Distributed programming is very hard, common challenges:

Scheduling (which piece of work to execute when)

Concurrency (how to run certain parts of our computation in parallel)

Fault Tolerance (how to handle machine/disk failures)

Design goal of MapReduce:

Programmer only has to think about the logic of their program (expressed in the f_map and f_reduce functions)

Runtime (e.g., Hadoop) automatically takes care of scheduling, concurrency, fault tolerance

Distributed Execution of a MapReduce program

Map phase

Read input data

Generate intermediate results via f_map


Shuffle phase

Group intermediate results by key

Move data from mappers to reducers

Reduce phase

Execute f_reduce and collect output


Example: Distributed Word Counting


Task: given a large collection of text documents, count how often each word occurs overall
MapReduce implementation:
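A minimal sketch of the two functions in Python (illustrative pseudocode with made-up documents; a real Hadoop job would implement the equivalent Mapper and Reducer classes in Java):

```python
from collections import defaultdict

def f_map(doc_id, text):
    # Map phase: emit an intermediate (word, 1) pair per word occurrence
    for word in text.split():
        yield (word, 1)

def f_reduce(word, counts):
    # Reduce phase: sum all partial counts shuffled to us for this word
    return (word, sum(counts))

# Local stand-in for the shuffle phase: group emitted values by key
docs = {1: "to be or not to be", 2: "to do is to be"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, one in f_map(doc_id, text):
        groups[word].append(one)

print(sorted(f_reduce(w, cs) for w, cs in groups.items()))
# [('be', 3), ('do', 1), ('is', 1), ('not', 1), ('or', 1), ('to', 4)]
```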

Example — Input Data

Example — Map-Phase

Example — Shuffle-Phase

Example — Reduce-Phase

Task: MapReduce Movement


Illustrate the intermediate results and data movement of the following MapReduce job

MapReduce in Practice
Shuffling (and Sorting)
Say the map-phase produces K total intermediate keys and we have R reducer nodes

How to efficiently assign the work for the K keys to our R reducer nodes?
Hash-partitioning: determine reducer node r for a key k as follows:

r = hash(k) mod R
Shuffle-phase in MapReduce implementations like Hadoop:

Use hash-partitioning to assign keys to reducers

Use distributed sorting to form the groups of keys and values required for the reduce-phase
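A sketch of hash-partitioning in Python; md5 stands in for the deterministic hash function (Python's built-in hash() is randomised per process for strings, so it is unsuitable here):

```python
import hashlib

R = 4  # number of reducer nodes

def partition(key, R):
    # Deterministic hash-partitioning: every occurrence of the same
    # key k is routed to the same reducer r in 0..R-1
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % R

for key in ["be", "do", "is", "to"]:
    print(key, "-> reducer", partition(key, R))
```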

Key Assignment
All values for a given key k need to go to exactly one reducer
Conversely: a reducer applying f_reduce on an intermediate key k needs to see all associated values

This can have performance impact!

Key Skew
What happens when the intermediate key distribution is unbalanced?
All values for the same key must go to the same reducer

Different reducers will have different work loads
This is called key skew (or data skew), and it can have a negative performance impact!

In the worst case, we have to wait for one reducer to finish the work for one large key group!

Combiners
Key skew leads to high latency
Reducer time typically scales with the number of values per key

Lots of keys ⇒ lots of communication (shuffling data is expensive!)


We can sometimes simplify the reducer’s job by pre-aggregating (combining) data before shuffling via a function f_combine

Combiner for Word Counting


This works because summation is commutative and associative:

A + B = B + A

(A + B) + C = A + (B + C)

When that holds, you can re-use f_reduce as f_combine!
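Continuing the word-count example, a sketch of this reuse in Python (illustrative names):

```python
def f_reduce(word, counts):
    # Sums of partial sums still yield the correct total, because
    # addition is commutative and associative
    return (word, sum(counts))

# Run the reducer logic on each mapper node before shuffling
f_combine = f_reduce

# A mapper that saw "to" four times locally now shuffles the single
# pair ('to', 4) instead of four ('to', 1) pairs
print(f_combine("to", [1, 1, 1, 1]))  # ('to', 4)
```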

Combiner for Averaging


Key idea: propagate the sum and the count!

f_combine can then pre-aggregate the intermediate sums and counts

f_reduce can compute the final average via the total sum divided by the total count
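A sketch of this in Python (illustrative names; the point is that averages are not associative, so the combiner must forward (sum, count) pairs rather than partial averages):

```python
def f_map(key, value):
    # Emit a (sum, count) pair instead of the raw value
    yield (key, (value, 1))

def f_combine(key, pairs):
    # Pre-aggregate locally: a sum of sums and a sum of counts
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, (total, count))

def f_reduce(key, pairs):
    # Final average = total sum / total count
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, total / count)
```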

Tips for MapReduce in Practice
Have fewer reducer nodes than intermediate keys to keep nodes busy!

Combiners can help, but sometimes a custom pre-aggregation during the map-phase is even better
Very advanced MapReduce programs exploit the sortedness of the reduce inputs

In a join implementation, we can leverage this to see one join input before the other
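A rough sketch of that idea in Python (hypothetical "users" table as the buffered side; a real Hadoop implementation would additionally need a custom partitioner and grouping comparator so that records are partitioned by the join key but sorted by the tag):

```python
def f_map(source, record):
    key, value = record
    # Tag each record with its side; tag 0 sorts before tag 1,
    # so the reducer sees all "users" rows for a key first
    tag = 0 if source == "users" else 1
    yield ((key, tag), value)

def f_reduce(key, tagged_values):
    # tagged_values arrives sorted by tag: buffer the small side,
    # then stream through the large side without buffering it
    buffered = []
    for tag, value in tagged_values:
        if tag == 0:
            buffered.append(value)
        else:
            for left in buffered:
                yield (key, (left, value))
```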

Task: OPT
The following MapReduce program operates on log data from a video streaming platform. Its input data
consists of key-value pairs in the format (video, (calendar_week, daily_views)). This data denotes a list of
views per day (daily_views) in a given calendar_week for a given video.
The MapReduce program computes the minimum number of views per day for a video after the tenth
calendar week. Can you rewrite the program to make it more efficient (e.g., to have it send less data from
the map-phase to the reduce-phase)?
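The original program is not reproduced here; assuming a naive version that shuffles every (calendar_week, daily_views) pair to the reducers, one possible rewrite filters in the map-phase and pre-aggregates with a combiner (min is commutative and associative, so f_reduce can double as f_combine):

```python
def f_map(video, record):
    calendar_week, daily_views = record
    # Filter in the map phase: weeks up to and including 10 can
    # never contribute to the result, so never shuffle them
    if calendar_week > 10:
        yield (video, daily_views)

def f_reduce(video, views):
    # The min over partial minima equals the min over all values
    yield (video, min(views))

# Pre-aggregate on each mapper node before shuffling
f_combine = f_reduce
```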

Criticisms of MapReduce
Criticism 1: Too low-level
No schema for processed data
Lack of a high-level access language like SQL
Lack of support for important relational operations like joins

“MapReduce has learned none of these lessons and represents a throw back to the 1960s, before modern DBMSs were invented.” (DeWitt & Stonebraker, 2008)

Drawbacks often addressed with layers on top of MapReduce like Apache Pig or Apache Hive

Criticism 2: Poor Implementation


MapReduce does not index data like an RDBMS, indexing can greatly accelerate many queries!

For example, if we only need to access a given subset of the data, MapReduce has to scan the whole input data!

No optimised execution for complex programs consisting of multiple MapReduce jobs

Intermediate results always written to distributed storage in between!

Criticism 3: Not novel


Plenty of previous systems apply distributed partitioning and aggregation
Fundamental primitives in distributed relational databases!

Criticism 4: Lack of DBMS compatibility


Lots of infrastructure has been built on top of standard DBMS for, e.g.,

Visualization

Data migration

Database design

Not compatible with MapReduce!


Nowadays, many systems support SQL-like queries of data in data lakes

The Big Question


Why was MapReduce so successful?

Google & the Rise of the Web


Rise of the world wide web in the 1990s produced a growing need to query and index the data available online

Search engine companies found database technology neither well suited nor cost-effective
Relational data management was a mismatch for web search:

Dirty, semi-structured web data hard to fit into a relational schema

High availability much more important than consistency

New types of queries very different from traditional SQL-based data analysis, e.g.,

Extracting content from web pages (information extraction)

Ranking of search results based on link structure of the web (graph processing)

What is left of MapReduce nowadays?


MapReduce subsumed into more general abstractions and systems for distributed dataflow processing

Apache Spark

Apache Flink

Apache Beam

All these systems can run MapReduce jobs!

