
Lecture 3 - MapReduce

What should you be able to do after this week?


Describe the anatomy of a MapReduce job
Analyse the suitability of the MapReduce approach for a given problem
Design implementations for MapReduce programs

Introduction & Motivation


Motivation — Text Indexing
Say you have a dataset of N documents (with N very large, e.g. the web), and you want to construct an
index: words → documents
On a single machine, this process takes O(N) time

Observation: this problem is (almost) embarrassingly parallel

Whether any word appears in a document is independent of other documents

We should be able to process documents independently and combine the results

We could have multiple computers write to a shared database


With M machines, can we lower the time to O(N/M)?

How to distribute work (and data) and collect results?

MapReduce provides a framework for this

Hadoop provides an open-source implementation of MapReduce and supporting infrastructure for distributed computing

Power Through Restrictions


RDBMS/SQL empowers us by restricting how we store and query data
MapReduce empowers us by restricting how we implement algorithms

Why “map” and “reduce”?


Map and reduce are common higher-order functions in functional programming languages, e.g., Haskell or Scala

A higher-order function takes another function as an argument

Example — Sum of Squares
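A minimal sketch of the idea in Python, where map and reduce take the squaring and addition functions as arguments (illustrative, reconstructed from the example's title):

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map applies a function to each element independently
squares = map(lambda x: x * x, numbers)          # 1, 4, 9, 16

# reduce folds the elements pairwise into a single result
sum_of_squares = reduce(lambda acc, x: acc + x, squares, 0)
print(sum_of_squares)                            # 30
```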

Working with MapReduce
Conceptual Framework

Why does this help with Distributed Data Processing?


Distributed programming is very hard, common challenges:

Scheduling (which piece of work to execute when)

Concurrency (how to run certain parts of our computation in parallel)

Fault Tolerance (how to handle machine/disk failures)

Design goal of MapReduce:

Programmer only has to think about the logic of their program (expressed in the f_map and f_reduce functions)

Runtime (e.g., Hadoop) automatically takes care of scheduling, concurrency, fault tolerance

Distributed Execution of a MapReduce program

Map phase

Read input data

Generate intermediate results via f_map


Shuffle phase

Group intermediate results by key

Move data from mappers to reducers

Reduce phase

Execute f_reduce and collect output


Example: Distributed Word Counting


Task: given a large collection of text documents, count how often each word occurs overall
MapReduce implementation:
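A minimal sketch of the two functions in Python (illustrative pseudocode with made-up documents; a real Hadoop job would implement the equivalent Mapper and Reducer classes in Java):

```python
from collections import defaultdict

def f_map(doc_id, text):
    # Map phase: emit an intermediate (word, 1) pair per word occurrence
    for word in text.split():
        yield (word, 1)

def f_reduce(word, counts):
    # Reduce phase: sum all partial counts shuffled to us for this word
    return (word, sum(counts))

# Local stand-in for the shuffle phase: group emitted values by key
docs = {1: "to be or not to be", 2: "to do is to be"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for word, one in f_map(doc_id, text):
        groups[word].append(one)

print(sorted(f_reduce(w, cs) for w, cs in groups.items()))
# [('be', 3), ('do', 1), ('is', 1), ('not', 1), ('or', 1), ('to', 4)]
```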

Example — Input Data

Example — Map-Phase

Example — Shuffle-Phase

Example — Reduce-Phase

Task: MapReduce Movement


Illustrate the intermediate results and data movement of the following MapReduce job

MapReduce in Practice
Shuffling (and Sorting)
Say the map-phase produces K total intermediate keys and we have R reducer nodes

How to efficiently assign the work for the K keys to our R reducer nodes?
Hash-partitioning: determine reducer node r for a key k as follows:

r = hash(k) mod R
Shuffle-phase in MapReduce implementations like Hadoop:

Use hash-partitioning to assign keys to reducers

Use distributed sorting to form the groups of keys and values required for the reduce-phase
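A sketch of hash-partitioning in Python; md5 stands in for the deterministic hash function (Python's built-in hash() is randomised per process for strings, so it is unsuitable here):

```python
import hashlib

R = 4  # number of reducer nodes

def partition(key, R):
    # Deterministic hash-partitioning: every occurrence of the same
    # key k is routed to the same reducer r in 0..R-1
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % R

for key in ["be", "do", "is", "to"]:
    print(key, "-> reducer", partition(key, R))
```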

Key Assignment
All values for a given key k need to go to exactly one reducer
Conversely: a reducer applying f_reduce on an intermediate key k needs to see all associated values

This can have performance impact!

Key Skew
What happens when the intermediate key distribution is unbalanced?
All values for the same key must go to the same reducer

Different reducers will have different work loads
This is called key skew (or data skew), and it can have a negative performance impact!

In the worst case, we have to wait for one reducer to finish the work for one large key group!

Combiners
Key skew leads to high latency
Reducer time typically scales with the number of values per key

Lots of keys ⇒ lots of communication (shuffling data is expensive!)


We can sometimes simplify the reducer’s job by pre-aggregating (combining) data before shuffling via a function f_combine

Combiner for Word Counting


This works because summation is commutative and associative:

A + B = B + A

(A + B) + C = A + (B + C)

When that holds, you can re-use f_reduce as f_combine!
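Continuing the word-count example, a sketch of this reuse in Python (illustrative names):

```python
def f_reduce(word, counts):
    # Sums of partial sums still yield the correct total, because
    # addition is commutative and associative
    return (word, sum(counts))

# Run the reducer logic on each mapper node before shuffling
f_combine = f_reduce

# A mapper that saw "to" four times locally now shuffles the single
# pair ('to', 4) instead of four ('to', 1) pairs
print(f_combine("to", [1, 1, 1, 1]))  # ('to', 4)
```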

Combiner for Averaging


Key idea: propagate the sum and the count!

f_combine can then pre-aggregate the intermediate sums and counts

f_reduce can compute the final average via the total sum divided by the total count
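A sketch of this in Python (illustrative names; the point is that averages are not associative, so the combiner must forward (sum, count) pairs rather than partial averages):

```python
def f_map(key, value):
    # Emit a (sum, count) pair instead of the raw value
    yield (key, (value, 1))

def f_combine(key, pairs):
    # Pre-aggregate locally: a sum of sums and a sum of counts
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, (total, count))

def f_reduce(key, pairs):
    # Final average = total sum / total count
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    yield (key, total / count)
```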

Tips for MapReduce in Practice
Have fewer reducer nodes than intermediate keys to keep nodes busy!

Combiners can help, but sometimes a custom pre-aggregation during the map-phase is even better
Very advanced MapReduce programs exploit the sortedness of the reduce inputs

In a join implementation, we can leverage this to see one join input before the other
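A rough sketch of that idea in Python (hypothetical "users" table as the buffered side; a real Hadoop implementation would additionally need a custom partitioner and grouping comparator so that records are partitioned by the join key but sorted by the tag):

```python
def f_map(source, record):
    key, value = record
    # Tag each record with its side; tag 0 sorts before tag 1,
    # so the reducer sees all "users" rows for a key first
    tag = 0 if source == "users" else 1
    yield ((key, tag), value)

def f_reduce(key, tagged_values):
    # tagged_values arrives sorted by tag: buffer the small side,
    # then stream through the large side without buffering it
    buffered = []
    for tag, value in tagged_values:
        if tag == 0:
            buffered.append(value)
        else:
            for left in buffered:
                yield (key, (left, value))
```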

Task: OPT
The following MapReduce program operates on log data from a video streaming platform. Its input data
consists of key-value pairs in the format (video, (calendar_week, daily_views)). This data denotes a list of
views per day (daily_views) in a given calendar_week for a given video.
The MapReduce program computes the minimum number of views per day for a video after the tenth
calendar week. Can you rewrite the program to make it more efficient (e.g., to have it send less data from
the map-phase to the reduce-phase)?
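The original program is not reproduced here; assuming a naive version that shuffles every (calendar_week, daily_views) pair to the reducers, one possible rewrite filters in the map-phase and pre-aggregates with a combiner (min is commutative and associative, so f_reduce can double as f_combine):

```python
def f_map(video, record):
    calendar_week, daily_views = record
    # Filter in the map phase: weeks up to and including 10 can
    # never contribute to the result, so never shuffle them
    if calendar_week > 10:
        yield (video, daily_views)

def f_reduce(video, views):
    # The min over partial minima equals the min over all values
    yield (video, min(views))

# Pre-aggregate on each mapper node before shuffling
f_combine = f_reduce
```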

Criticisms of MapReduce
Criticism 1: Too low-level
No schema for processed data
Lack of a high-level access language like SQL
Lack of support for important relational operations like joins

“MapReduce has learned none of these lessons and represents a throw back to the 1960s, before modern DBMSs were invented.” (DeWitt & Stonebraker, 2008)

Drawbacks often addressed with layers on top of MapReduce like Apache Pig or Apache Hive

Criticism 2: Poor Implementation


MapReduce does not index data like an RDBMS, indexing can greatly accelerate many queries!

For example, if we only need to access a given subset of the data, MapReduce has to scan the whole input data!

No optimised execution for complex programs consisting of multiple MapReduce jobs

Intermediate results always written to distributed storage in between!

Criticism 3: Not novel


Plenty of previous systems apply distributed partitioning and aggregation
Fundamental primitives in distributed relational databases!

Criticism 4: Lack of DBMS compatibility


Lots of infrastructure has been built on top of standard DBMS for, e.g.,

Visualization

Data migration

Database design

Not compatible with MapReduce!


Nowadays, many systems support SQL-like queries of data in data lakes

The Big Question


Why was MapReduce so successful?

Google & the Rise of the Web


Rise of the world wide web in the 1990s produced a growing need to query and index the data available online

Search engine companies found database technology neither well suited nor cost-effective
Relational data management was a mismatch for web search:

Dirty, semi-structured web data hard to fit into a relational schema

High availability much more important than consistency

New types of queries very different from traditional SQL-based data analysis, e.g.,

Extracting content from web pages (information extraction)

Ranking of search results based on link structure of the web (graph processing)

What is left of MapReduce nowadays?


MapReduce subsumed into more general abstractions and systems for distributed dataflow processing

Apache Spark

Apache Flink

Apache Beam

All these systems can run MapReduce jobs!

