BDA05 DistributedComputing

This document discusses distributed computing and MapReduce, specifically: 1. The MapReduce model assigns data splits to mappers which apply map functions to produce intermediate key-value pairs, then reducers combine values by key and apply reduce functions to produce outputs. 2. MapReduce is better than Extract-Transform-Load (ETL) as it moves computation rather than data, improves throughput by distributing tasks across nodes, and maintains data locality. 3. MapReduce can scale to large clusters by partitioning data across machines and moving computation rather than transferring all data over networks.


DISTRIBUTED COMPUTING
Shankar Venkatagiri
MODEL

➤ Responsibility: the user writes a map & reduce function pair

and specifies to MapReduce the input/output locations
1. The MapReduce Master assigns data (“splits”) to mappers
2. Each map function takes an input pair and produces a
set of intermediate key-value pairs
3. The MapReduce Master groups intermediate values by key
and assigns them to reducers
4. Each reduce function takes a key + the values for that key.
Next, it acts on those values and produces zero or one output
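The four steps above can be sketched in miniature. This is an illustrative word-count job, not the actual MapReduce API: the Master's scheduling and the shuffle are collapsed into plain loops, and each "split" is simply a list of text lines.

```python
from collections import defaultdict

# Step 2: the user-supplied map function emits intermediate key-value pairs.
def map_fn(_key, line):
    for word in line.split():
        yield (word, 1)

# Step 4: the user-supplied reduce function collapses all values for a key.
def reduce_fn(key, values):
    return (key, sum(values))

def run_mapreduce(splits):
    intermediate = defaultdict(list)
    # Step 1: the Master assigns each split to a mapper (here, a loop).
    for split in splits:
        for line in split:
            # Step 3: group intermediate values by key (the "shuffle").
            for k, v in map_fn(None, line):
                intermediate[k].append(v)
    # Step 4: each key + its values goes to a reducer.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

counts = run_mapreduce([["the quick fox"], ["the lazy dog"]])
# counts → {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```

In a real deployment the mappers and reducers run on different machines; the user still writes only the two functions.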

DIFFERENCE

➤ Contrast with the ETL model: Extract-Transform-Load

➤ Schema-on-write versus schema-on-read (e.g. tweets)
➤ The program must take care of parsing the data
➤ Moving Computation is Cheaper than Moving Data
➤ The Master orchestrates multiple map & reduce tasks to
chomp through data splits, which improves overall throughput
➤ Locality: the Master maintains all locational information.
It assigns splits to idle map tasks by proximity to the data
➤ Redundancy: should any nodes/tasks slow down, backup
tasks are launched
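Schema-on-read means the map function itself parses raw records at processing time, rather than a schema being enforced at load time. A minimal sketch, assuming tweet-like JSON records; the field names are illustrative, not the actual Twitter schema:

```python
import json

# Schema-on-read: raw records carry no enforced schema, so the map
# function parses each one and decides what to extract.
def map_fn(_key, raw_record):
    try:
        tweet = json.loads(raw_record)
    except json.JSONDecodeError:
        return  # malformed records are skipped, not rejected at load time
    if "lang" in tweet:
        yield (tweet["lang"], 1)  # e.g. count tweets per language

pairs = list(map_fn(None, '{"lang": "en", "text": "hello"}'))
# pairs → [("en", 1)]
```

Under schema-on-write (the ETL model), the malformed record would have been rejected or cleaned before it ever reached storage.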

SCALING UP

➤ Video: Inside a Google Data Centre (1:34 onwards)


➤ R and Python support analytical processing on a single machine
➤ Q: Why not extend this framework to a cluster?
➤ Data is partitioned across multiple machines in a network
➤ Network transfer times ≫ memory access times
➤ Probability of failure increases with size and expanse
➤ Ryza et al. “These facts require a programming paradigm that is
sensitive to the characteristics of the underlying system: one that
discourages poor choices and makes it easy to write code that will
execute in a highly parallel manner.”

SPOTLIGHT
➤ Sanjay Ghemawat, Google Fellow
➤ Ph.D. (MIT), B.S. (Cornell)
➤ Winner, ACM Prize in Computing
➤ Projects worked on
➤ MapReduce
➤ Google File System aka GFS
➤ BigTable
➤ TensorFlow
➤ …

MAGICIANS
➤ Dean & Ghemawat (2008): “Users specify the computation in
terms of a map and a reduce function, and the underlying runtime
system automatically parallelizes the computation across large-scale
clusters of machines, handles machine failures, and schedules inter-
machine communication to make efficient use of the network & disks.”
➤ Video: Google Round Table (up to 1:12, 11:10 - 13:35)
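The parallelization the quote describes can be imitated on one machine: a sketch using Python's multiprocessing pool to stand in for a cluster of mappers, with the word-count functions as illustrative stand-ins for user code (failure handling and network scheduling are omitted).

```python
from collections import defaultdict
from multiprocessing import Pool

def map_fn(line):
    # User-supplied map: emit (word, 1) for each word in the line.
    return [(w, 1) for w in line.split()]

def run_parallel(splits, workers=2):
    # The runtime farms map tasks out to worker processes in parallel...
    with Pool(workers) as pool:
        mapped = pool.map(map_fn, splits)
    # ...then shuffles intermediate pairs by key and reduces.
    grouped = defaultdict(list)
    for pairs in mapped:
        for k, v in pairs:
            grouped[k].append(v)
    return {k: sum(vs) for k, vs in grouped.items()}

if __name__ == "__main__":
    print(run_parallel(["to be or", "not to be"]))
```

The user writes only `map_fn` (and the reduction); the pool, grouping, and scheduling play the role of the "underlying runtime system".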
