Cloud Computing

Cloud Computing Lecture #1
What is Cloud Computing?
(and an intro to parallel/distributed processing)
Jimmy Lin
The iSchool
University of Maryland
Wednesday, September 3, 2008
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Some material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet,
Google Distributed Computing Seminar, 2007 (licensed under Creative Commons Attribution 3.0 License)

Source: http://www.free-pictures-photos.com/

What is Cloud Computing?
1. Web-scale problems
2. Large data centers
3. Different models of computing
4. Highly-interactive Web applications

1. Web-Scale Problems
 Characteristics:
 Definitely data-intensive
 May also be processing intensive
 Examples:
 Crawling, indexing, searching, mining the Web
 “Post-genomics” life sciences research
 Other scientific data (physics, astronomy, etc.)
 Sensor networks
 Web 2.0 applications
 …

How much data?
 Wayback Machine has 2 PB + 20 TB/month (2006)
 Google processes 20 PB a day (2008)
 “all words ever spoken by human beings” ~ 5 EB
 NOAA has ~1 PB climate data (2007)
 CERN’s LHC will generate 15 PB a year (2008)
“640K ought to be enough for anybody.”

There’s nothing like more data!
s/inspiration/data/g;
(Banko and Brill, ACL 2001)
(Brants et al., EMNLP 2007)

What to do with more data?
 Answering factoid questions
 Pattern matching on the Web
 Works amazingly well
 Learning relations
 Start with seed instances
 Search for patterns on the Web
 Using patterns to find more instances
Factoid example: Who shot Abraham Lincoln? → X shot Abraham Lincoln
Seed instances: Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879)
Text found on the Web: “Wolfgang Amadeus Mozart (1756 - 1791)”, “Einstein was born in 1879”
Learned patterns: “PERSON (DATE –”, “PERSON was born in DATE”
(Brill et al., TREC 2001; Lin, ACM TOIS 2007)
(Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; … )
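
The pattern step above can be sketched in a few lines of Python. This is a toy illustration, not the method of the cited papers: the four-sentence corpus, the hand-written regular expressions, and the function name extract_birth_years are all assumptions made for this example.

```python
import re

# Hypothetical surface patterns in the spirit of "PERSON (DATE -" and
# "PERSON was born in DATE". They are hand-written here, not learned.
NAME = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
PATTERNS = [
    re.compile(NAME + r" \((?P<year>\d{4}) ?[-–]"),
    re.compile(NAME + r" was born in (?P<year>\d{4})"),
]

def extract_birth_years(text):
    """Return the set of (person, year) pairs matched by any pattern."""
    found = set()
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.add((match.group("person"), int(match.group("year"))))
    return found

# Made-up stand-in for text retrieved from the Web.
corpus = (
    "Wolfgang Amadeus Mozart (1756 - 1791) was a composer. "
    "Einstein was born in 1879. "
    "Marie Curie (1867 - 1934) was a physicist. "
    "Darwin was born in 1809."
)
instances = extract_birth_years(corpus)
```

The two seed instances are recovered, and the same patterns also pick up two new instances (Curie and Darwin), which is the point of the technique.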

2. Large Data Centers
 Web-scale problems? Throw more machines at it!
 Clear trend: centralization of computing resources in large data centers
 Necessary ingredients: fiber, juice, and space
 What do Oregon, Iceland, and abandoned mines have in common?
 Important Issues:
 Redundancy
 Efficiency
 Utilization
 Management

Source: Harper’s (Feb, 2008)

Key Technology: Virtualization
Traditional Stack (bottom to top): Hardware → Operating System → App, App, App
Virtualized Stack (bottom to top): Hardware → Hypervisor → OS, OS, OS → App, App, App

3. Different Computing Models
 Utility computing
 Why buy machines when you can rent cycles?
 Examples: Amazon’s EC2, GoGrid, AppNexus
 Platform as a Service (PaaS)
 Give me a nice API and take care of the implementation
 Example: Google App Engine
 Software as a Service (SaaS)
 Just run it for me!
 Example: Gmail
“Why do it yourself if you can pay someone to do it for you?”

4. Web Applications
 A mistake on top of a hack built on sand held together by duct tape?
 What is the nature of software applications?
 From the desktop to the browser
 SaaS == Web-based applications
 Examples: Google Maps, Facebook
 How do we deliver highly-interactive Web-based applications?
 AJAX (asynchronous JavaScript and XML)
 For better, or for worse…

What is the course about?
 MapReduce: the “back-end” of cloud computing
 Batch-oriented processing of large datasets
 Ajax: the “front-end” of cloud computing
 Highly-interactive Web-based applications
 Computing “in the clouds”
 Amazon’s EC2/S3 as an example of utility computing

Amazon Web Services
 Elastic Compute Cloud (EC2)
 Rent computing resources by the hour
 Basic unit of accounting = instance-hour
 Additional costs for bandwidth
 Simple Storage Service (S3)
 Persistent storage
 Charge by the GB/month
 Additional costs for bandwidth
 You’ll be using EC2/S3 for course assignments!
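
As a rough worked example of the accounting units above, the sketch below adds up instance-hours, GB-months, and bandwidth. Every rate in it is a placeholder chosen for the arithmetic, not an actual AWS price, and rounding a partial hour up to a full instance-hour is an assumption. The function name estimate_cost is hypothetical.

```python
import math

def estimate_cost(instances, hours_each, stored_gb, months, transfer_gb,
                  rate_per_instance_hour, rate_per_gb_month,
                  rate_per_gb_transfer):
    """Return (instance_hours, total_cost) for a simple usage scenario."""
    # Assumption: a partial hour is billed as a full instance-hour.
    instance_hours = instances * math.ceil(hours_each)
    compute = instance_hours * rate_per_instance_hour
    storage = stored_gb * months * rate_per_gb_month
    bandwidth = transfer_gb * rate_per_gb_transfer
    return instance_hours, compute + storage + bandwidth

# Placeholder rates (not real prices): 0.10 per instance-hour,
# 0.15 per GB-month stored, 0.10 per GB transferred.
instance_hours, total = estimate_cost(
    instances=10, hours_each=2.5, stored_gb=50, months=1, transfer_gb=20,
    rate_per_instance_hour=0.10, rate_per_gb_month=0.15,
    rate_per_gb_transfer=0.10,
)
```

With these inputs, 10 instances running 2.5 hours each are counted as 30 instance-hours, and the total is the sum of the compute, storage, and bandwidth terms.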

This course is not for you…
 If you’re not genuinely interested in the topic
 If you’re not ready to do a lot of programming
 If you’re not open to thinking about computing in new ways
 If you can’t cope with uncertainty, unpredictability, poor documentation, and immature software
 If you can’t put in the time
Otherwise, this will be a richly rewarding course!

Source: http://davidzinger.wordpress.com/2007/05/page/2/

Cloud Computing Zen
 Don’t get frustrated (take a deep breath)…
 This is bleeding edge technology
 Those W$*#T@F! moments
 Be patient…
 This is the second first time I’ve taught this course
 Be flexible…
 There will be unanticipated issues along the way
 Be constructive…
 Tell me how I can make everyone’s experience better

Things to go over…
 Course schedule
 Assignments and deliverables
 Amazon EC2/S3

Web-Scale Problems?
 Don’t hold your breath:
 Biocomputing
 Nanocomputing
 Quantum computing
 …
 It all boils down to…
 Divide-and-conquer
 Throwing more hardware at the problem
Simple to understand… a lifetime to master…

Divide and Conquer
“Work” → Partition → w1, w2, w3
w1, w2, w3 → “worker”, “worker”, “worker” → r1, r2, r3
r1, r2, r3 → Combine → “Result”
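
The partition / work / combine flow can be sketched as follows. This is a minimal illustration under assumptions: the workers are threads from Python's standard concurrent.futures module, the task (sum of squares) is arbitrary, and the helper names partition, worker, and combine are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(work, n_parts):
    """Split a list into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(work), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(work[start:end])
        start = end
    return chunks

def worker(chunk):
    """Process one work unit: here, the sum of squares of its numbers."""
    return sum(x * x for x in chunk)

def combine(partial_results):
    """Merge the partial results into the final result."""
    return sum(partial_results)

work = list(range(1, 11))                      # "Work"
chunks = partition(work, 3)                    # w1, w2, w3
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(worker, chunks))  # r1, r2, r3
result = combine(partials)                     # "Result"
```

The same three functions would apply unchanged if the workers were processes or machines; only the executor would differ.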

Different Workers
 Different threads in the same core
 Different cores in the same CPU
 Different CPUs in a multi-processor system
 Different machines in a distributed system

Choices, Choices, Choices
 Commodity vs. “exotic” hardware
 Number of machines vs. processors vs. cores
 Bandwidth of memory vs. disk vs. network
 Different programming models

Flynn’s Taxonomy
                      Instructions
                      Single (SI)                      Multiple (MI)
Data  Single (SD)     SISD: Single-threaded process    MISD: Pipeline architecture
      Multiple (MD)   SIMD: Vector processing          MIMD: Multi-threaded programming

SISD
[Figure: one processor applies a single instruction stream to a single stream of data items (D D D D D D D)]

SIMD
[Figure: one processor applies a single instruction stream to multiple data streams in parallel, each stream carrying items D0, D1, D2, D3, D4, …, Dn]

MIMD
[Figure: two processors, each with its own instruction stream and its own stream of data items (D D D D D D D)]

Memory Typology: Shared
[Figure: four processors attached to a single shared memory]

Memory Typology: Distributed
[Figure: four processors, each with its own memory, connected by a network]

Memory Typology: Hybrid
[Figure: four nodes connected by a network, each node with its own memory shared by two processors]

Parallelization Problems
 How do we assign work units to workers?
 What if we have more work units than workers?
 What if workers need to share partial results?
 How do we aggregate partial results?
 How do we know all the workers have finished?
 What if workers die?
What is the common theme of all of these problems?

General Theme?
 Parallelization problems arise from:
 Communication between workers
 Access to shared resources (e.g., data)
 Thus, we need a synchronization system!
 This is tricky:
 Finding bugs is hard
 Fixing bugs is even harder
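
To make "access to shared resources" concrete, here is a minimal sketch of several workers updating one shared counter, with a lock as the synchronization mechanism. It assumes the workers are Python threads; the names SafeCounter and run_workers are hypothetical.

```python
import threading

class SafeCounter:
    """A counter whose updates are serialized by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one worker at a time may run this read-modify-write step.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value

def run_workers(n_workers, increments_each):
    """Start n_workers threads that each increment a shared counter."""
    counter = SafeCounter()

    def work():
        for _ in range(increments_each):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has finished
    return counter.value

total = run_workers(4, 10_000)
```

Without the lock, two workers could read the same old value and one update could be lost; whether that actually happens on a given run depends on scheduling, which is part of why such bugs are hard to find.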

Managing Multiple Workers
 Difficult because
 (Often) don’t know the order in which workers run
 (Often) don’t know where the workers are running
 (Often) don’t know when workers interrupt each other
 Thus, we need:
 Semaphores (lock, unlock)
 Condition variables (wait, notify, broadcast)
 Barriers
 Still, lots of problems:
 Deadlock, livelock, race conditions, ...
 Moral of the story: be careful!
 Even trickier if the workers are on different machines
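
Of the primitives listed above, a barrier is the simplest to show in isolation. The sketch below uses threading.Barrier from Python's standard library so that no worker starts its second phase until every worker has finished its first; the two-phase task and the function name run_two_phases are assumptions for illustration.

```python
import threading

def run_two_phases(n_workers):
    """Run workers through two phases separated by a barrier."""
    barrier = threading.Barrier(n_workers)
    phase1 = [None] * n_workers
    phase2 = [None] * n_workers

    def worker(i):
        phase1[i] = i * i        # phase 1: independent partial result
        barrier.wait()           # block until all workers reach this point
        phase2[i] = sum(phase1)  # phase 2: all partial results are ready

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return phase1, phase2

phase1, phase2 = run_two_phases(4)
```

Each worker writes only its own slot in phase 1, so no lock is needed there; the barrier alone guarantees that phase 2 never reads a slot that is still empty.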

Patterns for Parallelism
 Parallel computing has been around for decades
 Here are some “design patterns” …

Master/Slaves
slaves
master

Producer/Consumer Flow
[Figure: flow of work from producers (P) to consumers (C)]

Work Queues
[Figure: producers (P) place work items (W W W W W) on a shared queue; consumers (C) take them off]
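
A shared work queue can be sketched with Python's standard queue.Queue, which handles the locking internally. This is one possible arrangement, assuming a single producer thread, several consumer threads, and a sentinel object to signal shutdown; the function name run_work_queue and the doubling task are made up for the example.

```python
import queue
import threading

def run_work_queue(items, n_consumers):
    """Push items through a shared queue served by n_consumers threads."""
    shared = queue.Queue()
    results = []
    results_lock = threading.Lock()
    stop = object()  # sentinel telling a consumer to exit

    def producer():
        for item in items:
            shared.put(item)
        for _ in range(n_consumers):
            shared.put(stop)  # one sentinel per consumer

    def consumer():
        while True:
            item = shared.get()
            if item is stop:
                break
            with results_lock:
                results.append(item * 2)  # the "work": double the item

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

doubled = run_work_queue(range(5), 3)
```

Because consumers pull work when they are free, the pattern copes naturally with having more work units than workers. The order in which items complete is not fixed, so the results are sorted before being returned.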

Rubber Meets Road
 From patterns to implementation:
 pthreads, OpenMP for multi-threaded programming
 MPI for cluster computing
 …
 The reality:
 Lots of one-off solutions, custom code
 Write your own dedicated library, then program with it
 Burden on the programmer to explicitly manage everything
 MapReduce to the rescue!
 (for next time)
