BCS601 Module 5 PDF
Module 5
Cloud Programming and Software Environments
This section provides a summary of key features found in real-world cloud and grid platforms.
We present this information across four tables, each focusing on a different aspect: capabilities,
traditional features, data-related features, and those relevant to programmers and runtime
systems. These tables serve as a reference guide for developers aiming to program and utilize cloud infrastructure efficiently.
Commercial cloud platforms are designed to offer broad capabilities, as summarized in Table
6.1. These capabilities enable cost-effective utility computing with the elasticity to scale
resources up or down based on demand. Beyond this core characteristic, commercial clouds
increasingly provide additional services under the umbrella of Platform as a Service (PaaS).
For example, Microsoft Azure includes platform services such as Azure Table, queues, blobs,
SQL Database, and both Web and Worker roles. While Amazon Web Services (AWS) is
traditionally associated with Infrastructure as a Service (IaaS), it has steadily expanded its
platform offerings to include SimpleDB (analogous to Azure Table), message queues,
notification services, monitoring tools, content delivery networks, relational databases, and
MapReduce support through Hadoop. Google, although not offering a comprehensive cloud
service like Azure or AWS, provides the Google App Engine (GAE), a robust environment for
developing and hosting web applications.
Table 6.2 outlines various low-level infrastructure features. Table 6.3 presents traditional
programming models and environments used for parallel and distributed systems—capabilities
that are increasingly expected in cloud platforms, either as part of the system or user
environment. Table 6.4 highlights newer features emphasized by commercial cloud providers
and, to a lesser extent, some grid systems. Many of these features have only recently seen wide
adoption and are not yet available in most academic cloud infrastructures such as Eucalyptus,
Nimbus, OpenNebula, or Sector/Sphere (though Sector, a data-parallel file system or DPFS, is
categorized in Table 6.4).
6.1.2.1 Workflow
• Data transfer in and out of commercial clouds can be slow and costly.
• Uses simple protocols like HTTP.
• High-speed links may be introduced for better performance in national infrastructure.
• Cloud data (e.g., Azure blobs) supports parallel processing.
6.1.3.6 Queuing Services
• Amazon and Azure provide message queues for communication between application
components.
• They expose REST interfaces with "deliver-at-least-once" semantics.
• Alternatives: ActiveMQ, NaradaBrokering.
• Azure uses Worker roles for background tasks and Web roles for web portals.
• No need for explicit task scheduling.
• Uses queues for distributed task management.
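Because the queues only promise "deliver-at-least-once," a message can reach a consumer more than once, so handlers are normally written to be idempotent. A minimal Python sketch of that pattern (the message dictionaries and do_work function are invented for illustration; this is not a real queue SDK):

processed_ids = set()   # in production this would live in durable storage

def do_work(body):
    print("processing:", body)   # stand-in for real application logic

def handle(message):
    # message is assumed to look like {"id": "...", "body": "..."}
    if message["id"] in processed_ids:
        return                    # duplicate redelivery: safe to ignore
    do_work(message["body"])
    processed_ids.add(message["id"])

handle({"id": "m-1", "body": "resize image 42"})
handle({"id": "m-1", "body": "resize image 42"})   # redelivered copy is skipped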
6.1.4.2 MapReduce
• Registry: Information resource for the system (the system's version of metadata management)
• Security: Security features other than basic authentication and authorization; includes higher-level concepts such as trust
• Scheduling: Basic staple of Condor, Platform, Oracle Grid Engine, etc.; clouds have this implicitly, as is especially clear with the Azure Worker Role
• Gang scheduling: Assigns multiple (data-parallel) tasks in a scalable fashion; note that this is provided automatically by MapReduce
• Software as a Service (SaaS): Shared between clouds and grids, and can be supported without special attention; note that services and the corresponding service-oriented architectures are very successful and are used in clouds much as in earlier distributed systems
• Virtualization: Basic feature of clouds supporting the "elastic" behavior highlighted by Berkeley as characteristic of what defines a (public) cloud; includes virtual networking, as in ViNe from the University of Florida
• Blob:
o Basic cloud storage (e.g., Azure Blob, Amazon S3).
o Used for storing large unstructured data like images, backups, and videos.
• DPFS (Distributed Parallel File Systems):
o Examples: Google File System, HDFS (Hadoop), Cosmos (Dryad).
o Designed with compute-data affinity for efficient large-scale data processing.
• SQL:
o Traditional relational database support (e.g., MySQL, Oracle).
o Offered by both Amazon and Azure.
• Table (NoSQL):
o Schema-free data structures like Amazon SimpleDB, Azure Table, Apache
HBase.
o Part of the NoSQL movement focused on scalability and flexibility.
• MapReduce:
o Programming model for distributed data processing.
o Examples: Hadoop (Linux), Dryad (Windows), Twister.
o Related languages: Sawzall, Pregel, Pig Latin, LINQ.
• Programming Model:
o Cloud programming built on familiar web/grid paradigms.
o Integrates other platform features like data and task management.
• Worker Role:
o Concept from Azure for background task execution.
o Implicitly used in both Amazon and grid systems.
• Web Role:
o Used in Azure to handle web interfaces or portals.
o Comparable to Google App Engine (GAE) functionality.
• Fault Tolerance:
o Key feature in clouds but largely neglected in traditional grids.
o Enables system resilience against node failures.
• Monitoring:
o Tools like Inca in grid environments.
o Often implemented using publish-subscribe mechanisms.
• Notification:
o Event or status updates delivered via publish-subscribe systems.
• Queues:
o Used for communication and task coordination between services.
o Often based on publish-subscribe models.
• Scalable Synchronization:
o Tools like Apache Zookeeper or Google Chubby.
o Supports distributed locks and coordination.
o Used in BigTable; it is unclear whether Azure Table or SimpleDB use a comparable service.
Benefits
Challenges
• Managing these systems is complex due to task coordination, data sharing, and
communication.
1. Partitioning
• Computation Partitioning: Break the program into smaller tasks that can run at the
same time.
• Data Partitioning: Split input data into chunks that different parts of the program can
process in parallel.
2. Mapping
• Assign the smaller tasks or data pieces to specific computing resources (like assigning
jobs to workers).
3. Synchronization
4. Communication
• When tasks need to share data, they communicate over the network (especially in
distributed systems).
5. Scheduling
• Decides which task runs when, especially when there are more tasks than available
workers.
• Managing all of the above manually is hard and slows down development.
• Programming paradigms/models help by hiding complex details from the
programmer.
Popular Paradigms/Models
o More fault-tolerant and scalable than older models like MPI (Message Passing
Interface).
Section 6.2.2 elaborates on MapReduce, an important framework for processing large-scale data in parallel across distributed systems, and introduces its extension, Twister, for iterative computations. Here is a summarized and simplified explanation:
What is MapReduce?
Key Features:
This diagram (Figure 6.1) illustrates the MapReduce framework, which processes large-scale
data using a distributed computing model. Let's break down each component:
1. Input Files
• These are the raw data chunks (e.g., text files, logs) stored in a distributed file system
(like HDFS).
• The framework splits the input data and distributes it to multiple Map tasks.
2. Map Function
• Each Map task reads a portion of the input data and processes it into intermediate key-
value pairs.
• Example: For a word count task, it might emit pairs like (word, 1) for each word found.
3. Intermediate Results
• The intermediate (key, value) pairs produced by all Map tasks are collected, sorted, and grouped by key.
4. Reduce Function
• Takes each key and a list of its values (e.g., (word, [1,1,1])) and performs aggregation or
summarization (e.g., total count → (word, 3)).
• Produces the final output results.
5. Output Files
• The final reduced data is written to output files, usually stored again in a distributed file
system.
6. Controller / MapReduceLibrary
7. User Interfaces
In Simple Terms:
This diagram (Figure 6.2) shows the logical data flow in the MapReduce model, broken down
into 5 processing stages involving successive (key, value) pairs. It provides a deeper look into
how data is transformed through the MapReduce pipeline.
Step-by-Step Explanation
1. Input
• Input data is read as lines of text, each associated with a unique key (often byte offset)
and value (content of the line).
• Format: (key, value) — e.g., (offset, "This is a line of text").
2. Map Stage
• The Map function is applied to each input (key, value) pair and emits a set of intermediate (key, value) pairs.
3. Sort Stage
• Intermediate pairs from all Map tasks are sorted by key.
4. Group Stage
• Values that share the same intermediate key are gathered into a single list.
o E.g., (key1, [val1, val2, val3, ...])
• This prepares data for aggregation.
5. Reduce Stage
• The Reduce function takes each grouped key and list of values.
• It applies logic (like summing counts or averaging) to output a single (key, value) pair per
group.
o Example: ("word", 5) if the word appeared 5 times.
Final Output
1. Input to Map is a (key, value) pair (e.g., line number and text).
2. Map emits intermediate (key, value) pairs (e.g., (word, 1)).
3. Intermediate pairs are sorted and grouped by key.
4. Each group like (word, [1,1,1]) is passed to Reduce, which outputs final results like
(word, 3).
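The whole flow can be mimicked in a few lines of ordinary Python; this single-process sketch only illustrates the (key, value) transformations, not a real distributed runtime (the function names are ours):

from collections import defaultdict

def map_fn(offset, line):
    # Map: emit an intermediate (word, 1) pair for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: fold the list of values for one key into a single result
    return (word, sum(counts))

def word_count(lines):
    groups = defaultdict(list)
    for offset, line in enumerate(lines):            # Input: (offset, line) pairs
        for key, value in map_fn(offset, line):      # Map stage
            groups[key].append(value)                # Sort/Group stage
    return [reduce_fn(k, v) for k, v in sorted(groups.items())]   # Reduce stage

print(word_count(["this is a line of text", "this is another line"]))
# [('a', 1), ('another', 1), ('is', 2), ('line', 2), ('of', 1), ('text', 1), ('this', 2)]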
Formal Notation:
• Map: (key1, value1) → list(key2, value2)
• Reduce: (key2, list(value2)) → list(value3)
Solving Strategy
• Key: Unique identifier (e.g., word, word length, sorted letters for anagram)
• Value: What you want to count or process (e.g., 1 for occurrences)
7. Communication: Reduce workers fetch intermediate data from all Map workers.
8. Sorting & Grouping: Data is grouped by key on each reduce node.
9. Reduce Function: Final aggregation and writing of output.
This diagram (Figure 6.4) illustrates the partitioning function in the MapReduce model —
specifically how MapWorkers assign output to the appropriate ReduceWorkers based on key
values.
Goal of Partitioning:
To distribute the intermediate key-value pairs generated by MapWorkers evenly and correctly to
the appropriate ReduceWorkers.
Components Breakdown
MapWorkers
• Each MapWorker processes a chunk of the input data and emits intermediate (key,
value) pairs.
• These key-value pairs must be passed to the right ReduceWorker.
Partitioning Function
A typical partitioning rule (Python-style):
partition = hash(key) % num_reducers
• Example:
o Keys with hash(key) % 3 == 0 go to Reducer 1
o Keys with hash(key) % 3 == 1 go to Reducer 2
o Keys with hash(key) % 3 == 2 go to Reducer 3
• All MapWorkers use the same partitioning logic to ensure consistency.
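A slightly fuller sketch of the same idea, using zlib.crc32 as a stable stand-in hash (Python's built-in hash() is randomized between runs, so real frameworks use a deterministic hash that every worker shares):

import zlib

NUM_REDUCERS = 3

def partition(key):
    # every MapWorker applies the same function, so a given key
    # always lands in the same region / ReduceWorker
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

regions = {r: [] for r in range(NUM_REDUCERS)}
for key, value in [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]:
    regions[partition(key)].append((key, value))

print(regions)   # both ("apple", 1) pairs end up in the same region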
Regions
• The diagram uses colored boxes (1, 2, 3) to represent data partitions (or regions).
• Each region is assigned to a specific ReduceWorker.
ReduceWorkers
Summary
Component Role
Analogy:
Imagine 4 teachers (MapWorkers) grading exam papers. Based on a student's ID (key), they use
a rule (Partitioning function) to decide which department head (ReduceWorker) the scores
should be sent to. This ensures each head compiles grades only for their assigned group of
students.
This diagram (Figure 6.5) illustrates the dataflow implementation in a MapReduce job,
showing detailed internal operations in both Map workers and Reduce workers. It breaks down
the workflow into stages like partitioning, combining, synchronization, communication,
sorting, and reducing.
1. Input Splitting
• Input data is divided into chunks, and each chunk is processed by a MapWorker.
• Each worker receives an Input split.
2. Map Function
• Each input record (like a line of text) is passed to the Map function.
• Output: A set of intermediate key-value pairs — e.g., (K1, V1).
3. Combiner Function (optional)
• A local reduce step on each Map worker that merges values sharing the same key before they leave the node.
• Output: Fewer intermediate pairs, e.g., (K1, V') instead of multiple (K1, V).
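A minimal sketch of such a local combiner, assuming the values are counts that can simply be summed (the function name is ours):

from collections import defaultdict

def combine(pairs):
    # local reduce on the Map worker: merge values that share a key
    merged = defaultdict(int)
    for key, value in pairs:
        merged[key] += value
    return list(merged.items())

print(combine([("the", 1), ("cat", 1), ("the", 1), ("the", 1)]))
# [('the', 3), ('cat', 1)]  -- fewer pairs cross the network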
4. Partitioning Function
• Decides which Reduce worker receives each intermediate key (e.g., hash(key) mod R, as in Figure 6.4).
5. Synchronization and Communication (Shuffle)
• Intermediate key-value pairs are sent across the network to Reduce workers.
• During this Shuffle step:
o Data from all MapWorkers is synchronized.
o Each ReduceWorker receives all values for a given key from all Mappers.
6. Reduce Worker
Each ReduceWorker:
• Sorts and groups the received intermediate pairs by key.
• Applies the Reduce function to each key and its list of values.
• Writes the final (key, value) results to the output.
Summary Table
Stage Role
Real-World Analogy:
Imagine multiple teachers (MapWorkers) each grading student assignments (Input splits). After
marking:
• They summarize scores (Combiner).
• Each student's scores go to a specific teacher for final tabulation (Partitioning and
Shuffle).
• The receiving teacher calculates the final grade (Reduce), and stores it (Output).
Compute-Data Affinity
MapReduce tries to send computation to the data (on the same node), not the other way
around—this improves efficiency and reduces network use. Google’s GFS stores files in blocks,
aligning well with this strategy.
Standard MapReduce is not efficient for iterative tasks (like machine learning or graph
processing) because it writes intermediate results to disk.
MapReduce vs MPI
• δ (delta) flow: Represents the minimal update information exchanged in each iteration.
• Full data flow: Transferring all data, even unchanged parts, each time — as done in
MapReduce.
Why Classic MapReduce is Slow for Iterative Algorithms
MapReduce's architecture:
These changes:
Twister (MapReduce++)
Features:
Referenced Figures:
This diagram shows the control flow of a MapReduce job—from starting the user program to
executing map and reduce tasks on distributed worker nodes and finally writing output files.
Components Involved
Step-by-Step Breakdown
(1) Start
(2) Fork
o The user program forks a master process and multiple worker processes (Map and Reduce workers).
(4) Read
• Each Map worker reads its assigned input split from the input files.
(5) Map
• Each Map worker executes the Map function on its split, producing intermediate (key,
value) pairs.
• Intermediate results from all Map workers are shuffled and transferred (sorted and
grouped by key) to the appropriate Reduce workers.
(11) Reduce
• Reduce workers perform the Reduce function on received grouped data, processing keys
and their associated values.
(12) Write
• Final reduced output is written to output files (e.g., File 1, File 2).
Summary
Stage: Function
User program: starts the job and forks workers
Master: coordinates and assigns map/reduce tasks
Workers: read, process (map/reduce), and write data
Communication: shuffle phase moves intermediate data to reducers
This figure illustrates how the MapReduce framework orchestrates parallel data processing,
automatically handling task assignment, data movement, and job coordination to process
large datasets efficiently.
This figure explains the Twister framework, an enhanced version of MapReduce (often called
MapReduce++) designed for iterative applications, such as K-means clustering, PageRank,
and machine learning algorithms.
1. Configure()
• Loads static data (e.g., constants, static matrices) once before the iterative process starts.
2. Map(key, value)
• Each iteration begins by applying the Map function to dynamic data (e.g., input
samples).
• Generates intermediate (key, value) pairs.
3. Reduce(key, list<value>)
• The Reduce function receives all values associated with a key, typically to aggregate or
compute a result for that key.
4. Combine(key, list<value>)
• Merges the outputs of all Reduce tasks into one collective result that is handed back to the user program.
5. δ-flow (delta flow)
• Twister introduces δ-flow, a small piece of data (delta) communicated between iterations,
instead of full data like in traditional MapReduce. This improves performance.
6. Iteration Control
• After each Reduce (or Combine), the User Program evaluates results and decides whether to iterate again (loop back to Map).
7. Close()
• Once convergence or a stopping condition is met, the program finalizes the process and
cleans up.
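The control flow above can be pictured with a small, single-process Python sketch that runs 1-D k-means until convergence; the function names mirror Twister's phases but are not the real Twister (Java) API, and the data points are invented:

def map_fn(points, centroids):
    # Map: assign each point to its nearest centroid -> (centroid_index, point)
    return [(min(range(len(centroids)), key=lambda i: abs(p - centroids[i])), p)
            for p in points]

def reduce_fn(pairs, k):
    # Reduce: recompute every centroid as the mean of its assigned points
    sums, counts = [0.0] * k, [0] * k
    for idx, p in pairs:
        sums[idx] += p
        counts[idx] += 1
    return [sums[i] / counts[i] if counts[i] else 0.0 for i in range(k)]

def run(points, centroids, tol=1e-6):
    while True:                                   # driver loop (the "user program")
        new = reduce_fn(map_fn(points, centroids), len(centroids))
        delta = max(abs(a - b) for a, b in zip(new, centroids))   # δ-flow: only the change matters
        centroids = new
        if delta < tol:                           # convergence check, then Close()
            return centroids

print(run([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], [0.0, 5.0]))   # approx. [1.0, 9.53]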
This part shows how Twister is implemented at runtime with multiple components:
Components:
• MR Daemon (D): Long-running processes managing Map and Reduce workers. Unlike
traditional MapReduce, Twister reuses workers between iterations.
• M / R: Map and Reduce workers.
• MR Driver: Central coordinator responsible for managing iterations, task assignments,
and communications.
• User Program: Drives the iterations, convergence checks, and termination.
• Pub/Sub Broker Network: Used for efficient communication (publish/subscribe
model) between distributed workers.
• Data Splits: Input data is split and fed into the system.
• File System: Provides access to static and initial input data.
• Data Read/Write: Happens through MR Daemon and workers.
• Communication: δ-flows and task coordination occur over the Pub/Sub network.
Summary
Twister is an efficient iterative MapReduce framework built for tasks that require multiple
passes over data. It:
Explanation of Figure 6.8: Performance of K-means Clustering for MPI, Twister, Hadoop,
and DryadLINQ
Experiment Overview
o Highly optimized and low-overhead parallelism
o Transfers only necessary updates (δ flow)
o Ideal for tightly-coupled, compute-intensive tasks
Observations
• All systems show some increase in execution time as the data size grows.
• However:
o MPI and Twister scale much better than Hadoop and DryadLINQ.
o MPI consistently delivers the fastest results.
o Twister offers a balance between ease-of-use (MapReduce-like) and high
performance.
Conclusion
• For iterative machine learning tasks like K-means, traditional MapReduce frameworks
like Hadoop are inefficient.
• Twister improves upon this with support for iteration and in-memory computation.
• MPI remains the most performant but is harder to program and manage.
Figure 6.9 presents a comparison of thread and process structures in four parallel
programming paradigms: Hadoop, Dryad, Twister (MapReduce++), and MPI. Here's a
breakdown of what each model represents and how they differ:
1. Yahoo Hadoop
2. Microsoft Dryad
3. Twister (MapReduce++)
Key Observations:
Interpretation:
• Twister performs much faster than Hadoop across all data sizes.
• Hadoop suffers from high overhead due to:
o Writing intermediate data to disk.
o Short-running task management.
• Twister:
o Uses long-running tasks and in-memory communication (no disk I/O between
iterations).
o Excels in iterative applications and large-scale data processing.
Conclusion:
Twister is a highly efficient MapReduce runtime for iterative applications like those in
machine learning or graph analytics. Its performance advantage over Hadoop grows as dataset
size increases, making it a better fit for big data processing at scale.
Summary:
6.2.3 Hadoop Library from Apache
Apache Hadoop is an open-source framework developed in Java that allows for distributed
processing of large data sets across clusters of computers using a model called MapReduce. It
was designed as an alternative to Google's proprietary systems and includes its own file storage
layer (HDFS) and computation engine (MapReduce engine).
1. MapReduce Engine:
o This is the processing engine that runs computations on the data stored in HDFS.
o It breaks down tasks into Map and Reduce phases for parallel processing.
2. HDFS (Hadoop Distributed File System):
o A specialized file system designed for high-throughput access to large files.
o Inspired by Google’s GFS, it is tailored for large-scale data storage and access.
HDFS Architecture
1. Fault Tolerance
• Block Replication:
o Each block is copied to multiple DataNodes to prevent data loss.
o Default replication factor is 3.
• Replica Placement Strategy:
o 1 copy on the same node as the original.
o 1 copy on a different node within the same rack.
o 1 copy on a node in a different rack (for added reliability).
• Heartbeat and Blockreport Messages:
o Heartbeats: Sent by DataNodes to let the NameNode know they are working.
o Blockreports: Contain lists of all the blocks stored on each DataNode, helping
the NameNode manage data placement.
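A toy Python sketch of the replica placement strategy described above; the rack map and node names are invented, and the real logic of course lives inside the NameNode:

import random

RACKS = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}   # hypothetical cluster map

def place_replicas(writer_node):
    writer_rack = next(r for r, nodes in RACKS.items() if writer_node in nodes)
    other_rack = random.choice([r for r in RACKS if r != writer_rack])
    return [
        writer_node,                                                         # copy 1: same node as the original
        random.choice([n for n in RACKS[writer_rack] if n != writer_node]),  # copy 2: different node, same rack
        random.choice(RACKS[other_rack]),                                    # copy 3: node in a different rack
    ]

print(place_replicas("n2"))   # e.g. ['n2', 'n3', 'n5']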
2. High Throughput
Reading a File
Writing a File
Summary
Apache Hadoop, through its components MapReduce and HDFS, enables efficient, reliable
processing and storage of massive data sets across clusters. HDFS, with its fault tolerance and
high throughput capabilities, ensures data is stored safely and accessed quickly, making it ideal
for big data applications.
6.2.3.1 Architecture of MapReduce in Hadoop
MapReduce is the topmost layer in the Hadoop framework. It coordinates and manages the
processing of large data sets using parallel, distributed computing.
Architecture Overview
• Task Slots:
o Each TaskTracker has a fixed number of execution slots based on the node’s
CPU capabilities.
o Example: A node with N CPUs, each supporting M threads, will have N × M
slots.
• Slot Usage:
o Each slot runs one map or reduce task at a time.
o One map task = One data block → This means there's a 1:1 relationship
between map tasks and data blocks stored in HDFS.
Key Points
This section explains how Hadoop’s MapReduce component works on top of HDFS to process
big data efficiently through distributed execution and parallelism
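For completeness, Hadoop's Streaming utility lets map and reduce tasks be ordinary scripts that read standard input and write standard output, which is how Python code can run inside TaskTracker slots. A minimal word-count pair might look like this (file names are ours, and the exact path of the streaming JAR depends on the installation):

# mapper.py -- one line of the input split per stdin line; emit "word<TAB>1"
import sys
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py -- input arrives sorted by key; sum the counts per word
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current and current is not None:
        print(current + "\t" + str(total))
        total = 0
    current = word
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))

Such scripts would typically be launched with something like: hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py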
The diagram in Figure 6.11 illustrates the architecture of Hadoop, specifically showing how
the MapReduce engine and HDFS (Hadoop Distributed File System) interact within a
distributed cluster setup.
Overview
The cluster is organized into multiple nodes, grouped into racks (Rack 1 and Rack 2).
Key Takeaways
• Node 1 is the master node with JobTracker (MapReduce master) and NameNode
(HDFS master).
• Nodes 2, 3, 4 are worker nodes with both TaskTrackers and DataNodes.
• Blocks are stored in DataNodes; Tasks are executed in TaskTrackers.
• The architecture supports fault tolerance, parallel processing, and data locality.
When a MapReduce job is run in Hadoop, three main components are involved:
Step-by-Step Process:
1. Job Submission
The user submits a job from their node to the JobTracker. Here's what happens:
2. Task Assignment
• The JobTracker:
o Creates one map task per input split.
o Assigns map tasks to TaskTrackers by considering data locality (to reduce data
movement).
o Creates reduce tasks (number is set by the user).
o Assigns reduce tasks to TaskTrackers without locality considerations.
3. Task Execution
• Each TaskTracker:
o Copies the job's JAR file.
o Launches a Java Virtual Machine (JVM).
o Executes the assigned map or reduce task using instructions from the JAR.
• Each TaskTracker:
o Sends periodic heartbeat messages to the JobTracker.
o Heartbeats indicate that the TaskTracker is alive and whether it’s ready for new
tasks.
Process Flow
1. Job Submission
• The User submits a MapReduce job to the JobTracker.
• JobTracker:
o Gets metadata from NameNode to know data locations.
o Splits the job into tasks (map/reduce).
2. Task Assignment
• Based on data locality (data block locations known via NameNode), the JobTracker
assigns tasks to the appropriate TaskTrackers.
• Map tasks are scheduled closer to where the data blocks reside (shown in the
DataNodes).
• Reduce tasks are assigned without considering locality.
3. Task Execution
• Each TaskTracker launches one or more Java Virtual Machines (JVMs) to execute:
o Map tasks (left and middle nodes).
o Reduce task (right node).
• JVMs use the code in the submitted JAR to process the task.
4. Heartbeat
Important Notes:
• Dryad allows the user to define their own dataflow using a Directed Acyclic Graph (DAG).
• Each vertex in the DAG is a computational task (a program).
• Each edge is a data channel between tasks (for communication).
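To make the vertex/edge idea concrete, here is a tiny single-process Python sketch that executes a hand-built DAG in dependency order using the standard library's graphlib (Python 3.9+); Dryad itself would schedule each vertex on a cluster node and stream data along the edges:

from graphlib import TopologicalSorter

results = {}                                   # stands in for the data channels (edges)

def grep():                                    # vertex 1: filter lines containing "a"
    results["grep"] = [l for l in ["a x", "b y", "a z"] if "a" in l]

def count():                                   # vertex 2: consume grep's output channel
    results["count"] = len(results["grep"])

vertices = {"grep": grep, "count": count}
dag = {"count": {"grep"}}                      # edge: grep -> count

for name in TopologicalSorter(dag).static_order():
    vertices[name]()                           # Dryad's job manager would schedule these remotely

print(results)   # {'grep': ['a x', 'a z'], 'count': 2}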
Figure 6.13 Dryad framework and its job structure, control and data flow
1. Job Definition
2. Execution Components
• Job Manager:
o Constructs the DAG from the user-defined program.
o Schedules tasks on available nodes in the cluster.
o Manages execution but does not handle data movement (avoids becoming a
bottleneck).
• Name Server:
o Maintains a list of available computing resources (cluster nodes).
o Helps the job manager in resource allocation and understanding network
topology.
3. Deployment
• Job manager maps the logical DAG to physical resources using info from the name
server.
• A daemon runs on each node:
o Acts as a proxy.
o Receives the program binary.
o Executes the assigned tasks.
o Reports status back to the job manager.
Fault Tolerance
Advanced Features
• Higher-level programming models can be built on top of Dryad, including:
o MapReduce
o SQL-like processing (via DryadLINQ)
Summary
Feature Dryad
Why DryadLINQ?
DryadLINQ enables regular .NET developers to write scalable distributed programs using
familiar C# and LINQ syntax, without needing to deal with low-level parallelism or distributed
programming challenges.
It hides the complexity of:
• Task scheduling
• Data partitioning
• Fault tolerance
• Network communication
• Vertices (tasks) are scheduled and executed on available cluster resources.
6. Vertex Execution
• Each vertex runs its own piece of logic (as compiled earlier).
• These are independent and run in parallel where possible.
7. Output Generation
Summary Table
Step Description
Benefits of DryadLINQ
What is Sawzall?
o Automatically uses cluster computing and redundant servers for performance and
reliability.
• Fast Data Analysis:
o Transformed batch jobs that took a full day into interactive sessions.
o Enabled new, real-time ways to analyze and utilize big data.
• Open Source:
o Sawzall has been made available as an open source project.
1. Data Partitioning:
o Input data is split and processed locally using Sawzall scripts.
2. Local Processing:
o Each node filters and processes its portion of the data using custom scripts.
3. Aggregation:
o Intermediate results are emitted to aggregators (called tables).
o Final results are generated by collecting data from these aggregators.
2. Input handling:
3. Runtime Translation:
The Sawzall engine translates scripts into MapReduce jobs that run across multiple machines.
Conclusion
Sawzall provides a powerful abstraction over MapReduce, allowing users to write simple, high-
level scripts for complex data analysis tasks. Its automatic handling of parallelism, fault
tolerance, and aggregation makes it well-suited for large-scale log processing and other similar
applications.
Table 6.8 Pig Latin Data Types (columns: Data Type, Description, Example)
Table 6.9 Pig Latin Operators

LOAD: Read data from the file system.
STORE: Write data to the file system.
FOREACH ... GENERATE: Apply an expression to each record and output one or more records.
FILTER: Apply a predicate and remove records that do not return true.
GROUP/COGROUP: Collect records with the same key from one or more inputs.
JOIN: Join two or more inputs based on a key.
CROSS: Cross-product two or more inputs.
UNION: Merge two or more data sets.
SPLIT: Split data into two or more sets, based on filter conditions.
ORDER: Sort records based on a key.
DISTINCT: Remove duplicate tuples.
STREAM: Send all records through a user-provided binary.
DUMP: Write output to stdout.
LIMIT: Limit the number of records.
Summary
This section explains how different types of applications map to parallel and distributed systems, based on six distinct application architectures. These categories help in understanding how various problems can be executed efficiently on different computing models such as clusters, grids, and clouds.
This sixth category addresses modern big data and data analytics applications that emerged
with MapReduce and its variants. It’s a hybrid of categories 2 and 4 but focused specifically on
data flow.
Subcategories of MapReduce++
1. Map-Only Applications:
o Similar to Category 4.
o Each task reads data, processes it independently, and outputs results.
o No reduce step involved (e.g., log scanning, data filtering).
2. Classic MapReduce:
o File-to-file processing with two phases:
▪ Map phase: processes input data in parallel.
▪ Reduce phase: aggregates intermediate outputs.
o Automatically handles parallelism and fault tolerance.
3. Extended MapReduce:
o More complex patterns like iterations, joins, or graph processing.
o Builds on traditional MapReduce with additional dataflow patterns (covered
earlier in Section 6.2.2).
Summary of Differences
This classification helps choose the right programming model and infrastructure for a given type
of application—whether it's for simulations, data analysis, or event-driven systems.
Table 6.11 Comparison of MapReduce++ Subcategories along with the Loosely Synchronous Category Used in MPI

• Map-Only: input → map() → output. Example application: PolarGrid MATLAB data analysis (www.polargrid.org).
• Classic MapReduce: input → map() → reduce() → output. Example application: distances for sequences (BLAST).
• Iterative MapReduce: input → map() → reduce() → output, with the map/reduce cycle repeated until convergence. Example applications: deterministic annealing clustering, multidimensional scaling (MDS).
• Loosely Synchronous: iterative compute-communication steps, as in MPI. Example application: particle dynamics with short-range forces.

Domain of MapReduce and Iterative Extensions → MPI
Table 6.10 Application Classification for Parallel and Distributed Systems

Category 1: Synchronous (machine architecture: SIMD)
The problem class can be implemented with instruction-level lockstep operation as in SIMD architectures.

Category 2: Loosely synchronous (BSP or bulk synchronous processing) (machine architecture: MIMD on MPP, massively parallel processor)
These problems exhibit iterative compute-communication stages with independent compute (map) operations for each CPU that are synchronized with a communication step. This problem class covers many successful MPI applications, including partial differential equation solutions and particle dynamics applications.

Category 3: Asynchronous (machine architecture: shared memory)
Illustrated by Computer Chess and Integer Programming; combinatorial search is often supported by dynamic threads. This is rarely important in scientific computing, but it is at the heart of operating systems and concurrency in consumer applications such as Microsoft Word.

Category 4: Pleasingly parallel (machine architecture: grids moving to clouds)
Each component is independent. In 1988, Fox estimated this at 20 percent of the total number of applications, but that percentage has grown with the use of grids and data analysis applications including, for example, the Large Hadron Collider analysis for particle physics.

Category 5: Metaproblems (machine architecture: grids of clusters)
These are coarse-grained (asynchronous or dataflow) combinations of categories 1-4 and 6. This area has also grown in importance and is well supported by grids and described by workflow in Section 3.5.

Category 6: MapReduce++ (Twister) (machine architecture: data-intensive clouds; a) master-worker or MapReduce, b) MapReduce, c) Twister)
This describes file (database) to file (database) operations, which have three subcategories (see also Table 6.11):
6a) Pleasingly Parallel Map Only (similar to category 4)
6b) Map followed by reductions
6c) Iterative "Map followed by reductions" (extension of current technologies that supports linear algebra and data mining)
Google App Engine (GAE) is a cloud platform that allows developers to build and host web
applications on Google's infrastructure. It supports languages like Java and Python, and
provides built-in tools for scalable, secure, and efficient cloud development.
• Java Support:
o Eclipse plug-in: Enables local debugging.
o GWT (Google Web Toolkit): Helps develop dynamic web apps in Java.
o Other languages, such as JavaScript and Ruby, can also be used through JVM-based interpreters.
• Python Support:
o Common frameworks include Django and CherryPy.
o Google provides a lightweight webapp framework for Python.
• GAE Datastore:
o NoSQL, schema-less entity storage.
o Each entity:
▪ Max size: 1 MB.
▪ Identified by key-value properties.
o Querying:
▪ Filtered and sorted by property values.
▪ Strongly consistent using optimistic concurrency control.
• Java APIs:
o Use JDO or JPA via DataNucleus Access Platform.
• Python API:
o Uses GQL (SQL-like query language).
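For instance, with the classic App Engine Python SDK a GQL query looks roughly like the sketch below; the Greeting model and its properties are invented for illustration:

# Sketch only: requires the (legacy) Google App Engine Python runtime.
from google.appengine.ext import db

class Greeting(db.Model):                      # a datastore "kind" (schema-less entity type)
    author = db.StringProperty()
    date = db.DateTimeProperty(auto_now_add=True)

# GQL is SQL-like, but always returns whole entities of a single kind
query = db.GqlQuery("SELECT * FROM Greeting WHERE author = :1 ORDER BY date DESC",
                    "alice")
for greeting in query.run(limit=10):
    print(greeting.author, greeting.date)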
• Transactions:
o Multiple operations can be grouped in a single atomic transaction.
o Operates within entity groups to maintain performance.
o Automatic retries if conflicts occur.
• Memcache:
o In-memory cache to boost performance.
o Works with or without the datastore.
• Blobstore:
o For large files (up to 2 GB).
o Suitable for media content (e.g., videos, images).
3. Internet & External Services Access
• URL Fetch:
o Allows apps to fetch web resources using HTTP/HTTPS.
o Uses Google’s fast internal network for efficient retrieval.
• Secure Data Connection (SDC):
o Tunnels through the internet to link an intranet with a GAE app.
• Mail Service:
o Enables sending emails from the application.
• Google Data API:
o Access services like Maps, YouTube, Docs, Calendar, etc., within your app.
• Cron Service:
o Schedule tasks periodically (e.g., hourly, daily).
o Ideal for maintenance, data sync, or reporting jobs.
• Task Queues:
o For asynchronous background tasks triggered by application logic.
o Helps offload long-running processes from user requests.
• Usage Limits:
o GAE enforces quotas to prevent overuse and ensure fair resource allocation.
o Free tier available with limits on CPU, storage, bandwidth, etc.
o Ensures cost control and performance isolation between apps.
Summary
Google App Engine simplifies the deployment of scalable web applications using familiar
languages like Java and Python. It offers powerful features such as built-in data storage,
background processing, and access to Google services — all while managing infrastructure,
scaling, and cost control for you.
GFS was developed by Google to support the massive data needs of its search engine, designed
specifically for storing and processing huge volumes of data on cheap, unreliable hardware.
1. Design Motivations
2. Key Assumptions
• Single Master:
o Manages metadata (file names, block locations, leases).
o Simplifies cluster management but is a potential bottleneck.
• Chunk Servers:
o Store actual file data in large chunks (64 MB).
o Each chunk is replicated (default: 3 copies) across servers for fault tolerance.
• Clients:
o Communicate with the master for metadata.
o Access chunk servers directly for reading/writing data.
• Shadow Master:
o Mirrors the main master to recover from master failure.
• Replication:
o Each chunk is stored in at least three servers.
o Can tolerate two simultaneous failures.
• Checksum Verification:
o Each 64 KB sub-block has a checksum for data integrity.
• Fast Recovery:
o Masters and chunk servers restart in seconds.
• If errors occur: Client retries steps 3–7 or restarts the entire process.
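The per-sub-block checksum idea mentioned above can be sketched in a few lines of Python, with zlib.crc32 standing in for whatever checksum GFS actually uses:

import zlib

SUB_BLOCK = 64 * 1024   # 64 KB sub-blocks, as described above

def checksums(data):
    return [zlib.crc32(data[i:i + SUB_BLOCK]) for i in range(0, len(data), SUB_BLOCK)]

def verify(data, stored):
    # a mismatch marks a sub-block as corrupt; the client would then read another replica
    return checksums(data) == stored

chunk = b"x" * (3 * SUB_BLOCK + 10)
saved = checksums(chunk)
print(verify(chunk, saved))                          # True
print(verify(chunk[:-1] + b"y", saved))              # False: last sub-block corrupted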
8. Advantages
Summary
GFS is a groundbreaking distributed file system tailored for Google’s massive data needs. It
breaks away from traditional designs to emphasize fault tolerance, high throughput, and
scalability on commodity infrastructure, making it a key foundation for systems like
MapReduce.
6.3.3 Big Table, Google’s NOSQL System
• Commercial databases can't handle Google's massive scale and performance needs.
• Needed a custom-built system for:
o Billions of records (e.g., URLs).
o High user activity (e.g., thousands of queries/sec).
o Huge data sizes (e.g., >100 TB of geographic data).
4. Conceptual View
• Thousands of servers.
• Terabytes of in-memory data.
• Petabytes of data on disk.
• Self-managing:
o Dynamic server addition/removal.
o Automatic load balancing.
• Used in Google since 2004.
• One BigTable cell can manage ~200 TB across thousands of machines.
Component Function
GFS (Google File System) Stores persistent data
Scheduler Manages job scheduling for BigTable operations
Lock Service (Chubby) Handles master node elections and service coordination
MapReduce Used for reading/writing and bulk operations on BigTable
7. Summary
BigTable’s data model is simplified yet powerful, designed to handle large-scale, structured
and semi-structured data, such as web pages, user data, and media content.
Data mapping:
(row: string, column: string, timestamp: int64) → string (cell value)
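In plain Python terms the same mapping is just a dictionary keyed by (row, column, timestamp); the webtable-style rows and values below are only an illustration:

table = {
    ("com.cnn.www", "contents:", 1000): "<html>...version 1...</html>",
    ("com.cnn.www", "contents:", 2000): "<html>...version 2...</html>",
    ("com.cnn.www", "anchor:cnnsi.com", 2000): "CNN",
}

def latest(row, column):
    # pick the value with the highest timestamp for this (row, column)
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

print(latest("com.cnn.www", "contents:"))   # "<html>...version 2...</html>"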
o Different content versions of the web page.
o Anchor text and links on the page.
• Multiple versions (with timestamps) are stored in the same cell.
3. Key Features
• BigTable Master:
o Manages metadata and tablet assignments.
o Makes decisions about load balancing.
• Tablet Servers:
o Store and serve tablets to clients.
• Clients:
o Use a BigTable client library to communicate with master and tablet servers.
• Chubby (Distributed Lock Service):
o Handles master election, metadata consistency, and synchronization.
Summary
BigTable’s model supports high scalability, efficient large-scale data access, and fault tolerance.
It uses a simple yet powerful key-value system enhanced with time and column structure,
making it ideal for applications like search indexing, web crawling, user data storage, and
media metadata.
BigTable uses a three-level hierarchy to locate tablets, ensuring fast and reliable access to data.
Key Features:
Key Functions:
Architecture (Figure 6.22):
Summary
Amazon EC2 (Elastic Compute Cloud) is a key component of Amazon Web Services (AWS)
that allows users to rent virtual machines (VMs) for running applications. It was the first cloud
service to offer VM-based application hosting, pioneering the Infrastructure-as-a-Service
(IaaS) model.
1. Core Features
3. EC2 Instance Classes (Table 6.13)
Class Purpose
1. Standard: General-purpose usage.
2. Micro: Low-throughput tasks with occasional CPU bursts.
3. High-Memory: Suitable for memory-intensive apps like databases.
4. High-CPU: Ideal for compute-heavy tasks (e.g., simulations).
5. Cluster Compute: HPC and network-intensive workloads using high-speed networking (10 Gbps).
4. Cost Considerations
• EC2 can be used to power a range of apps, from simple websites to complex enterprise
solutions.
• Real-world usage often involves:
o Running databases, web servers, or data processing jobs.
o Scaling up/down based on traffic or computational needs.
Summary
Amazon EC2 offers a flexible, scalable, and cost-efficient platform for hosting applications in the cloud. It supports a wide range of instance types to suit different workloads and allows users full control over their virtual machine instances.
Image Type: AMI Definition

Private AMI: Images created by you, which are private by default. You can grant access to other users to launch your private images.
Public AMI: Images created by users and released to the AWS community, so anyone can launch instances based on them and use them any way they like. AWS lists all public images at http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=171.
Paid AMI: You can create images providing specific functions that can be launched by anyone willing to pay you per each hour of usage on top of Amazon's charges.
Table 6.13 Instance Types Available on Amazon EC2 (October 6, 2010)

Compute Instance | Memory (GB) | ECU or EC2 Compute Units | Virtual Cores | Storage (GB) | 32/64 Bit
Standard: small | 1.7 | 1 | 1 | 160 | 32
Standard: large | 7.5 | 4 | 2 | 850 | 64
Standard: extra large | 15 | 8 | 4 | 1690 | 64
Micro | 0.613 | Up to 2 | - | EBS only | 32 or 64
High-memory | 17.1 | 6.5 | 2 | 420 | 64
High-memory: double | 34.2 | 13 | 4 | 850 | 64
High-memory: quadruple | 68.4 | 26 | 8 | 1690 | 64
High-CPU: medium | 1.7 | 5 | 2 | 350 | 32
High-CPU: extra large | 7 | 20 | 8 | 1690 | 64
Cluster compute | 23 | 33.5 | 8 | 1690 | 64
Amazon S3 is a web-based object storage service that allows users to store and retrieve any
amount of data, anytime, from anywhere via web protocols.
Core Concepts
• Object Storage: Each object contains data, metadata, and access control, and is stored in
a bucket.
• Key-Value Access: Each object is accessed via a unique key.
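Today this bucket/key model is usually exercised from Python with boto3, the current AWS SDK (which post-dates this text); the bucket name below is a placeholder, and the call needs valid AWS credentials:

import boto3   # pip install boto3

s3 = boto3.client("s3")

s3.put_object(Bucket="my-example-bucket",           # bucket = container for objects
              Key="reports/2024/summary.txt",       # key uniquely identifies the object
              Body=b"hello S3")                     # object data; metadata/ACLs can be set here too

obj = s3.get_object(Bucket="my-example-bucket", Key="reports/2024/summary.txt")
print(obj["Body"].read())                           # b'hello S3'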
Access Interfaces
Key Features
EBS offers block-level storage volumes for use with EC2 instances. It's akin to attaching a
virtual hard disk to a virtual machine.
Key Features
• Persistence: Unlike EC2 instance storage, data is retained after the instance is stopped.
• Block Device Interface:
o Volumes from 1 GB to 1 TB.
o Can be formatted with a file system or used directly.
• Multiple Volumes: Can be mounted on the same EC2 instance.
• Snapshots: Incremental backups improve save/restore efficiency.
• Pricing (as of 2010):
o $0.10 per GB/month for storage.
o $0.10 per million I/O requests.
Data Model (Amazon SimpleDB)
• Domain = Table
• Item = Row
• Attribute = Column
• Value = Cell (can have multiple values per attribute)
• No strict schema or ACID transactions; eventual consistency model.
Use Case
Comparison Summary
Azure is Microsoft’s cloud platform offering virtualized compute, storage, and database
services. It supports scalable application hosting through a role-based architecture and
integrates a range of storage models.
Types of Roles
1. Web Role:
o Customized VM for hosting web applications (via IIS).
2. Worker Role:
o For general-purpose background processing.
Lifecycle Methods:
Debugging:
2. SQL Azure (6.4.4.1)
3. Blob Storage:
4. Azure Tables:
NoSQL key-value store suitable for metadata and scalable structured data.
• Each entity (row) has:
o Up to 255 properties.
o PartitionKey: Groups entities for optimized access.
o RowKey: Unique identifier for each entity.
• Max entity size: 1 MB (use blob links for larger data).
• Query Support: ADO.NET, LINQ.
5. Azure Queues
Summary Table
This section introduces open-source and research-oriented cloud platforms and tools
designed to support cloud programming, VM management, storage, and data processing
across diverse infrastructures.
Eucalyptus
Nimbus
Key Features
• Client Support: Works with EC2 clients, uses Java Jets3t, boto.
• Resource Management:
o Resource Pool Mode: Direct control of VM nodes.
o Pilot Mode: Works with local LRMS for VM provisioning.
OpenNebula
• Manages full VM lifecycle and dynamic networking.
• Uses libvirt API, CLI, and cloud drivers for hybrid clouds (e.g., Amazon EC2,
ElasticHosts).
• Supports:
o VM migration, snapshots.
o Capacity scheduler with rank/requirement model.
o Image repository for disk image management.
Sector/Sphere
Sector (Storage)
Sphere (Processing)
Space: Column-based table storage engine in Sector/Sphere, supporting a limited SQL subset.
OpenStack
▪ Addressing Node (DHCP)
OpenStack Storage (Swift)
• Includes:
o Proxy Server: Routes data requests.
o Ring: Maps entity names to physical locations.
o Object, Container, Account Servers.
• Supports replication, failure isolation, and heterogeneous storage.
• Objects stored as binary files with extended attributes.
Aneka is a cloud platform for developing and running parallel and distributed applications, built
on .NET but supports Linux via Mono.
Key Capabilities
1. Build:
o SDK with APIs for app development.
o Deploy on private, public, or hybrid clouds.
2. Accelerate:
o Rapid deployment on multiple runtime environments.
o Dynamically lease public cloud resources to meet QoS/SLA deadlines.
3. Manage:
o GUI + API-based infrastructure monitoring.
o Includes accounting, SLA tracking, and dynamic provisioning.
Comparison Snapshot
Aneka Cloud Programming Platform: multi-model programming, SLA-aware scaling
Aneka is a cloud computing platform that provides a flexible environment for running distributed
applications. It works by connecting multiple physical or virtual machines, each running a
lightweight software called the Aneka container. This container manages the services on each
machine and interacts with the operating system through a component called the Platform
Abstraction Layer (PAL). PAL hides the differences between various operating systems and
helps perform system tasks like monitoring performance and ensuring quality of service.
Aneka’s architecture is made up of three types of services. First are Fabric Services, which
handle core infrastructure tasks like monitoring hardware, managing nodes, and ensuring system
reliability. Second are Foundation Services, which add useful features like storage management,
accounting, billing, and resource reservation—helping both system administrators and
developers. Lastly, Application Services provide the environment needed to run applications,
using the other services to handle tasks like data transfer and performance tracking. Aneka can
run different types of application models like distributed threads, bag of tasks, and MapReduce.
It supports easy customization and expansion through its SDK and Spring framework, allowing
developers to add new features quickly.
GoFront Group, a leading Chinese manufacturer of railway equipment, needed to render high-
quality 3D images for the design of high-speed trains and urban transport vehicles using
Autodesk Maya software. Rendering these complex designs on a single 4-core server used to
take about three days, especially since each scene included over 2,000 frames with multiple
camera angles. To solve this problem and speed up the rendering process, GoFront used
Aneka, a cloud software platform, to build a private enterprise cloud by connecting the PCs
within their company network. They used a tool called Aneka Design Explorer, which helps run
the same program many times on different data sets—perfect for rendering many frames in
parallel. A custom interface called Maya GUI was developed to integrate Maya with Aneka,
allowing users to manage parameters, submit rendering tasks, monitor progress, and collect the
results. With just 20 PCs running Aneka, GoFront was able to reduce the rendering time from
three days to just three hours, dramatically improving productivity and design turnaround
time.
Virtual appliances are specially prepared virtual machines (VMs) that contain everything
needed to run an application, including the operating system, libraries, and setup files. These
VMs are designed to run out-of-the-box, meaning they are preconfigured and ready to use
immediately once started. Aneka uses virtual appliances to make application deployment easier,
especially across large, mixed computing environments.
The use of VM technology (like VMware, VirtualBox, or Xen) allows Aneka to create virtual
clusters that are consistent and easy to manage. These virtual appliances reduce software
compatibility issues since the entire software stack is already inside the appliance. Even if the
underlying hardware or operating systems differ, the application runs the same way. This
approach is especially helpful in grid computing, where systems are spread across different
networks and may face issues like firewalls or NAT. Virtual appliances help simplify
deployment and ensure smooth performance in such distributed environments.