
Storage : Introduction

▪ Data : A collection of raw facts


▪ Types of Data :
▪ Structured Data (Organized in rows and Columns and typically stored using say DBMS)
▪ Unstructured Data (Cannot be organized and stored in rows and columns and difficult to
uniquely id and retrieve and typically stored in object stores)
▪ The data explosion of the last decade leads to a prediction of 572 zettabytes (10²¹ bytes) by 2030,
expected to reach 50,000 zettabytes by 2050
▪ Storage systems (and we are looking at external storage systems) will need to support this growth.
▪ These storage systems are characterized by their
▪ Cost
▪ Speed (access time) or performance
▪ Availability
▪ Scalability
▪ Management
▪ Reliability
CLOUD COMPUTING
Storage : Understanding Data Storage
▪ There are different types of storage based on the format in which the data is held, organized and
presented. These are
▪ File Storage : Organizes and represents data as a hierarchy of files in folders
▪ Data is stored as a single piece of information inside a folder, just like
you’d organize pieces of paper inside a manila folder. When you need
to access that piece of data, your computer needs to know the path
to find it
▪ Data stored in files is organized and retrieved using a limited amount of metadata that tells the
computer exactly where the file itself is kept.
▪ Network Attached Storage (NAS) and Direct Attached Storage (DAS) are examples of this type of storage.
▪ Scaling with file storage is typically scale-out, as there is a limit to how much capacity can be added to a single system
https://www.redhat.com/en/topics/data-storage/file-block-object-storage
CLOUD COMPUTING
Storage : Understanding Data Storage (2)
▪ Block Storage : block storage chunks data into arbitrarily organized, evenly sized volumes
▪ Data is chopped into blocks, with each block given a unique identifier, allowing the
storage system to place the data wherever convenient
▪ Block storage is often configured to decouple the data from the user’s
environment and when data is requested, the underlying storage software
reassembles the blocks of data from these environments and presents them back to the user
▪ Usually used with SAN environments
▪ It can be retrieved quickly and can be accessed by any environment
▪ It’s an efficient and reliable way to store data and is easy to use and manage.
▪ It works well with enterprises performing big transactions and those that deploy huge databases,
meaning the more data you need to store, the better off you’ll be with block storage
▪ Block storage can be expensive and it has limited capability to handle metadata, which means metadata
needs to be dealt with at the application or database level
CLOUD COMPUTING
Storage : Understanding Data Storage (3)
▪ Object Storage : manages data and links it to associated metadata
▪ Object storage, also known as object-based storage, is a flat structure with
the data broken into discrete units called objects and is kept in a single
repository (instead of being kept as files in folders or as blocks)
▪ Object storage volumes work as modular units:
▪ Each is a self-contained repository that owns the data, a unique identifier
that allows the object to be found over a distributed system, and the
metadata that describes the data.
▪ Metadata is important and includes details at two levels: system details such as age,
privacy/security settings, and access contingencies, as well as custom
information about the object (the data) itself.
▪ To retrieve the data, the storage operating system uses the metadata and identifiers, which
distributes the load better and lets administrators apply policies that perform more robust
searches.
CLOUD COMPUTING
Storage Architectures to support these
Different ways of connecting storage devices and servers in a network lead to different storage architectures
Directly Attached Storage (DAS)
Network Attached Storage (NAS)
Storage Area Network (SAN)

(Figure: DAS, NAS and SAN topologies; a SAN adds additional ports and a network switch between servers and storage.)
CLOUD COMPUTING
Evolution of the Storage Systems influenced by Architecture
CLOUD COMPUTING
Typical Architecture of a Storage Device/Subsystem
▪ Figure shows a controller between the disks and
the ports
▪ External Ports are extended to disks through
Internal IO Channels.
▪ Controller functions help with
▪ Increasing data availability and data access
performance through RAID
▪ Uses caches to speedup read and write
access to the server
▪ Most reasonable disk subsystems would contain
▪ Redundant controllers
▪ Significant cache
▪ Storage disks which can support petabytes of
capacity
▪ Could weigh over a ton and be the size of a
large wardrobe
▪ Consolidated disks which provide better
utilization of disks
CLOUD COMPUTING
RAID – Redundant Array of Independent Disks
RAID is a data storage virtualization technology that combines multiple
physical disk drive components into one or more logical units for the purposes
of data redundancy and performance improvement.
Data is distributed across the drives in one of several ways, referred to as RAID
levels, depending on the required level of redundancy and performance
Eg. RAID 0, RAID 1, RAID 4, ….
(Figure: a RAID 0 array of disks behind a SCSI/FC interface.)

CLOUD COMPUTING
Typical storage devices within these disk subsystems:
CLOUD COMPUTING
Magnetic Disk Structure and Disk Latency
Reading or writing a block involves three steps
1. The disk controller positions the head assembly
at the cylinder containing the track on which
the block is located. The time to do so is the
seek time
2. The disk controller waits while the first sector
of the block moves under the head. This time is
called the rotational latency.
3. All the sectors and the gaps between them
pass under the head, while the disk controller
reads or writes data in these sectors. This delay
is called the transfer time
Disk Latency = Seek time + Rotational latency + Transfer time
Explain the above for a newbie, in depth, using the above words and sentences

Of course! I’ll explain your notes clearly and in depth, using simple words, while keeping your structure and sentences mostly intact so it’s beginner-friendly
but still covers everything.

📦 Storage : Introduction
Data:
Data is just a collection of raw facts. For example, numbers, names, or any unprocessed information are all data.
Types of Data:
Structured Data:
This data is organized neatly in rows and columns, just like an Excel sheet or a database (DBMS - Database Management System). It's easy to
search and retrieve.
Unstructured Data:
This data cannot be organized in rows and columns. Things like videos, photos, or emails. They are harder to manage and usually stored in

object storage systems.
Data Explosion: In the last 10 years, the amount of data has grown massively.
It is predicted that there will be 572 zettabytes (1 ZB = 10²¹ bytes!) by 2030, and by 2050, around 50,000 zettabytes.
So, we need better and bigger storage systems.
Storage Systems (external storage systems): Good storage systems must have:
Cost-effectiveness (Should not be too expensive)
Speed (Fast access, better performance)
Availability (Always ready and available)
Scalability (Able to grow easily as data grows)
Management (Easy to organize and control)

Reliability (Safe and trustworthy storage)

☁️ Cloud Computing: Understanding Data Storage


Storage can be categorized based on how the data is stored, organized, and presented.

📁 File Storage:
Data is stored as files inside folders, like how you organize papers inside a filing cabinet.
To find a file, the computer follows a path (folder > subfolder > file).
It uses minimal metadata (basic info about the file like name, location) to find it.

Examples:
NAS (Network Attached Storage)
DAS (Direct Attached Storage)
Scaling:
File storage typically scales out (adding more devices) because there's a limit to how much you can expand a single system.

🔲 Block Storage:
Data is chopped into smaller pieces called "blocks".
Each block has a unique ID.
The storage system places blocks anywhere it finds convenient.
When you need data, the system reassembles the blocks and gives you the whole file.
Used in:
SAN (Storage Area Networks).
Advantages:
Fast access

Works well with large transactions and big databases


Efficient and reliable
Easy to manage
Disadvantages:
Can be expensive
Limited metadata, meaning the app or database must track extra details.

🗂️ Object Storage:
Data is stored as objects.
An object includes:
The data itself
Metadata (detailed information about the data)
Unique ID (to find it)

Structure:
It’s a flat structure, unlike folders. Everything is stored together in one big storage space.
Metadata Importance:
Basic details (like age, security rules)
Custom details about the data itself
Helps to search faster and manage access better
Examples:
Cloud storage systems like Amazon S3 use object storage.

🛠️ Storage Architectures
Different ways to connect storage devices to servers lead to different architectures:

DAS (Directly Attached Storage):


Storage directly connected to a server.
NAS (Network Attached Storage):
Storage connected over a network.
SAN (Storage Area Network):
A special high-speed network for storage.

🏛️ Evolution of Storage Systems


As new architectures (DAS, NAS, SAN) came, storage systems evolved too — they became faster, more scalable, and smarter.

📦 Typical Architecture of a Storage Subsystem


Controller:
Sits between the disks and ports. It manages how data flows in and out.

Internal IO Channels:
Connect ports to disks inside the storage.
Controller functions:
RAID: Increases data availability and performance by organizing disks cleverly.
Cache: Temporary memory to speed up reads and writes.
Good storage systems will have:

Redundant Controllers (Backup in case one fails)


Large Caches (Faster access)
Lots of Storage Disks (Can hold petabytes of data — that's millions of GB!)
Consolidated Disks (Better disk usage and efficiency)
They can be huge, weighing over a ton, like a giant wardrobe!

🛡️ RAID - Redundant Array of Independent Disks


RAID is a technology that combines multiple disks into one logical unit.
Purpose:
Data Redundancy (Backup)
Performance Improvement
How it works: Data is distributed among multiple drives in different ways, called RAID levels like RAID 0, RAID 1, RAID 4, etc.
Example:

RAID 0: Splits (stripes) data between two disks for faster performance (but no redundancy).
RAID 1: Copies (mirrors) the same data onto two disks (redundancy, but the usable capacity of only one disk).
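
To make the RAID idea concrete, here is a tiny Python sketch (not how a real RAID controller is implemented, just the placement logic) showing how RAID 0 stripes blocks across disks while RAID 1 mirrors them:

```python
# Toy sketch of RAID 0 (striping) and RAID 1 (mirroring).
# Disks are modeled as Python lists of fixed-size blocks; a real RAID
# controller works on raw block devices, so this only shows the placement logic.

def raid0_write(disks, blocks):
    """Stripe blocks round-robin across the disks (no redundancy)."""
    for i, block in enumerate(blocks):
        disks[i % len(disks)].append(block)

def raid1_write(disks, blocks):
    """Mirror every block onto every disk (redundancy, capacity of one disk)."""
    for block in blocks:
        for disk in disks:
            disk.append(block)

if __name__ == "__main__":
    data = [f"block{i}" for i in range(4)]

    stripe_set = [[], []]          # two empty "disks"
    raid0_write(stripe_set, data)
    print("RAID 0:", stripe_set)   # [['block0', 'block2'], ['block1', 'block3']]

    mirror_set = [[], []]
    raid1_write(mirror_set, data)
    print("RAID 1:", mirror_set)   # both disks hold all four blocks
```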

💿 Magnetic Disk Structure and Disk Latency


Reading or writing data on a disk involves 3 steps:

1. Seek Time:
The disk controller moves the read/write head to the correct cylinder (track location).

2. Rotational Latency:
Wait for the disk to rotate so the right sector (part of the track) comes under the head.

3. Transfer Time:
Actually read or write the data as the disk spins.

Disk Latency = Seek Time + Rotational Latency + Transfer Time
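
A small worked example of this formula, using made-up drive parameters (the 4 ms average seek, 7,200 RPM spindle and 100 MB/s transfer rate are assumptions for illustration, not values from the notes):

```python
# Worked example of Disk Latency = Seek Time + Rotational Latency + Transfer Time.
# The drive parameters below are illustrative assumptions, not values from the notes.

avg_seek_s = 0.004                 # 4 ms average seek time
rpm = 7200                         # spindle speed
transfer_rate = 100 * 10**6        # 100 MB/s sustained transfer rate
block_size = 64 * 1024             # read one 64 KiB block

rotation_s = 60 / rpm                        # one full rotation: ~8.33 ms
avg_rotational_latency_s = rotation_s / 2    # on average, wait half a rotation
transfer_s = block_size / transfer_rate

latency_s = avg_seek_s + avg_rotational_latency_s + transfer_s
print(f"seek={avg_seek_s*1e3:.2f} ms, rotation={avg_rotational_latency_s*1e3:.2f} ms, "
      f"transfer={transfer_s*1e3:.2f} ms, total={latency_s*1e3:.2f} ms")
```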


What is Cloud Storage : Complex service, simple storage


Consider the files which you see regularly: the operating system sits between the user (who works with
variable-size files: read, write, append, move, rename, lock, unlock, ...) and the disk (which only offers
fixed-size blocks: read, write)
• PC users see a rich, powerful interface
• Hierarchical namespace (directories); can move, rename,
append to, truncate, (de)compress, view, delete files, ...
• But the actual storage device is very simple
• HDD only knows how to read and write fixed-size data blocks
• Translation done by the operating system
CLOUD COMPUTING
What is Cloud Storage : Analogy to cloud storage
(Figure: shopping carts, friend lists, user accounts, profiles, ... implemented by a web service on top of a
key/value store with read, write and delete operations)
• Many cloud services have a similar structure
• Users see a rich interface (shopping carts, product
categories, searchable index, recommendations, ...)
• But the actual storage service is very simple
• Read/write 'blocks', similar to a giant hard disk
• Translation done by the web service
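
A minimal sketch of that "simple storage behind a rich service" idea: a dict-backed key/value store with only read, write and delete, on top of which the web service would build carts, friend lists and so on. The class and keys here are purely illustrative.

```python
# Minimal key/value store with the three operations the slide mentions:
# read, write, delete. A real cloud store distributes and replicates the data;
# here a plain dict stands in for the "giant hard disk".

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def write(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def read(self, key: str) -> bytes:
        return self._data[key]

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

# The web service layers richer concepts on top of these primitives,
# e.g. a shopping cart stored under a per-user key:
store = KeyValueStore()
store.write("cart/user42", b'["book", "laptop"]')
print(store.read("cart/user42"))
store.delete("cart/user42")
```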
CLOUD COMPUTING
Cloud Storage
• We have seen that applications can be thrown into the cloud, and we don't care what server the application
lands on. When this application needs to store (write) or read data, it must be able to
access that data seamlessly. The data should also remain accessible if the VM migrates.
• Cloud storage provides the ability for an application running somewhere to save data and files in an off-site
location and to access them either through the public internet or a dedicated private network connection,
from anywhere.
• Storage virtualization helps support the access, utilization, availability and other features needed by
the applications.
• Terminology like private cloud, public cloud, hybrid cloud, internal cloud and external cloud can be applied to
the storage based on where the storage is located, but ultimately it is just storage systems having the
capability of a virtual storage pool and multi-tenancy.
• Cloud Storage infrastructure includes the hardware and software cloud components. Object based storage is
the prominent approach and access to the infrastructure is via web services API
CLOUD COMPUTING
What is Cloud Storage : Cloud Storage
• Cloud storage is a data storage service model in which data is maintained, managed, and backed
up remotely and made available to users over a network (typically the internet)
• These cloud storage providers are responsible for keeping the data available and accessible, and
the physical environment, protected and running.
• People and organizations buy or lease storage capacity from the providers to store user,
organization, or application data.
• Cloud Storage or cloud-enabled storage can also be visualized as a virtual storage pool
• Object storage services like Amazon S3 and Microsoft Azure Storage, object storage software like
Openstack Swift, object storage systems like EMC Atmos, EMC ECS and Hitachi Content Platform,

and distributed storage research projects like OceanStore and VISION Cloud are all examples of
storage that can be hosted and deployed with cloud storage characteristics.
(Figure: the client's web browser connects via an HTTP server (which could be load balanced) to the cloud
storage provider's web servers and storage servers.)
Organization of Data Storage in the Cloud – Cloud Storage Service
CLOUD COMPUTING
CLOUD COMPUTING
E.g. Architecture of a cloud storage platform
• User access layer: an authorized user can log into the cloud storage
platform from any location via a standard public application interface
and access cloud storage
• Data service layer: deals directly with users; depending
on user demands, different application interfaces can be
developed to provide services such as data storage, space
leasing, public resources, multi-user data sharing, or data backup
• Data management layer: provides the upper layer with a unified
public management interface for the different services, with
functions such as user management, security management,
replica management, and strategy management
• Data storage layer: data stored in the system forms a massive pool
and needs to be organized
CLOUD COMPUTING
Cloud storage : Cloud Storage Service
(Figure: client connecting to the cloud storage provider's web server and storage server.)
Pros:
• Lower cost
• Easier management
• Enables sharing and access from anywhere
Cons:
• Loss of control
• No guarantees of data availability
• Provider failures
CLOUD COMPUTING
What Constitutes Cloud Storage
1. Dramatic reduction in TCO
• Cuts storage cost by more than 10 times compared to block or file storage
2. Unlimited scalability
• Since built using distributed technologies, has unlimited scalability
• Seamlessly add or remove storage systems from the pool
3. Elasticity
• Storage virtualization decouples and abstracts the storage pool from its physical implementation,
so we get a virtual, elastic (grows and shrinks as required) and unified storage pool
4. On-Demand
• Uses a pay-as-you-go model, where you pay only for the data stored and the data accessed. For
a private cloud, there is a minimal cluster to start with, beyond which it is on-demand.
• This can result in huge cost savings for the storage user.
CLOUD COMPUTING
What Constitutes Cloud Storage
5. Universal Access
• Traditional storage has limitations; for example, with block storage the server needs to be on
the same SAN. Cloud storage offers flexibility in the number of users and in
where they can access the storage from.

6. Multitenancy
• Cloud Storage is typically multi-tenant and supports centralized management, higher
storage utilization and lower costs
7. Data durability and availability
• Runs on commodity hardware but is still highly available even with partial failures of the
storage system, with a software layer providing the availability
8. Usability
9. Disaster Recovery
CLOUD COMPUTING
Cloud Storage Enablers
• Storage, like compute resources, is enabled for the cloud by virtualization
• Storage virtualization could be implemented in hardware or software.
• Storage virtualization could also be implemented in the Server, in the storage device and
in the network carrying the data
• Techniques and components like File Systems, Volume Manager, Logical Volume
Managers enable Storage virtualization in the Server
• Techniques like RAID and logical volume management are also used for storage
virtualization in the storage device
• We will discuss a few of these enablers now and discuss the Storage Virtualization itself
in the next session.
CLOUD COMPUTING
Enablers for Storage Virtualization – File Systems
▪ Data is separated and grouped into pieces
and given a name called a file
▪ The structure and logic rules used to
manage the groups of information (files)
and their names is what forms a file system
▪ File system is used to control how data is
stored and retrieved.
▪ Without a file system, information placed in a storage medium would be one large body of data
with no way to tell where one piece of information stops and the next begins.
(A block is the basic unit of storage for I/O (read/write) operations.)
CLOUD COMPUTING
Enablers for Storage Virtualization – File Systems
▪ Local file systems could be seen in this fashion
▪ File systems
▪ Can be used on numerous different types of storage devices that
uses different kinds of media
▪ There are also many different kinds of file systems.
▪ Some file systems are used on local data storage devices and others provide
file access via a network interface; the file system is responsible for arranging
storage space
▪ Network file systems make local files and directories available over
the LAN. Several end users can thus work on common files
▪ Network File Systems supports applications to share and access files remotely
from various computers
▪ NFS functionality (on a typical server dedicated to hosting files and directories and making
them available across the network) is provided by a daemon process called nfsd.
The server administrator exports the directories and advertises them in a configuration file.
▪ An NFS client (end user) requests access to the exported directories with the mount command
▪ Once mounted, it is transparent where the data is actually being accessed from, and access can be controlled by permissions
CLOUD COMPUTING
Enablers for Storage Virtualization– Logical Volume Manager (LVM)
▪ Independent layer between the File System and the disk
drives
▪ You could create partitions on the physical disk and create
physical volumes.
▪ These physical volumes could be grouped into a volume
group
▪ This volume group could be broken up into logical volumes
▪ File system can be created on these Logical volumes and
mounted
LVM

https://opensource.com/business/16/9/linux-users-guide-lvm
CLOUD COMPUTING
Enablers for Storage Virtualization
Thin Provisioning or Virtual Provisioning
 Capacity-on-demand from a shared storage pool
 Logical units presented to hosts have more capacity than physically allocated
(The physical resources are thinly or virtually provisioned)
 Physical storage is allocated only when the host requires it
 Provisioning decisions not bound by currently available storage
(Figure: the host sees the full reported capacity, while only some extents are actually allocated from the shared storage pool.)
Storage perceived by the application is larger than physically allocated storage
Explain the above for a newbie, in depth, using the above words and sentences

Alright! Let’s go through all of this carefully, simply, and step-by-step, exactly using your words and notes, but explained like you’re new to this topic.

What is Cloud Storage: Complex Service, Simple Storage


First, think about how your computer stores files.

You see a beautiful interface where you can:


Read files
Write to files
Rename files
Move files
Lock/unlock files
Organize files into folders (directories)
It looks easy and powerful to you.

BUT, the real storage device (like your hard disk (HDD)) is very simple inside.

It can only read and write fixed-size blocks of data (small chunks).
Your Operating System (like Windows, Linux) does the hard work of translating your folders, file names, moving, locking into small reads and writes of
fixed-size blocks.

Analogy to Cloud Storage


Now, think about cloud services like:

Shopping carts (on Amazon)


Friend lists (on Facebook)

User accounts (on any app)

When you use them:

You see a rich and powerful interface.

But in reality, in the backend, the system only does simple things:
Read, write, delete small chunks of data (just like a hard drive).

Web services (the cloud) are doing the translation work.

➡️ So Cloud Storage is like a giant hard disk in the sky.


➡️ The web services make it look fancy to you!
Cloud Storage
Cloud Storage is where applications can save or read files from somewhere far away (off-site) through:

The Internet (public)

Or a private network (secure)

Even if your app moves from one server to another, the data stays available.

Storage Virtualization makes it possible.

This hides the details of the physical disks and gives you a virtual pool of storage.
Private Cloud, Public Cloud, Hybrid Cloud refer to where and how this storage is placed:
Private = inside company
Public = AWS, Google Cloud, etc.

Hybrid = mix of both.


Object Storage is the main way this storage is organized today.
Access is usually via APIs (web interfaces).

Architecture of Cloud Storage


Imagine a basic structure:

1. User Access Layer:


You log in from anywhere.
You interact with storage easily.

2. Data Service Layer:


Takes your commands and performs:
Storing files
Sharing files

Backing up data
Renting storage space

3. Data Management:
Controls who can access what.

Handles security, replication (copying), backup policies.


4. Data Storage Layer:
The actual disks, where massive pools of data are physically stored.

Pros and Cons of Cloud Storage


Pros:

Very cheap (Lower cost).


Very easy to manage.
Access your data from anywhere!

Cons:

You lose control of your data (someone else is managing the disks).
Sometimes providers might fail or crash (risk).

What Constitutes (makes up) Cloud Storage?


1. Dramatic reduction in TCO (Total Cost of Ownership):

Much cheaper than traditional storage.


2. Unlimited scalability:
You can add/remove storage anytime.

No limits because it's distributed across many servers.


3. Elasticity:

The storage can grow or shrink as needed.
No fixed size limits.
4. On-Demand:

Pay only for what you use (storage and access).


Huge cost savings.
5. Universal Access:

You can access your cloud storage from any device anywhere.
6. Multitenancy:
Multiple users can share the same infrastructure securely.
Better utilization and lower costs.

7. Data Durability and Availability:


Even if a few disks fail, your data is safe because it’s copied (replicated).
8. Usability & Disaster Recovery:

Easy to use.
Automatic backups protect against disasters.

Cloud Storage Enablers (What Makes Cloud Storage Possible)


The magic behind cloud storage is Virtualization — making physical storage act like a virtual pool.

Storage virtualization can happen in:

Server
Storage Device
Network

Important tools used:

File Systems

Breaks storage into pieces called files.


Names these files and organizes them.
Without a filesystem, your storage would just be a giant mess.

Example: NTFS, ext4.

Local filesystems manage your local disk.


Network File Systems (like NFS) allow you to share files across different computers over a network.
Process:

Server exports a folder.


Client mounts it.
Now users can access the remote files as if they are local.

Logical Volume Manager (LVM)

Problem: If you work directly on disks, resizing and managing them is hard.

Solution: Use LVM:

Divide your physical disk into partitions ➔ physical volumes.


Group physical volumes into a volume group.
Split the group into logical volumes.

Create a filesystem on logical volumes.

➡️ Much easier to grow, shrink, and manage storage!


Thin Provisioning (Virtual Provisioning)

Normal storage:

You allocate physical space immediately (even if you don't use it).

Thin Provisioning:

You pretend to give large space to users.

Only give real physical space when they actually use it.

Example:

You promise 100GB but only actually give 10GB now.

As users fill more, you add more space behind the scenes.

➡️ Smart, saves cost, and users feel they have more space than actually allocated!
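
A toy Python sketch of this idea (all numbers and names invented): volumes report a large size, but space is only taken from the shared pool when data is actually written.

```python
# Toy thin-provisioned pool: each volume reports a large size, but physical
# space is only consumed from the shared pool when data is actually written.
# Numbers and names are made up for illustration.

class ThinPool:
    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.used_gb = 0
        self.volumes = {}                       # name -> [reported_gb, written_gb]

    def create_volume(self, name, reported_gb):
        # Promise more than we may physically have right now.
        self.volumes[name] = [reported_gb, 0]

    def write(self, name, gb):
        reported, written = self.volumes[name]
        if written + gb > reported:
            raise ValueError("write exceeds the volume's reported size")
        if self.used_gb + gb > self.physical_gb:
            raise RuntimeError("pool exhausted: administrator must add physical capacity")
        self.volumes[name][1] = written + gb
        self.used_gb += gb

pool = ThinPool(physical_gb=50)
pool.create_volume("userA", reported_gb=100)    # userA "sees" 100 GB
pool.write("userA", 10)                         # only 10 GB physically consumed
print(pool.used_gb, pool.volumes)               # 10 {'userA': [100, 10]}
```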
🎯 Summary
Cloud Storage is just like a simple, massive hard drive in the sky.

It's managed by someone else (cloud provider).

You access it through fancy web interfaces and apps.


Inside, it’s simple blocks of data being read and written.
Virtualization and smart organization (file systems, LVM, thin provisioning) make it flexible, scalable, and cheap.


Storage –Virtualization
CLOUD COMPUTING
• Storage Virtualization is a means through which physical storage subsystems
(disks, tapes ..) are abstracted from the user’s application and presented as

logical entities, hiding the underlying complexity of the storage subsystems
and nature of access, network or changes to the physical devices.
• It is the process of aggregating the capacity of multiple storage devices
into storage pools
• Aggregates multiple resources as one addressable entity (pool) or divides a
resource to multiple addressable entities and enables easy provisioning of the right
storage for performance or cost.
• E.g. Single virtual large disks from multiple small disks or Many smaller virtual disks from a large disk
• Virtualization of storage helps achieve location independence by abstracting the physical location of the data. The
virtualization system/layer presents to the user a logical space for data storage and handles the mapping to the actual
physical location.
• Virtualization software is responsible for maintaining a consistent view of all of the mapping information and keeping
it consistent. This mapping information is often called metadata.
• Virtualization layer can be in H/W or in S/W
(Figure: the storage virtualization layer sits between the servers and the physical storage, presenting it as virtual storage.)
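
One way to picture the mapping metadata that the virtualization layer maintains is a table from logical block addresses to (physical device, physical block) pairs. The sketch below is a minimal illustration with invented device names, not any vendor's implementation:

```python
# Sketch of the metadata a block-virtualization layer keeps: a map from
# logical block addresses to (physical device, physical block). Device names
# and sizes are invented for illustration.

class VirtualDisk:
    def __init__(self, devices):
        # devices: dict of device name -> number of physical blocks available
        self.mapping = {}                       # logical block -> (device, physical block)
        self.free = [(dev, blk) for dev, size in devices.items() for blk in range(size)]

    def write(self, logical_block, data):
        if logical_block not in self.mapping:
            self.mapping[logical_block] = self.free.pop(0)   # allocate on first write
        device, physical_block = self.mapping[logical_block]
        print(f"logical {logical_block} -> {device}:{physical_block} <= {data!r}")

vd = VirtualDisk({"diskA": 2, "diskB": 2})      # two small disks pooled into one virtual disk
vd.write(0, b"hello")
vd.write(1, b"world")
vd.write(7, b"sparse logical address still maps to a physical block")
```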
CLOUD COMPUTING
Storage – Categories of Storage Virtualization
1. File level Virtualization
• A file system virtualization provides an abstraction of a file system
to the application (with a standard file-serving protocol interface
such as NFS or CIFS) and manages changes to distributed storage
hardware underneath the file system implementation.
• Eliminates the dependencies between the data accessed at the
file level and the location where the files are physically stored.
• Virtualization layer manages files, directories or file systems
across multiple servers and allows administrators to present
users with a single logical file system.
• A typical implementation of a virtualized file system is as a
network file system that supports sharing of files over a
standard protocol with one or more file servers enabling

access to individual files.
• File-serving protocols that are typically employed are NFS,
CIFS and Web interfaces such as HTTP or WebDAV
Adv: This provides opportunities to optimize storage usage,
server consolidation & non-disruptive file migrations.
CLOUD COMPUTING
Storage : 1. File Level Virtualization
Distributed File System
• A distributed file system (DFS) is a network file system wherein the file system is
distributed across multiple servers.
• DFS enables location transparency and file directory replication as well as tolerance to
faults.
• Some implementations may also cache recently accessed disk blocks for improved
performance. Though distribution of file content increases performance considerably,
efficient management of metadata is crucial for overall file system performance
• Two important techniques for managing metadata for highly scalable file virtualization:
a. Separate data from metadata with a centralized metadata server (used in Lustre)
b. Distribute data and metadata on multiple servers (used in Gluster)
CLOUD COMPUTING
Storage : 1. File Level Virtualization – Distributed File System
Distributed File Systems with Centralized Metadata
• A centralized metadata management scheme achieves scalable DFS with a dedicated
metadata server to which all metadata operations performed by clients are directed.
• Lock-based synchronization is used in every read or write operation from the clients.
• In centralized metadata systems, the metadata server can become a bottleneck if there are
too many metadata operations.
• For workloads with large files, centralized metadata systems perform and scale very well
CLOUD COMPUTING
Storage : 1. File Level Virtualization– Distributed File System
Lustre- Distributed File Systems with Centralized Metadata
• Lustre is a massively parallel, scalable distributed file system for Linux which employs a cluster-based
architecture with centralized metadata.
• This is a software solution with an ability to scale over thousands of clients for a storage capacity of petabytes
with high performance I/O throughput.
• The architecture of Lustre includes the following three main functional components, which can either be on
the same nodes or distributed on separate nodes communicating over a network:
1. Object storage servers (OSSes), which store file data on object storage targets (OSTs).
2. A single metadata target (MDT) that stores metadata on one or more metadata servers (MDS).
3. Lustre clients that access the data over the network using a POSIX interface (i.e. a standard set of APIs).
CLOUD COMPUTING
Storage : 1. File Level Virtualization – Distributed File System
Lustre architecture (Functioning)
• When a client accesses a file, it does a filename lookup on a
MDS.
• Then, MDS creates a metadata file on behalf of the client or
returns the layout of an existing file.
• The client then passes the layout to a logical object volume
(LOV) for read or write operations.
• The LOV maps the offset and size to one or more objects,
each residing on a separate OST.
• The client then locks the file range being operated on and
executes one or more parallel read or write operations
directly to the OSTs.
(A logical object volume (LOV) aggregates the object storage
clients to provide transparent access across all the OSTs.)
CLOUD COMPUTING
Storage : 1. File Level Virtualization – Distributed File System
Distributed File Systems with Distributed Metadata
• In distributed metadata management, metadata is distributed across all nodes in the system,
rather than using centralized metadata servers.
• Such systems have greater complexity than centralized metadata systems, since the
metadata management is spread over all the nodes in the system.
CLOUD COMPUTING
Storage : 1. File Level Virtualization – Distributed File System
GlusterFS - Distributed File Systems with Distributed Metadata
• GlusterFS is an open-source, distributed cluster file system without a centralized metadata server.

• It is also capable of scaling to thousands of clients, Petabytes of capacity and is optimized for high performance.
• GlusterFS employs a modular architecture with a stackable user-space design. It aggregates multiple storage bricks on
a network (over InfiniBand RDMA or TCP/IP interconnects) and delivers them as a network file system with a global
namespace.
• It consists of just two major components: a Client and a Server.
• The Gluster server clusters all the physical storage servers and exports the combined disk space of all servers as a
Gluster file system.
• The Gluster client is actually optional and can be used to implement highly available, massively parallel access to
every storage node; it handles the failure of any single node transparently
CLOUD COMPUTING
Storage : 1. File Level Virtualization – Distributed File System
Gluster FS architecture
• GlusterFS uses the concept of a storage brick consisting of a server that is attached to storage directly (DAS) or through a
SAN.
• Local file systems (ext3, ext4) are created on this storage.
• Gluster employs a mechanism called translators to implement the file system
capabilities.
• Translators are programs (like filters) inserted between the actual content of a
file and the user accessing the file as a basic file system interface
• Each translator implements a particular feature of GlusterFS.
• Translators can be loaded both in client and server side appropriately to improve or achieve new functionalities
• Gluster performs very good load balancing of operations using the I/O Scheduler translators
• GlusterFS also supports file replication with the Automatic File Replication (AFR) translator, which keeps identical copies of
a file/directory on all its subvolumes
CLOUD COMPUTING
Storage : 2. Block Virtualization
• Block-level virtualization virtualizes multiple physical disks and presents them as a single logical disk.
• The data blocks are mapped to one or more physical disks sub-systems.
• These block addresses may reside on multiple storage sub-systems, appearing however as a single storage
(logical) storage device.
• Block level storage virtualization can be performed at three levels:
a. Host-based
b. Storage level
c. Network level

CLOUD COMPUTING
Storage : 2. Block Virtualization
Host-Based block virtualization
• Uses a Logical Volume Manager (LVM), a virtualization layer that supports the creation of a
storage pool, by combining multiple disks (as in the picture), that is larger than any single
physical disk from which the logical storage is created.
• This also allows transparent allocation and management of disk space for file systems or raw
data with capabilities to dynamically shrink or increase physical volumes.
Storage-Device level virtualization
• Creates Virtual Volumes over the physical storage space of the specific storage subsystem.
• Storage disk arrays provide this form of virtualization using RAID techniques. Array controllers
create Logical UNits (LUNs) spanning across multiple disks in the array in RAID Groups. Some
disk arrays also virtualize third-party external storage devices attached to the array.
• This technique is generally host-agnostic and has low latency, since the virtualization is a part
of the storage device itself, implemented in the firmware of the device.
CLOUD COMPUTING
Storage : 2. Block Virtualization
Network-Based block virtualization
• This is the most commonly implemented form of scalable virtualization.
• Virtualization functionality is implemented within the network connecting hosts
and storage, say a Fibre Channel Storage Area Network (SAN).
• There are broadly two categories based on where the virtualization functions
are implemented: either in switches (routers) or in appliances (servers).
• In a switch-based network virtualization, the actual virtualization occurs in an
intelligent switch in the fabric and the functionality is achieved when it works in
conjunction with a metadata manager in the network. Example: HP StorageWorks
SAN Virtualization Services Platform
• In an appliance-based approach, the I/O flows through an appliance that controls the virtualization layer.
Example: IBM SAN Volume Controller
CLOUD COMPUTING
Storage : 2. Block Virtualization
Switch-based network virtualization
Appliance-based network virtualization
CLOUD COMPUTING

Storage : 2. Block Virtualization
Network-Based block virtualization
• There are broadly two variations of an appliance-based implementation.
• The appliance can either be in-band or out-of-band.
• In in-band, all I/O requests and their data pass through the virtualization device and the clients do not
interact with the storage device at all. All I/O is performed by the appliance on behalf of the clients
• In out-of-band usage, the appliance only comes in between for metadata management (the control path),
while the data itself flows directly between the hosts and the storage (the data path).
Explain the above for a newbie, in depth, using the above words and sentences

Of course! Let’s go slowly and deeply through your material — using simple words, same terms, and easy-to-follow sentences — so even a newbie can
understand everything you pasted.

📚 Storage Virtualization (Simple, Deep Explanation)


🌩️ What is Storage Virtualization?
In your computer, data is stored on disks or tapes (physical devices).
Storage virtualization hides all the complexity of these devices.
To your application (software), it looks like one simple, big storage space (logical storage).
It groups many small devices into one big pool or splits a big device into smaller pools.

Example:

Imagine you have 10 small hard disks → virtualization can combine them into 1 big storage.
Or you have 1 huge hard disk → virtualization can split it into 10 smaller storage spaces.

✅ Benefits: Easy to manage, flexible, and no need to worry where the data is physically stored.
🗂️ Categories of Storage Virtualization
There are two main categories:

1. File Level Virtualization


2. Block Level Virtualization

1️⃣ File Level Virtualization (Working with FILES)


What is File Level Virtualization?
Your application (like Word, Photoshop, or Chrome) usually deals with files (documents, videos, etc.)
File level virtualization makes files accessible, even if they are stored on different servers or disks.

You see one big file system, not multiple scattered devices.

How?

It uses common file-sharing protocols like:

NFS (Network File System — for Linux/Unix)


CIFS (Common Internet File System — for Windows)
HTTP or WebDAV (for web access)

Benefit:
✔️You can move files behind the scenes without stopping users.
✔️You can optimize storage use easily.

📂 Distributed File System (DFS)


DFS spreads the file system across many servers.
You can:
See all files as if they are from one place.

Handle server failures easily.

Improve performance by accessing files faster.

There are two types of DFS:

(a) Centralized Metadata (e.g., Lustre)

Metadata = Information about files (like filename, size, location).


Here, one special server handles all metadata.

How Lustre works:

1. Client asks metadata server: "Where is my file?"


2. Metadata server replies.
3. Client reads/writes directly to the storage servers (called OSS — Object Storage Servers).

✅ Good for large files and high speed operations.


❗Problem: If metadata server is overloaded → slower performance.
(b) Distributed Metadata (e.g., GlusterFS)

Metadata is spread across all servers.

No single point of failure.


Each server manages part of the metadata.

How GlusterFS works:

Many bricks (pieces of storage) are combined.


Translators (small programs) help manage the files.
Can replicate files automatically for safety.

✅ Good for fault tolerance and high availability.
2️⃣ Block Level Virtualization (Working with BLOCKS)
What is Block Level Virtualization?
Applications don't see files, but blocks (raw storage units).
It combines multiple disks into one logical disk.

You get one big block storage, even though it is built from many physical disks.

There are three types:

(a) Host-Based Virtualization (using LVM)

LVM (Logical Volume Manager) runs on the server.


It combines multiple physical disks into logical volumes.
Applications store files as if using one big disk.

✅ Server controls and manages storage.


(b) Storage-Device Level Virtualization (using RAID)

RAID arrays inside storage devices handle virtualization.


Controllers combine multiple disks into LUNs (Logical Units).

✅ Fast, host-agnostic, managed inside storage box itself.


(c) Network-Based Virtualization (SAN)

Happens inside the network connecting servers and storage.

Common in SANs (Storage Area Networks).

Two types:

1. Switch-based — done inside intelligent switches (Example: HP SAN switches).


2. Appliance-based — using special devices (Example: IBM SAN Volume Controller).

In Appliance-based Virtualization:

In-Band Mode:
All data flows through the appliance.

Appliance manages everything (both data and metadata).


Out-of-Band Mode:
Only metadata (info about files) flows through the appliance.
Data goes directly between servers and storage.

🎯 Summary for Newbies:


| Concept | In Simple Words |
| --- | --- |
| Storage Virtualization | Hides real storage devices and shows one simple storage space |
| File Virtualization | Works with files: sharing, moving, replicating |
| Block Virtualization | Works with raw storage blocks |
| Host-based | Storage managed by the server |
| Storage-device level | Managed inside the storage box (RAID) |
| Network-based | Managed inside the network (SAN) |
| Lustre | Distributed file system with one metadata server |
| GlusterFS | Distributed file system with no single metadata server |
| In-band appliance | Appliance handles all data and commands |
| Out-of-band appliance | Appliance handles only commands (metadata) |


CLOUD COMPUTING
Object Storage
• Object storage is the prominently used approach when building cloud storage systems
• Object storage is different from block or file storage: it allows a user to store data in the
form of objects (essentially files in a logical view) by virtualizing the physical
implementation in a flat namespace, eliminating name collisions, and is accessed using REST HTTP APIs.
(Figure: block storage vs object storage. Objects contain additional descriptive properties which can be used for better indexing or management.)
CLOUD COMPUTING
Object Storage
Example of the Object
CLOUD COMPUTING
Object Storage
• Object storage :
• Data manipulated using GET, PUT, DELETE, UPDATE
• Use REpresentational State Transfer (REST) APIs
• A distillation of the way the Web already works
• Resources are identified by uniform resource identifiers (URIs)
• Resources are manipulated through their representations
• Messages are self-descriptive and stateless (XML)
• Multiple representations are accepted or sent
• Objects contain additional descriptive properties which can be used for better indexing or
management.
• Object storage also allows the addressing and identification of individual objects by more than just
file name and file path.
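
To illustrate the REST style of access, here is a sketch using Python's requests library against a hypothetical S3-compatible endpoint; real services additionally require authenticated/signed requests, which are omitted here:

```python
# Sketch of the REST style of object access: each object is a URI and is
# manipulated with plain HTTP verbs. The endpoint and bucket below are
# hypothetical, and real object stores require signed/authenticated requests,
# which are omitted in this sketch.
import requests

base = "https://objects.example.com/mybucket"

# PUT: store an object under a key
requests.put(f"{base}/photos/cat.jpg", data=b"...binary image bytes...")

# GET: retrieve it by the same URI
resp = requests.get(f"{base}/photos/cat.jpg")
print(resp.status_code, len(resp.content))

# HEAD: fetch only the metadata (size, type, custom headers)
meta = requests.head(f"{base}/photos/cat.jpg")
print(meta.headers.get("Content-Length"))

# DELETE: remove the object
requests.delete(f"{base}/photos/cat.jpg")
```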
CLOUD COMPUTING
Object Storage
• Object storage is built using scale-out distributed systems where each node, most often, actually runs on a local
filesystem.
• Object storage does not need specialized & expensive hardware and the architecture allows for the use of commodity
hardware
• The most critical tasks of an object storage system are :
• Data placement
• Automating management tasks, including durability and availability
• Typically, a user sends their HTTP GET, PUT, POST, HEAD, or DELETE request to any one node out from a set of nodes,
and the request is translated to physical nodes by the object storage software.
• The object storage software also takes care of the durability model by doing any one of the following:
• creating multiple copies of the object and chunking it
• creating erasure codes (erasure coding (EC) is a method of data protection in which data is broken into fragments,
expanded and encoded with redundant data pieces, and stored across a set of different locations or storage media)
• or a combination of these (a toy illustration of both approaches follows below)
• Supports activities towards Management, such as periodic health checks, self-healing, and data migration.
• Management is also made easy by using a single flat namespace, which means that a storage administrator can
manage the entire cluster as a single entity.
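
As a toy illustration of the durability options above, the sketch below contrasts full replication with a single XOR parity fragment. Real object stores use much stronger erasure codes (e.g. Reed-Solomon); this only shows why added redundancy can cost less than full copies:

```python
# Toy illustration of the two durability approaches mentioned above:
# full replication vs. a single XOR parity fragment. Real object stores use
# far more sophisticated erasure codes; this only shows the basic idea.

def replicate(obj: bytes, copies: int = 3):
    return [obj] * copies                       # 3x the storage

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode_with_parity(obj: bytes, k: int = 4):
    """Split obj into k equal fragments and add one XOR parity fragment."""
    obj = obj.ljust(((len(obj) + k - 1) // k) * k, b"\0")   # pad to a multiple of k
    size = len(obj) // k
    fragments = [obj[i * size:(i + 1) * size] for i in range(k)]
    parity = fragments[0]
    for frag in fragments[1:]:
        parity = xor_bytes(parity, frag)
    return fragments, parity                    # k+1 fragments, survives loss of any one

def recover(fragments, parity, lost_index):
    """Rebuild one lost fragment by XOR-ing the parity with the survivors."""
    rebuilt = parity
    for i, frag in enumerate(fragments):
        if i != lost_index:
            rebuilt = xor_bytes(rebuilt, frag)
    return rebuilt

frags, parity = encode_with_parity(b"object data to protect")
assert recover(frags, parity, lost_index=2) == frags[2]
print("replication copies:", len(replicate(b"object data to protect")))
```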
Object Storage
CLOUD COMPUTING
Object Storage: Illustration – Amazon Simple Storage Service (S3)
Amazon Simple Storage Service (S3)
• Amazon S3 is a highly reliable, highly available, scalable and fast storage in the cloud for storing and
retrieving large amounts of data just through simple web services.
• There are three ways of using S3. Most common operations can be performed via the AWS console (GUI
interface to AWS)
• For use of S3 within applications, Amazon provides a REST-ful API with familiar HTTP operations such as
GET, PUT, DELETE, and HEAD.
• There are libraries and SDKs for various languages that abstract these operations
• Since S3 is a storage service, several S3 browsers exist that allow users to explore their S3 account as if it
were a directory (or a folder). There are also file system implementations that let users treat their S3
account as just another directory on their local disk.
Object Storage

CLOUD COMPUTING- Amazon S3 – Organizing Data in S3
Organizing Data In S3: Buckets, Objects and Keys
• Data is stored as objects in S3.
• These objects in S3 are stored in resources called buckets
• S3 objects can be up to 5 Terabytes in size and there are no limits on the number of objects that
can be stored.
• Objects in S3 are replicated across multiple geographic locations to make it resilient to several
types of failures.
• Objects are referred to with keys– basically an optional directory path name followed by the
name of the object.
• If object versioning is enabled, recovery from inadvertent deletions and modifications is
possible.
Object Storage
CLOUD COMPUTING
Object Storage - Amazon Simple Storage Service (S3)
Organizing Data In S3: Buckets, Objects and Keys (Cont.)
• Buckets provide a way to keep related objects in one place and separate them from
others. There can be up to 100 buckets per account and an unlimited number of objects
in a bucket.
• Each object has a key, which can be used as the path to the resource in an HTTP URL.
• Example: if the bucket is named johndoe and the key to an object is resume.doc, then
its HTTP URL is http://s3.amazonaws.com/johndoe/resume.doc
or alternatively, http://johndoe.s3.amazonaws.com/resume.doc
• By convention, slash-separated keys are used to establish a directory-like naming
scheme for convenient browsing in S3
Object Storage
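
A short boto3 sketch of buckets, keys and objects (the bucket name is hypothetical, and AWS credentials are assumed to already be configured):

```python
# Sketch of working with S3 buckets, keys and objects using the boto3 SDK.
# Assumes AWS credentials are configured; the bucket name is hypothetical and
# must be globally unique in a real account.
import boto3

s3 = boto3.client("s3")
bucket = "johndoe-example-bucket"

# Store an object under a slash-separated key (directory-like naming)
s3.put_object(Bucket=bucket, Key="resumes/resume.doc", Body=b"...document bytes...")

# Retrieve it again by bucket + key
obj = s3.get_object(Bucket=bucket, Key="resumes/resume.doc")
print(obj["ContentLength"], obj["Body"].read()[:20])

# List everything under the 'resumes/' prefix, which S3 browsers show like a folder
for item in s3.list_objects_v2(Bucket=bucket, Prefix="resumes/").get("Contents", []):
    print(item["Key"], item["Size"])

# Delete the object
s3.delete_object(Bucket=bucket, Key="resumes/resume.doc")
```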
CLOUD COMPUTING
Object Storage - Amazon Simple Storage Service (S3)
Security
• Users can ensure the security of their S3 data by two methods
1. Access control to objects: Users can set permissions that allow others to access their

objects. This is accomplished via the AWS Management Console.
2. Audit logs: S3 allows users to turn on logging for a bucket, in which case it stores
complete access logs for the bucket in a different bucket. This allows users to see
which AWS account accessed the objects, the time of access, the IP address from
which the accesses took place and the operations that were performed. Logging can
be enabled from the AWS Management Console
Object Storage
CLOUD COMPUTING
Object Storage - Amazon Simple Storage Service (S3)
Data Protection
1. Replication:
• By default, S3 replicates data across multiple storage devices, and is designed to survive two
replica failures.
• It is also possible to request Reduced Redundancy Storage(RRS) for noncritical data. RRS
data is replicated twice, and is designed to survive one replica failure.
• S3 provides strong consistency if used in that mode and guarantees consistency among the
replicas.
2. Regions:
• For performance, legal, Availability and other reasons, it may be desirable to have S3 data
running in specific geographic locations.
• This can be accomplished at the bucket level by selecting the region that the bucket is
stored in during its creation.
Object Storage
CLOUD COMPUTING
Object Storage - Amazon Simple Storage Service (S3)
Data Protection
3. Versioning:
• If versioning is enabled on a bucket, then S3 automatically stores the full history of all
objects in the bucket from that time onwards.
• The object can be restored to a prior version and even deletes can be undone. This
guarantees that data is never inadvertently lost
Large Objects and Multi-part Uploads
• S3 provides APIs that allow the developer to write a program that splits a large object into
several parts and uploads each part independently.
• These uploads can be parallelized for greater speed to maximize the network utilization. If a

part fails to upload, only that part needs to be re-tried
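
A boto3 sketch of the multipart flow (bucket and key are hypothetical; in real S3 every part except the last must be at least 5 MB, and failed parts can simply be re-uploaded):

```python
# Sketch of a multipart upload with boto3: split a large object into parts,
# upload each part separately (could be parallelized / retried individually),
# then combine them. Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")
bucket, key = "johndoe-example-bucket", "backups/big-archive.bin"

upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
chunk = b"x" * (5 * 1024 * 1024)                # stand-in for a 5 MB chunk of the file

for part_number in (1, 2, 3):                   # in practice, read chunks from the large file
    resp = s3.upload_part(Bucket=bucket, Key=key,
                          UploadId=upload["UploadId"],
                          PartNumber=part_number, Body=chunk)
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

s3.complete_multipart_upload(Bucket=bucket, Key=key,
                             UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})
```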
CLOUD COMPUTING
Open Stack Swift – Another Illustration of Object Storage
CLOUD COMPUTING
Swift
• Swift partitions break the available storage into locations where data will be located, including account
databases, container databases or objects. The partition is the core of the replication system and is distributed across
all disks. It is like a moving bin in a warehouse. These are not Linux partitions. Adding a node leads to
reassignment of partitions.
• Account: an account is a user in the storage
system. Unlike volumes, Swift creates
accounts, which enables multiple
users and applications to access the
storage system at the same time.
• Container: containers are where
accounts create and store data.
Containers are namespaces used to group objects (conceptually like directories).
• Object: the actual data stored on the disk. This could be photos etc.
• Ring: maps the partition space to physical
locations on disk. It is like an encyclopedia;
instead of letters, Swift uses hashes for searching. (See the sketch after this list.)
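
A toy sketch of the ring idea: hash the /account/container/object path, take a few bits of the hash as the partition number, and look the partition up in a partition-to-device table. This mimics the concept only; Swift's real ring also handles replicas, zones and device weights.

```python
# Toy sketch of the ring idea: hash the /account/container/object path,
# use the top bits of the hash as a partition number, and look that partition
# up in a table mapping partitions to devices. Concept only; not Swift's
# actual ring implementation.
import hashlib

PART_POWER = 4                                   # 2**4 = 16 partitions (tiny, for illustration)
DEVICES = ["sdb1@node1", "sdb1@node2", "sdb1@node3"]

# partition -> device table ("the ring"); real rings are built by a ring builder
ring = {part: DEVICES[part % len(DEVICES)] for part in range(2 ** PART_POWER)}

def get_partition(path: str) -> int:
    digest = hashlib.md5(path.encode()).hexdigest()      # 128-bit hash as hex
    return int(digest, 16) >> (128 - PART_POWER)         # keep only the top PART_POWER bits

path = "/AUTH_alice/photos/cat.jpg"              # account/container/object
part = get_partition(path)
print(f"{path} -> partition {part} -> device {ring[part]}")
```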
CLOUD COMPUTING
NoSQL Database - DynamoDB
• DynamoDB is a NoSQL fully managed cloud-based document and a Key Value database available through
Amazon Web Services (AWS)
• All data in DynamoDB is stored in tables that you have to create and define in advance, though tables
have some flexible elements and can be modified later
• DynamoDB requires users to define only some aspects of tables, most importantly the structure of keys
and local secondary indexes, while retaining a schema-less flavor.
• DynamoDB enables users to query data based on secondary indexes too rather than solely on the basis of
a primary key
• DynamoDB supports only item-level consistency, which is analogous to row-level consistency in RDBMSs
• If consistency across items is a necessity, DynamoDB is not the right choice
• DynamoDB has no concept of joins between tables. Table is the highest level at which data can be

grouped and manipulated, and any join-style capabilities that you need will have to be implemented on
the application side
CLOUD COMPUTING
DynamoDB
• In DynamoDB, tables, items, and attributes are the core components
• A table is a collection of items and each item is a collection of attributes
• DynamoDB uses primary keys to uniquely identify each item in a table and
secondary indexes to provide more querying flexibility
• DynamoDB supports two different kinds of primary keys (illustrated in the sketch after this slide):
• Partition key – A simple primary key, composed of one attribute known as the
partition key. DynamoDB uses the partition key's value as input to an internal
hash function. The output from the hash function determines the partition
(physical storage internal to DynamoDB) in which the item will be stored.
• Partition key and sort key – Referred to as a composite primary key, this type of
key is composed of two attributes. The first attribute is the partition key and the
second attribute is the sort key. All items with the same partition key value are
stored together in sorted order by sort key value.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/dynamodb-dg.pdf#HowItWorks
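
A boto3 sketch of a table with a composite primary key (partition key + sort key); the table and attribute names below are illustrative, and the table is assumed to already exist in the account:

```python
# Sketch of a DynamoDB table with a composite primary key (partition key +
# sort key) using boto3. Table and attribute names are illustrative; the table
# is assumed to already exist with Artist as partition key and SongTitle as sort key.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Music")                  # hypothetical table name

# Each item is a collection of attributes; only the key attributes are mandatory
table.put_item(Item={"Artist": "No One You Know",
                     "SongTitle": "Call Me Today",
                     "AlbumTitle": "Somewhat Famous"})

# Query: all items that share a partition key value, returned sorted by the sort key
resp = table.query(KeyConditionExpression=Key("Artist").eq("No One You Know"))
for item in resp["Items"]:
    print(item["SongTitle"], item.get("AlbumTitle"))

# Get a single item by its full composite key
item = table.get_item(Key={"Artist": "No One You Know",
                           "SongTitle": "Call Me Today"}).get("Item")
print(item)
```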
CLOUD COMPUTING
DynamoDB
Secondary Indexes
• Users can create one or more secondary indexes on a table.
• A secondary index lets users query the data in the table using an alternate key, in
addition to queries against the primary key.
Partitions and Data Distribution
• DynamoDB stores data in partitions.
• A partition is an allocation of storage for a table, backed by solid state drives (SSDs) and
automatically replicated across multiple Availability Zones within an AWS Region.
• Partition management is handled entirely by DynamoDB
CLOUD COMPUTING
DynamoDB Architecture

Explain the above for a newbie, in depth, using the above words and sentences

Of course! I'll explain everything you posted, in depth, slowly, clearly, and without changing your words too much, so that a newbie can understand.
I'll keep the same flow and same structure but explain it like I'm teaching it to someone brand new.
Here’s the explanation:

☁️ CLOUD COMPUTING
📦 Object Storage
Object storage is a popular way to store data when building cloud storage systems.
It is different from traditional storage like block storage or file storage.
In object storage, data is stored as objects — think of an object like a file.
The way these objects are stored hides (or virtualizes) the real (physical) details.

Object storage uses a flat namespace. This means there are no folders or trees like in your computer — everything is on the same level.
Name collisions (two files having the same name) are avoided because of how it's built.
Object storage is usually accessed using REST HTTP APIs — meaning you can send messages over the internet to store, get, delete, or update objects.

📦 Block Storage vs Object Storage


Block Storage is like what you have on your computer's hard drive.
Object Storage stores not only the data but also extra information (metadata) about the object, making it easier to search, manage, and organize.

📦 More on Object Storage


In Object Storage, you work with data using actions like:
GET (read it),
PUT (save it),

DELETE (remove it),
UPDATE (modify it).
It uses REST APIs, which is a standard way of talking to web services.
The design follows the same ideas that the web uses:
Every object/resource has a URI (Uniform Resource Identifier) — like a web address.
Messages are self-descriptive (they explain themselves) and stateless (the server doesn't remember your previous messages).

Messages often use XML or similar formats.


Multiple versions (formats) of the same object can be accepted or sent.
Object storage lets you find and access data using more than just its name and path — you can use the extra descriptive data too!

📦 How Object Storage Works Internally


Object storage is built using distributed systems.

It uses many normal computers (called nodes) instead of special expensive servers.
The important jobs of an object storage system are:
Placing data smartly across the system,
Managing data automatically (making sure it's safe and available),
Health checks and self-healing if something goes wrong.
When a user sends a GET, PUT, POST, HEAD, or DELETE command, the request can go to any node, and the software figures out where the actual data
is.
To protect your data, it might:
Make multiple copies (replicas),
Use erasure coding (breaking it into pieces with extra info for rebuilding if part of it is lost),

Or both.
Flat namespace makes it easy for administrators to manage the whole storage system as if it's one big unit.

☁️ Amazon Simple Storage Service (S3)
📂 What is Amazon S3?
Amazon S3 is a cloud storage service.
It is:

Highly reliable,
Always available,
Scalable (can grow easily),
Fast.
You can use S3 in three ways:
1. AWS Console — a graphical interface (point and click).
2. REST API — using HTTP commands like GET, PUT, DELETE, HEAD.

3. SDKs and Libraries — for different programming languages.


You can also use S3 browsers that make S3 look like a normal folder on your computer.

📂 How Data is Organized in S3


Data in S3 is stored as objects.
These objects live inside buckets (think of buckets as folders).

A single object can be up to 5 Terabytes in size.


No limit on the number of objects you can store.
Objects are copied across multiple geographical locations for protection.
Each object is referred to by a key — like a path or name.
Example:
Bucket: `johndoe`
Key: `resume.doc`
URL:
`http://s3.amazonaws.com/johndoe/resume.doc`
or
`http://johndoe.s3.amazonaws.com/resume.doc`
You can make the key look like folders using slashes (`/`).
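For illustration (assuming the boto3 Python SDK and the example bucket/key above), a PUT and GET look roughly like this:

import boto3

s3 = boto3.client("s3")

# PUT: store an object under a key; slashes in the key only look like folders.
s3.put_object(Bucket="johndoe", Key="docs/resume.doc", Body=b"...file bytes...")

# GET: retrieve the same object back using bucket + key.
obj = s3.get_object(Bucket="johndoe", Key="docs/resume.doc")
print(obj["Body"].read())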

📂 Security in S3
Two main ways to protect your data:

1. Access Control:
You can control who can see or change your objects.
2. Audit Logs:

You can turn on logging to track who accessed your data, when, from where, and what they did.

📂 Data Protection in S3
1. Replication:
S3 copies your data multiple times across different devices.
If you choose Reduced Redundancy Storage (RRS), it keeps only two copies (good for non-important data).

2. Regions:
You can pick a geographical region where your data lives (e.g., USA, Europe) when you create a bucket.
3. Versioning:
If you enable versioning, S3 saves every version of an object.
You can recover old versions or undo deletes.

📂 Handling Large Files
S3 allows multipart uploads:

Break a large file into many smaller parts.


Upload parts in parallel to make it faster.
If one part fails, you can retry uploading just that part.
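A rough sketch (assuming boto3; the file and bucket names are made up): the SDK can switch to multipart upload automatically once a file crosses a size threshold.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files above 8 MB are split into parts and uploaded in parallel;
# a failed part can be retried without re-sending the whole file.
config = TransferConfig(multipart_threshold=8 * 1024 * 1024, max_concurrency=4)
s3.upload_file("big_backup.tar", "johndoe", "backups/big_backup.tar", Config=config)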

☁️ OpenStack Swift (Another Example of Object Storage)


📦 What is Swift?
Swift is another object storage system, part of OpenStack.
Swift uses Partitions to manage storage:
A partition is not a Linux partition.
It’s a way of dividing storage locations internally.
When new nodes are added, partitions are rearranged.
Swift Concepts:

Account: Like a user account.


Container: Like a folder — a namespace for grouping objects.
Object: The actual data (photos, documents, etc.).
Swift uses a Ring:
The Ring maps where data lives by using hashes.
Imagine an encyclopedia, but instead of "A-Z", Swift uses hashes to find data.

☁️ NoSQL Database - DynamoDB
📚 What is DynamoDB?
DynamoDB is a fully managed NoSQL database by Amazon AWS.
It stores data in tables.
Tables have a flexible schema:
You define keys and indexes, but not every detail.
You can query DynamoDB using secondary indexes, not just the primary key.
DynamoDB supports item-level consistency:
Like row-level consistency in traditional databases.

DynamoDB does not support joins:


If you want to join data from multiple tables, you must do it yourself in your code.

📚 Core Components of DynamoDB


Table: Collection of items.
Item: Single row.

Attribute: Single field (column).


Primary Keys:
Partition Key:
One attribute only.
Determines where the item is stored (by hashing the value).
Partition Key + Sort Key:
Two attributes.
Items with the same partition key are stored together and sorted by the sort key.

📚 Secondary Indexes in DynamoDB


You can create secondary indexes to:
Query using different keys, not just the primary key.

📚 Partitions and Data Distribution


DynamoDB divides storage into partitions.
Each partition is:
Stored on SSD (solid-state drives),
Replicated across multiple Availability Zones for safety.
DynamoDB handles partition management automatically.

Would you like me to also summarize all this into a simple mind map or bullet points to make it even easier to review quickly? 🌟
(Just say the word!)🚀

Partitioning
• Cloud applications typically scale using elastic computing resources to address variations in the workload.
This could lead to a bottleneck in the backend in spite of scaling the hosts, if they continue to use the same data
store.
• Partitioning is a way of intentionally breaking this large volume of data into smaller partitions of data, where
each of these partitions can be placed on a single node or distributed across different nodes in a cluster, such
that query or IO operations can be done across these multiple nodes supporting the performance and
throughput expectations of applications.
• Partitions are defined in such a way that each piece of data (if it's say a DB, then each record, row, or document)
belongs to exactly one partition. There could be many operations which will touch multiple partitions at the same time
• Each node can independently execute the queries or do IO and operate on these single (own) partition enabling
scaling of throughput with the addition of more nodes.
Eg. Partitioning a large piece of data to be written to disk into multiple partitions and distributing these
partitions to multiple disks will lead to better total IO performance.
CLOUD COMPUTING
Partitioning
• Thus large, complex queries or large IO workloads can potentially be parallelized across many nodes
• Partitioning is usually combined with replication so that copies of each partition are stored on multiple nodes.
This means that, even though each record belongs to exactly one partition, it may still be stored on several
different nodes for fault tolerance. (We will discuss about replication in subsequent sessions)
• A node may store more than one partition
CLOUD COMPUTING
Goal of Partitioning and some terminologies
• Goal of partitioning is to spread the data and the query load evenly across nodes.
• If some partitions have more data than others, we call it skewed
• The presence of skew makes partitioning much less effective. In an extreme case, all the load could end up
on one partition.
• A partition with disproportionately high load is called a hot spot
CLOUD COMPUTING
Different Approaches of Partitioning
There are 4 different approaches of partitioning
1. Vertical Partitioning
▪ In this approach, data is partitioned vertically.
▪ It's also called column partitioning, where one set of columns is stored on one data store and another set on a
different data store (could be on a different node), and data is distributed accordingly.
▪ In this approach no two critical columns are stored together, which improves performance
2. Workload Driven Partitioning
▪ Data access patterns generated from the application are analysed and partitions are formed according to
that. This improves the scalability of transactions in terms of throughput and response time
CLOUD COMPUTING
Different Approaches of Partitioning
3. Partitioning by Random Assignment
▪ Random assignment of records to nodes would be the simplest approach for avoiding hot spots.

▪ This would distribute the data quite evenly across the nodes,
▪ The disadvantage of this would be, when trying to read a particular item, you have no way of knowing
which node it is on, so you have to query all nodes in parallel.
4. Horizontal Partitioning
▪ This is a static approach of horizontally partitioning data to store it on different nodes
▪ Once it's partitioned, this would not change.
▪ There are different techniques which are used for horizontal partitioning like
1. Partitioning by Key Range
2. Partitioning using the Schema
3. Partitioning using Graph Partitioning
4. Partitioning using Hashing
CLOUD COMPUTING
4.1 Partitioning by Key-Range
• Range partitioning is partitioning the cloud data by using ranges of keys, assigning a continuous
range of keys to each partition (from some minimum to some maximum).
• The ranges of keys should be adjacent but not overlapping
• Range of keys is decided on the basis of some conditions or operators.
• If we know the boundaries between the ranges, we can easily determine which partition contains a given key
• If we also know which partition is assigned to which node, then we can make the request directly to the appropriate
node
• The ranges of keys are not necessarily evenly spaced, because data may not be evenly distributed. Within each
partition, we can keep keys in sorted order.
• Example:
• Consider an application that stores data from a network of sensors, where the key is the timestamp of the
measurement (year-month-day-hour-minute-second).
• Range scans are very useful in the case where all the readings for a particular month need to be fetched.
• The disadvantage of key range partitioning is that certain access patterns can lead to hot spots
• If the key is a timestamp, all writes end up going to the same partition (the one for today), so that partition can be
overloaded with writes while others are idle
• To avoid this problem, we can use some attribute other than the timestamp as the first element of the key
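A minimal Python sketch (not from the slide; the boundaries and partition names are hypothetical) of how known range boundaries let a request be routed directly to the right partition:

import bisect

# Upper boundary of each partition's key range; the last partition is open-ended.
boundaries = ["2024-01-31", "2024-02-29", "2024-03-31"]
partitions = ["p0", "p1", "p2", "p3"]

def partition_for(key: str) -> str:
    # The first boundary greater than the key identifies the owning partition.
    return partitions[bisect.bisect_right(boundaries, key)]

print(partition_for("2024-02-14"))   # -> p1 (all of February's readings)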
CLOUD COMPUTING
4.2-4.3 : Schema Based and Graph Partitioning :
4.2 Schema Based Partitioning
• Schema partitioning is basically designed to minimize distributed transactions.
• In this approach, the database schema is partitioned in such a way that related rows are kept in the same partition instead
of separating them into different partitions.
4.3 Graph Partitioning
• Graph partitioning is a workload-based static partitioning in which partitions are made by analysing the pattern
of data access.
• Once partitioning is done, the workload is not observed for changes and there is no re-partitioning
• We could have a workload-based dynamic partitioning strategy too, which we will discuss as part of
partition rebalancing
CLOUD COMPUTING
4.4 Partitioning by Hash of Key
• Hash partitioning maps data to partitions based on a hashing algorithm that is applied to the partitioning key
identified.
• The hashing algorithm evenly distributes rows among partitions, giving partitions approximately the same size
and thus avoids the risk of skew and hot spots
• Hash partitioning is also an easy-to-use alternative to range-partitioning, especially when the data to be
partitioned is not historical or has no obvious partitioning key
• Using a suitable hash function for keys, you can assign each partition a range of hashes (rather than a range of
keys), and every key whose hash falls within a partition’s range will be stored in that partition.
https://notes.shichao.io/dda/ch6/#:~:text=Document%2Dpartitioned%20indexes%20(local%20indexes,scatter%2Fgather%20across%20all%20partitions.
CLOUD COMPUTING
Different Approaches of Partitioning : Partitioning by Hash of Key (Cont.)
• By using the hash of the key for partitioning we lose the ability to do efficient range queries
• Some approaches to circumvent this would be the concatenated index (or composite key) approach, which
can model one-to-many relationships.
Example: On a social media site, one user may post many updates. If the primary key for updates is chosen to
be (user_id, update_timestamp), then you can efficiently retrieve all updates made by a particular user
within some time interval, sorted by timestamp.
• Different users may be stored on different partitions, but within each user, the updates are stored ordered
by timestamp on a single partition.
Example :
CREATE PARTITION FUNCTION myRangePF1 (int)
AS RANGE LEFT FOR VALUES (1, 100, 1000);
GO
CREATE PARTITION SCHEME myRangePS1
AS PARTITION myRangePF1
TO (test1fg, test2fg, test3fg, test4fg);

GO
CREATE TABLE PartitionTable (col1 int, col2 char(10))
ON myRangePS1 (col1);
GO
• Partition test1fg will contain tuples with col1 values <= 1
• Partition test2fg will contain tuples with col1 values > 1 and <= 100
• Partition test3fg will contain tuples with col1 values > 100 and <= 1000
• Partition test4fg will contain tuples with col1 values > 1000
CLOUD COMPUTING
4.4 Distributed Hashing
• The hash table discussed earlier may need to be split into several parts and stored on different servers to
work around the memory limitations of a single system in creating and keeping these large hash tables
• In these environments the data partitions and their keys (the hash table) are distributed among many
servers (thus the name of distributed hashing)
• The mechanism for distributing the keys onto different servers could be a simple hash modulo of the
number of servers in the environment. In detail: for the partition key under consideration, compute the hash
and then, by using modulo of the number of servers, determine which server the key and the partition will
need to be stored on or read from.
• These setups also typically consist of a pool of caching servers that host many key/value pairs and are used to
provide fast access to data. If the query for a key's partition information cannot be answered from the cache,
then the data is retrieved from the distributed hash table as a cache miss.
So if a client needs to retrieve the partition with key say “Name”, then it looks it up in the cache and gets the
location of the partition. If the cache entry for “Name” does not exist, then it computes the hash and the mod of the
number of servers to determine which server the partition exists on, then retrieves the
partition information, and also populates it back into the cache for future lookups.
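A small illustrative sketch (MD5 as the hash and three made-up server names; not any particular system) of the hash-mod-N lookup with a cache in front of it:

import hashlib

SERVERS = ["server0", "server1", "server2"]
cache = {}   # key -> server, the fast path

def server_for(key: str) -> str:
    # Stable hash of the key, then modulo the number of servers picks the owner.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

def lookup(key: str) -> str:
    if key not in cache:              # cache miss: compute and remember for next time
        cache[key] = server_for(key)
    return cache[key]

print(lookup("Name"))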
CLOUD COMPUTING
4.4 Distributed Hashing – Rehashing Problem
• This Distributed hashing scheme is simple, intuitive, and works fine .. but may have issues in say the following
two scenarios
• What if we add a few more servers into the pool for scaling
• What if one of the servers fail
• Keys need to be redistributed to account for the missing server or to account for new servers in the pool
• This is true for any distribution scheme, but the problem with our simple modulo distribution is that when the
number of servers changes, most hashes modulo N will change, so most keys will need to be moved to a
different server. So, even if a single server is removed or added, all keys will likely need to be rehashed into a

different server. E.g. let's say Server 2 (of servers 0, 1 and 2) failed. Note that all key locations
changed, not only the ones from the failed server (the one representing 2 in hash mod 3).
In the typical caching use case discussed earlier, this would mean that,
all of a sudden, the keys won't be found because they won't yet be
present at their new location.
So, most queries will result in misses, and the original data will likely
need retrieving again from the source to be rehashed, thus placing a
heavy load on the origin
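A quick way to see the problem (same assumptions as the sketch above: MD5 hash, made-up keys) is to count how many keys change owner when the server count changes:

import hashlib

def owner(key: str, n_servers: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_servers

keys = [f"key-{i}" for i in range(10_000)]
moved = sum(owner(k, 3) != owner(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when going from 3 to 4 servers")
# Roughly 75% of keys change servers even though only one server was added,
# so most cached locations suddenly become wrong.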
CLOUD COMPUTING
Consistent Hashing
• Addressing this would need a distribution scheme that does not depend directly on
the number of servers, so that, when adding or removing servers, the number of
keys that need to be relocated is minimized.
• Consistent Hashing is a distributed hashing scheme that operates independently of
the number of servers or objects in a distributed hash table by assigning them a
position on an abstract circle, or hash ring. This allows servers and objects to scale
without affecting the overall system.
[Figure: hash ring showing the hash outputs of partition keys and of server names placed on the same circle, with all hash values fitting linearly around it.]
• Imagine we mapped the hash output range onto the edge of a circle. That means that the minimum possible hash value,
zero, would correspond to an angle of zero, the maximum possible value would correspond to an angle of 360 degrees, and
all other hash values would linearly fit somewhere in between.
• So, we could take the keys, compute their hash, and place them on the circle's edge. We can then consider the servers too and,
using their names, compute a hash and place them also on the edge of the circle.
• Since we have the keys for both the objects and the servers on the same circle, we can define a simple rule to associate the
former with the latter:
• Each object key will belong in the server whose key is closest, in a counterclockwise direction (or clockwise, depending on
the conventions used). Thus to find out which server to ask for a given key, we need to locate the key on the circle and move
in the ascending angle direction until we find a server.
• Now if a server is added and placed on the circle, then only the keys closest to it in the clockwise direction will need to
be moved and the rest of them will not need changes. If a server fails, it's similar: only the keys behind it will need to change.
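A minimal consistent-hashing ring in Python (a sketch only: MD5 positions, no virtual nodes, made-up node names) showing that adding a node relocates only a small fraction of keys:

import bisect
import hashlib

def position(value: str) -> int:
    # Place a key or a server name on the ring by hashing it.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((position(n), n) for n in nodes)

    def node_for(self, key: str):
        # Walk in the ascending-hash direction to the next node, wrapping around.
        positions = [p for p, _ in self.ring]
        idx = bisect.bisect(positions, position(key)) % len(self.ring)
        return self.ring[idx][1]

    def add_node(self, node: str):
        bisect.insort(self.ring, (position(node), node))

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
moved = sum(before[k] != ring.node_for(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding a node")   # only a fraction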
CLOUD COMPUTING

Partitioning and Secondary Indexes
• We have seen till now that If records are only accessed via their primary key, we can
determine the partition from that key, and use it to route read and write requests to the
partition responsible for that key.
• A secondary index usually doesn’t identify a record uniquely but rather is a way of
searching for occurrences of a particular value
E.g. find all actions by user 123, find all articles containing the word hogwash,
find all cars whose color is red, and so on.
• Secondary indexes don’t map neatly to partitions.
• Two main approaches to partitioning a database with secondary indexes:
1. Document-based partitioning
2. Term-based partitioning
CLOUD COMPUTING
Partitioning with secondary Indexes : Document based Partitioning
• In this indexing approach, each partition is completely separate
• Each partition maintains its own secondary indexes, covering only the documents in that
partition. It doesn’t care what data is stored in other partitions.
• Whenever you need to write to the database, you only need to deal with the partition that
contains the document ID that you are writing. For that reason, a document-partitioned
index is also known as a local index
• Reading from a document-partitioned index requires querying all partitions and
combining all the results.
• This approach to querying a partitioned database is sometimes known as scatter/gather
and it can make read queries on secondary indexes quite expensive.
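A toy sketch of the scatter/gather read described above (in-memory dictionaries stand in for partitions and their local secondary indexes on "color"):

from concurrent.futures import ThreadPoolExecutor

# Each partition holds its own documents and a local index covering only them.
partitions = [
    {"docs": {1: {"color": "red"}, 2: {"color": "blue"}},
     "index": {"red": [1], "blue": [2]}},
    {"docs": {3: {"color": "red"}},
     "index": {"red": [3]}},
]

def query_local_index(partition, color):
    return [partition["docs"][doc_id] for doc_id in partition["index"].get(color, [])]

def scatter_gather(color):
    # The read is sent to every partition and the partial results are combined.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda p: query_local_index(p, color), partitions)
    return [doc for part in results for doc in part]

print(scatter_gather("red"))   # documents gathered from both partitions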
CLOUD COMPUTING
Partitioning with secondary Indexes : Document based Partitioning
• Even if the partitions are queried in parallel, scatter/gather is prone to tail latency amplification
• Most database vendors recommend that you structure your partitioning scheme so that secondary
index queries can be served from a single partition, but that is not always possible, especially when
we’re using multiple secondary indexes in a single query
CLOUD COMPUTING
Partitioning with secondary Indexes : Term based Partitioning
• Construct a global index that covers data in all partitions
• A global index must also be partitioned, but it can be partitioned differently from the primary key
index.

• This kind of index is called term-partitioned, because the term we’re looking for determines the
partition of the index.
CLOUD COMPUTING
Partitioning with secondary Indexes : Term based Partitioning
• We can partition the index by the term itself or using a hash of the term.
• Partitioning by the term itself can be useful for range scans, whereas partitioning on a hash of the
term gives a more even distribution of load.
• The advantage of a global (term-partitioned) index over a document-partitioned index is that it can
make reads more efficient: rather than doing scatter/gather over all partitions, a client only needs to
make a request to the partition containing the term that it wants.
• The downside of a global index is that writes are slower and more complicated, because a write to a
single document may now affect multiple partitions of the index (every term in the document might
be on a different partition on a different node).
• The global index needs to be up to date so that every document written to the database would
immediately be reflected in the index. That would require a distributed transaction across all
partitions affected by a write, which is not supported in all databases. Typically, these updates are
asynchronous
• Therefore, if you read the index shortly after a write, the change you just made may not yet be
reflected in the index

Explain the above in depth for a newbie, using the same words, letters, and sentences.

Of course! I’ll explain all of it carefully, using your words and ideas, but breaking it down even more clearly for someone new to Cloud Computing and
Partitioning.
I’ll keep the meaning same as what you gave, just explain it in very easy language, step-by-step.

Partitioning in Cloud Computing (Simplified for Beginners)

Why Partitioning is Needed


Cloud Applications use elastic computing, meaning they can add or remove servers based on how much work there is.

But even if you add many servers, the backend (like the database) can still become a bottleneck (slow point) if all servers keep accessing the same data
storage.
Partitioning helps by breaking big data into smaller pieces (called partitions).

These smaller partitions can be spread across different servers so that multiple servers can work in parallel to handle queries and storage.

🔵 Example:
Instead of writing all data into one disk, you divide the data into parts and write each part to a different disk → faster overall writing and reading.

How Partitioning Helps


Each server handles its own partition independently.
More servers = More partitions = Higher performance.
Large queries or heavy IO (input/output) operations can be parallelized across many servers.
Replication (copying partitions to multiple servers) helps for fault tolerance (we’ll discuss this later).

A server can hold one or more partitions.

Key Goals and Terminology


Goal: Spread data and query load evenly across servers.
Skewed Partition: Some partitions have more data than others → bad for performance.
Hotspot: A partition that gets too much load compared to others.

Different Approaches of Partitioning


There are four main ways to do partitioning:

1. Vertical Partitioning

Break columns into different partitions.
Example: In a table, keep important columns (like name, age) separately from other columns (like address, hobbies).
It’s useful in Relational Databases.

2. Workload-Driven Partitioning

Analyze how the application accesses data.

Create partitions based on usage patterns to make reads and writes faster.

3. Partitioning by Random Assignment

Randomly assign records to servers to avoid hotspots.


Problem: You don’t know which server has the data you want → you may have to ask all servers.

4. Horizontal Partitioning

Divide rows across different servers.


It’s a static method (doesn’t change over time).

4 ways to do horizontal partitioning:

4.1 Partitioning by Key Range

Use key ranges (like values between 1-100, 101-200, etc.) to partition data.
Example: Sensor readings by timestamp (year-month-day).

Problem: If new data always goes to the same range, it can cause hotspots.
Solution: Use another attribute besides timestamp.

4.2 Schema-Based Partitioning

Keep related rows together.


Reduces distributed transactions (where you have to fetch from multiple partitions).

4.3 Graph Partitioning

Look at the data access patterns (like a graph of who talks to who) and partition accordingly.
It’s static — once partitioned, it doesn't change unless you manually rebalance later.

4.4 Partitioning by Hash of Key

Apply a hash function to the key to decide which partition it goes into.
Hashing spreads data evenly and avoids hotspots.
Problem: Hashing makes range queries (getting a set of records between two values) difficult.
Solution: Use a composite key (like combining user ID and timestamp).

🔵 Example SQL:
CREATE PARTITION FUNCTION myRangePF1 (int)
AS RANGE LEFT FOR VALUES (1, 100, 1000);

CREATE PARTITION SCHEME myRangePS1
AS PARTITION myRangePF1
TO (test1fg, test2fg, test3fg, test4fg);

CREATE TABLE PartitionTable (col1 int, col2 char(10))
ON myRangePS1 (col1);

Meaning:

Values <= 1 → go to partition test1fg


1 < value <= 100 → test2fg, etc.

4.4 Distributed Hashing

Sometimes, hash tables are too big for one server → so split across many servers.
Simple method: Take the hash of a key and mod it by the number of servers to decide where the key goes.
Problem:
Adding/removing servers means all keys change.

It causes cache misses and high load because data needs to be re-fetched.

Consistent Hashing
A better way to handle server changes.
Map keys and servers onto a circle (hash ring).
Each key goes to the next server in clockwise direction.

Adding/removing a server only affects nearby keys — most keys stay on the same server.

Partitioning and Secondary Indexes


So far, we assumed queries are made using the primary key.
What if we want to search by other attributes? (e.g., find all users with color "red").

Two methods:

Document-Based Partitioning

Each partition has its own local index.

Good for writes (only the partition containing the record is updated).

Bad for reads (you must ask all partitions to find the data → scatter/gather → slow).

Term-Based Partitioning (Global Index)

Create a global index that covers all partitions.


Partition the index itself by the term or hash of term.
Good for reads (only check one partition).
Bad for writes (writing a new document might update multiple partitions asynchronously).

Summary
Partitioning divides big data into smaller pieces for scalability and performance.
Several strategies exist: vertical, workload-based, random, and horizontal.
Horizontal partitioning has many subtypes (key range, hash, schema, graph).
Distributed Hashing is powerful but needs consistent hashing to handle server changes smoothly.
Secondary indexes need special care to partition properly for fast queries.

Would you also like me to draw a simple diagram (like a circle for consistent hashing, and how partitioning looks)? 🎨
It will make it even easier to understand!🚀

Skewed Workloads and Relieving Hot Spots


• We saw that hashing a key to determine its partition can help reduce hot spots. However, it can’t
avoid them entirely
• In the extreme case where all reads and writes are for the same key, all requests will be routed to the
same partition.
• Most data systems may not be able to automatically compensate for such a highly skewed workload.
This may be handled by simple techniques like adding a random number to the beginning or end
of a key that is known to be very hot, though the applications may need to support
this. This addresses the problem only in a limited fashion and can lower read performance, as data will
need to be read from all the key variants and combined.
• Distributed hashing, discussed earlier, will help in distributing the data and supporting the throughput, but
there will always be scenarios where a workload may need to be moved across nodes.
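A small sketch of the random-suffix technique mentioned above (the split count of 8 and the key name are arbitrary):

import random

N_SPLITS = 8   # how many variants the hot key is spread over

def salted_key(hot_key: str) -> str:
    # Writes go to hot_key-0 .. hot_key-7, landing on different partitions.
    return f"{hot_key}-{random.randrange(N_SPLITS)}"

def all_salted_keys(hot_key: str):
    # Reads must now fetch every variant and combine the results.
    return [f"{hot_key}-{i}" for i in range(N_SPLITS)]

print(salted_key("celebrity-123"))
print(all_salted_keys("celebrity-123"))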
CLOUD COMPUTING
Rebalancing Partitions
• Since over time, things change like
• The query throughput in a Database increases, so you want to add more CPUs to handle the load.
• The dataset size increases, so you want to add a new partition with appropriate capacity
• A machine fails, and other machines need to take over the failed machine's responsibilities.
• A machine joins in: a new consumer or a recovered machine
• All of these changes call for data and requests to be moved from one node to another. The process of moving
load from one node in the cluster to another is called rebalancing.
• Rebalancing is expected to meet some minimum requirements regardless of the partitioning scheme:
• After rebalancing, the load (data storage, read and write requests) should be shared fairly between the
nodes in the cluster.
• While rebalancing is in progress, the database should continue to accept reads and writes.
• No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize
the network and disk I/O load.
https://notes.shichao.io/dda/ch6/#:~:text=Document%2Dpartitioned%20indexes%20(local%20indexes,scatter%2Fgather%20across%20all%20partitions
CLOUD COMPUTING
Strategies for Rebalancing Partitions: Hash based rebalancing : hash mod N
• We had discussed as part of distributed hashing, an approach of distribution of the keys onto different
servers using a simple hash modulo of the number of servers in the environment
• So, we partition by the hash of a key, and divide the possible hashes into ranges and assign each range to a
partition (e.g., assign key to partition 0 if 0 ≤ hash(key) < b0, to partition 1 if b0 ≤ hash(key) < b1, etc.) and
just use mod (the % operator in many programming languages) to associate a partition to a node.
E.g. Say if we have 10 nodes (numbered from 0 to 9), hash(key) mod 10 would return a number between 0
and 9 (if we write the hash as a decimal number, the hash mod 10 would be the last digit) which seems like
an easy way of assigning each key to a node.
• We also discussed that in this approach if the number of nodes N changes, most of the keys will need to be
moved from one node to another.
• So this frequent moves make rebalancing using this approach to be excessively expensive.

CLOUD COMPUTING
Strategies for Rebalancing Partitions : Fixed number of partitions
• Create many more partitions than there are nodes and assign several partitions to each node
• If a node is added to the cluster, the new node can steal a few partitions from every existing node until
partitions are fairly distributed once again
Eg. If there are 4 nodes and 20 partitions, and we add an additional node.
CLOUD COMPUTING
Strategies for Rebalancing Partitions : Fixed number of partitions (Cont.)
• Only entire partitions are moved between nodes.
• The number of partitions does not change, nor does the assignment of keys to partitions. The only thing that
changes is the assignment of partitions to nodes
• This change of assignment is not immediate. It takes some time to transfer a large amount of data over the
network so the old assignment of partitions is used for any reads and writes that happen while the transfer
is in progress.
• Choosing the right number of partitions is difficult if the total size of the dataset is highly variable
• The best performance is achieved when the size of partitions is “just right,” neither too big nor too small,
which can be hard to achieve if the number of partitions is fixed but the dataset size varies.
• A fixed number of partitions is operationally simpler. So many fixed-partition databases choose not to
implement partition splitting but use a fixed number of partitions
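A minimal sketch (not any particular database's algorithm) of the idea above: 20 partitions spread over 4 nodes, and a new node that steals whole partitions until the assignment is roughly even again.

N_PARTITIONS = 20
assignment = {p: f"n{p % 4}" for p in range(N_PARTITIONS)}      # 4 nodes, 5 partitions each

def add_node(assignment, new_node, all_nodes):
    target = len(assignment) // len(all_nodes)                  # fair share per node
    counts = {n: 0 for n in all_nodes}
    for n in assignment.values():
        counts[n] += 1
    moved = []
    for p, n in sorted(assignment.items()):
        if counts[new_node] >= target:
            break
        if counts[n] > target:                                  # this node has a surplus
            assignment[p] = new_node
            counts[n] -= 1
            counts[new_node] += 1
            moved.append(p)
    return moved

moved = add_node(assignment, "n4", [f"n{i}" for i in range(5)])
print("moved partitions:", moved)   # only ~4 of 20 partitions change node;
                                    # keys never change partition, only partitions change node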
CLOUD COMPUTING
Strategies for Rebalancing Partitions : Dynamic Partitioning
• For databases that use key range partitioning, a fixed number of partitions with fixed boundaries would be very
inconvenient if the boundaries are wrong. All of the data could end up in one partition and all of the other
partitions could remain empty
• Alternatively if the partitions can dynamically be created, say when a partition grows to exceed a configured size,
it is split into two partitions so that approximately half of the data ends up on each side of the split. One of its
two halves can be transferred to another node in order to balance the load.
If lots of data is deleted & a partition shrinks below some threshold, it can be merged with an adjacent partition
• Advantage of this dynamic partitioning is that the number of partitions adapts to the total data volume
• Each partition is assigned to one node and each node can handle multiple partitions
• In this approach if we start off with a single partition (say like an empty DB), all writes have to be processed by a
single node while the other nodes sit idle.
• Dynamic partitioning is not only suitable for key range partitioned data, but can equally well be used with hash
partitioned data
CLOUD COMPUTING

Strategies for Rebalancing Partitions : Partitioning proportionally to nodes
• With dynamic partitioning, the number of partitions is proportional to the size of the dataset, since
the splitting and merging processes keep the size of each partition between some fixed minimum
and maximum
• With a fixed number of partitions, the size of each partition is proportional to the size of the dataset.
• In both the above cases, the number of partitions is independent of the number of nodes
• Make the number of partitions proportional to the number of nodes—in other words, to have a fixed
number of partitions per node
• The size of each partition grows proportionally to the dataset size while the number of nodes remains
unchanged, but when you increase the number of nodes, the partitions become smaller again
• Since a larger data volume generally requires a larger number of nodes to store, this approach also
keeps the size of each partition fairly stable.
CLOUD COMPUTING
Strategies for Rebalancing Partitions : Partitioning proportionally to nodes
• When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to
split, and then takes ownership of one half of each of those split partitions while leaving the other
half of each partition in place
• The randomization can produce unfair splits, but when averaged over a larger number of
partitions, the new node ends up taking a fair share of the load from the existing nodes.
• Picking partition boundaries randomly requires that hash-based partitioning is used
CLOUD COMPUTING
Request Routing
• Consider a partitioned data distributed across multiple nodes running on multiple machines
• If a client wants to make a request, how does it know which node to connect to?
• As partitions are rebalanced, the assignment of partitions to nodes change. So there is a need to stay on
top of those changes in order to provide information like which IP address and port
number to connect to.
• This is typically called the service discovery problem which has been addressed
with approaches as below.
Approach 1
Allow clients to contact any node (e.g., via a round-robin load balancer). If that
node coincidentally owns the partition to which the request applies, it can handle
the request directly; otherwise, it forwards the request to the appropriate node,
receives the reply and passes the reply along to the client.
CLOUD COMPUTING

Request Routing
Approach 2
Send all requests from clients to a routing tier first, which determines
the node that should handle each request and forwards it accordingly.
This routing tier does not itself handle any requests; it only acts as a
partition-aware load balancer.
Approach 3
Require that clients be aware of the partitioning and the assignment of
partitions to nodes. In this case, a client can connect directly to the
appropriate node, without any intermediary
CLOUD COMPUTING
Request Routing
Approach 4 : ZooKeeper
• Many distributed data systems rely
on a separate coordination service
such as ZooKeeper to keep track of
this cluster metadata
• Each node registers itself in ZooKeeper,
which maintains the authoritative
mapping of partitions to nodes.
• Other actors, such as the routing tier
or the partitioning-aware client, can subscribe to this information in ZooKeeper.
• Whenever a partition changes ownership or a node is added or removed, ZooKeeper notifies the
routing tier so that it can keep its routing information up to date.
CLOUD COMPUTING
Request Routing
ZooKeeper- A Distributed Coordination Service for Distributed Applications
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Index
• HBase, SolrCloud and Kafka use ZooKeeper to track partition assignment
• (Revised versions of Kafka have discontinued ZooKeeper and this logic is replaced with KRaft)
• LinkedIn’s Espresso uses Helix for cluster management (which in turn relies on ZooKeeper),
implementing a routing tier
• Cassandra and Riak use a gossip protocol among the nodes to disseminate any changes in cluster
state. Requests can be sent to any node, and that node forwards them to the appropriate node for
the requested partition

Explain the above in depth for a newbie, using the same words, letters, and sentences.

Of course! Let’s go through everything step-by-step, keeping it simple, clear, and using your own words, structure, and sentences.
I’ll explain like you’re new but keep the important technical terms so you’re ready for real-world cloud computing discussions.

🌟 Skewed Workloads and Hot Spots


We know hashing a key (using a formula to spread data) helps avoid one server being too busy (hot spot).
BUT... it cannot stop hot spots completely.
Example: if everyone reads and writes the same key, only one server (partition) is getting all the work! 😖
Most database systems can’t fix this automatically.
A simple trick:
Add a random number to the key (before or after).

Example: Instead of key "user123", use "user123-1", "user123-2", etc.


Problem: Now the data for that key is scattered across many places.
So, when you read, you must fetch from all copies and merge = slower performance.
Even distributed hashing (good hash algorithms) can't stop all hot spots.

Sometimes, you must move workloads across nodes when things are uneven.

🌟 Rebalancing Partitions
Why do we need rebalancing?

Database usage grows → need more CPU.


Data grows → need new partition.
A machine fails → others must take over.

A new machine joins → need to distribute work.

👉 Rebalancing means moving data and requests between nodes so that work is evenly shared.
Good Rebalancing Rules:

After rebalancing, load must be fair across nodes.

While rebalancing, system must stay online (still accept reads/writes).


Move as little data as possible to finish quickly and save network/disk power.

🌟 Rebalancing Strategies
1. Hash-Based Rebalancing: `hash(key) mod N`

Simple way:
Hash the key and then use mod N (N = number of nodes).
Example: 10 nodes → hash(key) % 10 → gets a number 0-9 → picks node.
PROBLEM:
If N changes (node added/removed), almost all keys move to different nodes!
❌ This makes rebalancing very expensive.
2. Fixed Number of Partitions

Make more partitions than nodes.


Example: 20 partitions and 4 nodes → 5 partitions per node.

When adding a node:


Steal a few partitions from existing nodes.
Only whole partitions move, not individual keys.
Good because:
Data movement is limited.
Simple to operate.

Problem:

If dataset size changes a lot, the partition size can become bad (too big or small).
Fixed partitions = not flexible.

3. Dynamic Partitioning

If a partition grows too big, split it into two.


If it shrinks too small, merge it with a neighbor.
Partitions are created automatically based on data volume.
Works for both range-based and hash-based partitioning.

Good because:
Adapts to data growth.
Each partition stays at a good size.
Problem at the start:
When DB is empty, only one partition, only one node is busy.

4. Partitioning Proportionally to Nodes

Idea: Number of partitions matches number of nodes.


Each node has a fixed number of partitions.
When adding a new node:
It splits a few existing partitions randomly.
Takes half of each split.

Good because:
Partitions stay small and balanced as dataset grows.
Needs hash-based partitioning.

🌟 Request Routing: How Clients Find the Right Node


When data is partitioned across nodes, how does a client know where to send a request?
👉 This is called Service Discovery.

Ways to Route Requests:

Approach 1: Any Node

Client talks to any node (via load balancer).


If the node has the data, great!
If not, it forwards the request to the right node.

Approach 2: Routing Tier

Client sends all requests to a special routing layer.


Routing tier decides which node handles the request and forwards it.
Routing layer only routes, does not process data.

Approach 3: Client is Smart

Client knows the partition map itself.


Connects directly to the right node, no middleman.

Approach 4: Using ZooKeeper

ZooKeeper = special tool to track cluster information.


Nodes register in ZooKeeper.
When a partition moves or a node changes, ZooKeeper notifies everyone.
Routing tier or clients stay updated using ZooKeeper.

Examples:

HBase, SolrCloud, Kafka (old version) used ZooKeeper.


Kafka newer versions use KRaft instead.
LinkedIn's Espresso uses Helix (which uses ZooKeeper inside).

Cassandra, Riak use gossip protocols (nodes chat with each other).

🎯 Summary
Topic | Key Point
Skewed Workloads | Hashing helps but cannot avoid hot spots completely.
Rebalancing Partitions | Moving data between nodes to balance load.
Hash-Based Rebalancing | Easy but expensive when adding/removing nodes.
Fixed Partitions | Move whole partitions; simple but not flexible.
Dynamic Partitioning | Split/merge partitions based on size; adapts automatically.
Partitioning Proportional to Nodes | Partitions grow/shrink with number of nodes.
Request Routing | Find the right node: any node, routing tier, smart client, or using ZooKeeper.

Would you also like me to create a simple diagram for you to make it even easier to visualize? 🎨
It would show partitions, nodes, and routing flows! 🚀
(Just say yes if you want!)

Replication
• We looked at partitioning as way of distributed data across different nodes in a cluster, such that
query or IO operations can be done across these multiple nodes supporting the performance and
throughput expectations of applications.
• Replication is a means of keeping a copy of the same data on multiple machines that are connected
via a network.
• The data and the metadata are replicated for reasons like:
1. To keep data geographically close to the users (and thus reduce latency)
2. To allow the system to continue working even if some of its parts have failed (and thus increase
availability)
3. To scale out the number of machines that can serve read queries (and thus increase read
throughput)
• This is very beneficial from a performance perspective for read-only data.
CLOUD COMPUTING
Replication (Contd)
▪ Each node that stores a copy of the dataset is called a replica.
▪ Every write needs to be processed by every replica; otherwise, the nodes will not hold the same data.
▪ The challenge is how to handle data that changes in a replicated system:
• Should there be a leader replica and if yes, how many?
• Should one use a synchronous or asynchronous propagation of the updates among the replicas?
• How to handle a failed replica if it is the follower? What if the leader failed? How does a resurrection work?
• Three popular algorithms for replicating changes between nodes:
  • Leader based or single-leader based replication
    • Synchronous replication
    • Asynchronous replication
  • Multi-leader replication
  • Leaderless replication
CLOUD COMPUTING
Replication : Leader Based Replication
• Leader based replication is also known as Leader-Follower or master-slave replication
• How do we ensure that all the data is consistent across multiple
replicas?

• One of the replicas is designated the Leader.
• When Users/clients want to write data, they must send their
requests to the leader, which first writes the new data to its
local storage.
• The other replicas are known as followers (read replicas)
• Whenever the leader writes new data to its local storage, it also sends
the data change to all of its followers as part of a replication log
• Each follower takes the log from the leader and updates its local copy of the database accordingly, by
applying all writes in the same order as they were processed on the leader.
• The client can read from anywhere (leader or the followers) but writes are accepted only by the leader
• Leader-based replication is used in some DBs & distributed message brokers like Kafka and RabbitMQ.
CLOUD COMPUTING
Replication : Single Leader Replication (Rephrased)
Leader
▪ Dedicated compute node (usually also a replica) responsible for propagating changes
▪ Also known as master or primary
▪ Accepts read and write queries
▪ Sends changes as replication logs to followers
Follower
▪ General replica
▪ Also known as slave, secondary, or hot standby
▪ Accepts only read queries and responds with data from local storage/copy
▪ Receives changes from leader(s) and updates local copy accordingly:
▪ Apply all writes in the same order as applied on the leader
CLOUD COMPUTING
Replication : Leader Based Replication : Synchronous Versus Asynchronous Replication
• In synchronous replication, the leader waits until followers have confirmed that it received the write
before reporting success to the user and before making the write visible to other clients.
E.g. the replication to follower 1 is synchronous
• In asynchronous replication, the leader sends the message to its follower(s) but doesn’t wait for a
response from the followers before answering success to the User
E.g. The replication to follower 2 is
asynchronous
Leader based replication with one synchronous & one asynchronous follower
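A toy, in-memory sketch of the same setup (threads stand in for the network; this is not how a real database is built): the leader waits for the synchronous follower's acknowledgement but not for the asynchronous one.

import threading, time

class Follower:
    def __init__(self, name, delay=0.0):
        self.name, self.delay, self.log, self.data = name, delay, [], {}

    def apply(self, entry):
        time.sleep(self.delay)          # simulated network / apply latency
        self.log.append(entry)          # applied in the same order as on the leader
        key, value = entry
        self.data[key] = value

class Leader:
    def __init__(self, sync_follower, async_follower):
        self.data, self.log = {}, []
        self.sync_follower, self.async_follower = sync_follower, async_follower

    def write(self, key, value):
        entry = (key, value)
        self.data[key] = value
        self.log.append(entry)
        self.sync_follower.apply(entry)                 # wait for follower 1's ack
        threading.Thread(target=self.async_follower.apply, args=(entry,)).start()
        return "success"                                # reported after the sync ack only

f1, f2 = Follower("follower-1"), Follower("follower-2", delay=0.5)
leader = Leader(f1, f2)
print(leader.write("x", 1))
print(f1.data, f2.data)    # follower-2 may still be empty here (replication lag)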
CLOUD COMPUTING

Replication : Leader Based Replication : Implementation of Replication Logs
1. Statement based replication - The leader logs every write request that it executes and sends that statement
log to its followers.
2. Write-ahead log (WAL) shipping - The log is an append-only sequence of bytes containing all writes to the
database. The leader writes the log to disk and sends it across the network to its followers. Then the leader
updates the data (say the DB), and the followers also process this log and make the changes to their
databases, thus building a copy of the exact same data structures as found on the leader.
3. Change data capture (CDC) based replication - Logical log replication – a sequence of records that describe the
writes to database tables at the granularity of rows. It's based on the identification, capture and delivery of the
changes made. Because the logical log is decoupled from the storage engine internals (different storage engines
use different log formats), replicas can run on different versions or even different storage engines. It's also easier
to parse for external applications. These logs are also called a logical log. A logical log for a relational database is
usually a sequence of records describing writes to database tables at the granularity of a row.
4. Trigger based replication (application layer) - A trigger lets users register custom application code that is
automatically executed when a data change (write transaction) occurs in a database system. The trigger has
the opportunity to log this change which can be read by an external process. The external process can then
apply any necessary application logic and replicate the data change to another system.
CLOUD COMPUTING
Replication : Leader Based Replication : Potential Issues and their Handling
Follower Failure: Catch-up recovery
• On its local disk, each follower keeps a log of the data changes it has received from the leader.
• If a follower crashes and is restarted, or if the network between the leader and the follower is
temporarily interrupted, the follower can recover quite easily from its log
• Follower knows the last transaction that was processed before the fault occurred. Thus, the follower
can connect to the leader and request all the data changes that occurred during the time when the
follower was disconnected.
• When the follower has applied these changes, it has caught up to the leader and can continue
receiving a stream of data changes as before
CLOUD COMPUTING
Replication : Leader Based Replication : Potential Issues and their Handling
Leader Failure - Failover
• One of the followers needs to be promoted to be the new leader
• Clients need to be reconfigured to send their writes to the new leader and the other
followers need to start consuming data changes from the new leader.
• Failover can happen manually (an administrator is notified that the leader has failed and

takes the necessary steps to make a new leader) or automatically
• Steps followed in an automatic failover process:
1. Determining that the leader has failed.
2. Choosing a new leader
3. Reconfiguring the system to use the new leader
CLOUD COMPUTING
Replication : Leader Based Replication : Potential Issues and their Handling
Replication Lag
• In Leader-based replication all writes go to the leader, but read-only queries can go to any
replica.
• This makes it attractive also for scalability and latency, in addition to fault-tolerance.
• For read-mostly workloads: have many followers and distribute the reads across those followers.
• Removes load from the leader and allows read requests to be served by nearby replicas.
• But this is only realistic for asynchronous replication; otherwise the system will not be available
• If an application reads from an asynchronous follower, it may see outdated information if the
follower has fallen behind.
• This leads to apparent inconsistencies in the database: if you run the same query on the leader
and a follower at the same time, you may get different results, because not all writes would have
been reflected in the follower.
CLOUD COMPUTING
Replication : Leader Based Replication : Potential Issues and their Handling
Replication Lag (Continued)
• This inconsistency is just temporary—if there are no writes to the database in a while, the followers
will eventually catch up and become consistent with the leader. This effect is known as eventual
consistency
• In normal operation, the delay between a write happening on the leader and the same being
reflected on a follower is known as the replication lag. This may be only a fraction of a second and
not noticeable in practice
• When the lag is large, the inconsistencies it introduces are not just a theoretical issue but a real
problem for applications.
CLOUD COMPUTING
Replication : Leader Based Replication : : Potential Issues and their Handling
Identification of Inconsistencies due to Replication Lag
Reading Your Own Writes
• Reading your own writes ensures that if the user re-reads the data, they will always see any
updates they submitted themselves.
There are different models for consistency like the one above (also called read-after-write consistency) and different
eventual consistency models which will bring in consistency.
CLOUD COMPUTING
Replication : Leader Based Replication : Replication Lag – Potential Solutions
Some possible solutions for Replication Lag:
• A simple rule: always read critical data from the leader and the rest from a follower (negates the benefit
of read scaling)
• Monitor the replication lag on followers and prevent queries on any follower with significant lag
behind the leader.
• The client can remember the timestamp of its most recent write—then the system can ensure that
the replica serving any reads for that user reflects updates at least until that timestamp
• Monotonic reads - make sure that each user always makes their reads from the same replica
• Consistent prefix reads - if a sequence of writes happen in a certain order, then anyone reading those
writes should see them appear in the same order
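A minimal sketch of the "remember the timestamp of your most recent write" idea from the list above (in-memory objects stand in for the leader and one follower):

import time

class Replica:
    def __init__(self):
        self.data = {}
        self.applied_up_to = 0.0      # timestamp of the newest write applied here

leader, follower = Replica(), Replica()
last_write_ts = {}                    # per-user timestamp of their most recent write

def write(user, key, value):
    ts = time.time()
    leader.data[key] = value
    leader.applied_up_to = ts
    last_write_ts[user] = ts          # the client/session remembers this timestamp

def read(user, key):
    ts = last_write_ts.get(user, 0.0)
    # Serve from the follower only if it has already caught up past the user's write.
    replica = follower if follower.applied_up_to >= ts else leader
    return replica.data.get(key)

write("alice", "profile", "new bio")
print(read("alice", "profile"))       # routed to the leader until the follower catches up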

Explain the above in depth for a newbie, using the same words, letters, and sentences.

Alright! Let’s take your whole text and explain it slowly, step-by-step — like you're brand new to these ideas, using the same words, sentences, and
structure you shared. I’ll keep it in-depth but still easy to understand. Let’s go through it together:

Replication (Basics First)


Partitioning: We split (divide) the data across different machines to handle more users and more data. It makes the system fast and scalable.
Replication: Now, instead of just splitting, we make copies of the same data and store them on multiple machines.

Why do we replicate?

1. To keep data close to users → reduces waiting time (latency).


2. To keep working even if machines fail → improves availability.
3. To serve more read requests by letting multiple machines help → improves read performance.

Important:
When the data never changes (read-only), replication is SUPER useful because we don’t have to worry about keeping copies in sync.

Replication (Contd) - More Details


A replica = a copy of the dataset stored on a machine.
Every write (new or updated data) must go to every replica so that all copies stay the same.
Challenges arise when data changes:
Should we have a leader replica? (One in charge?)
Should replicas get updates immediately (synchronous) or later (asynchronous)?

What if a replica fails? What if the leader fails?


How do we bring back a failed replica?

There are three main algorithms used for replicating:

1. Leader-based (Single Leader) Replication


2. Multi-leader Replication
3. Leaderless Replication

For now, we are focusing on Leader-based Replication.

Replication: Leader-Based Replication


Also called Leader-Follower or Master-Slave Replication.
How do we keep all copies consistent?
One replica is chosen as the Leader.
All write operations must go to the Leader.

Leader writes the change and sends it to Followers (other replicas).

Followers update their local copies in the same order as the leader did.

Note:

Clients (apps, users) can read from any replica (leader or follower).
Writes MUST go to the Leader.

Real Examples:

Databases like PostgreSQL, MySQL, and systems like Kafka use this method.

Replication: Single Leader Replication (Simplified)


Leader:

Special node responsible for managing changes.


Handles both reads and writes.
Sends updates to followers.

Follower:

Normal replica.
Only accepts read queries.
Copies and updates its data by following the Leader’s updates.

Synchronous vs Asynchronous Replication


Imagine:

Leader makes a change and sends it to followers.

Two ways this can happen:

1. Synchronous Replication:
Leader waits for followers to say “I got it” before it tells the user “Success”.
Slower but safer.
2. Asynchronous Replication:

Leader doesn’t wait for followers.


Sends the write and immediately says “Success” to the user.
Faster but risky (followers might fall behind).

Example:

Leader can replicate to one follower synchronously and to another asynchronously at the same time.

Replication Logs - How the Leader Communicates Changes


There are 4 methods for replication logs (a small sketch contrasting two of them follows this list):

1. Statement-based replication:
Leader records the exact statement (query) and sends it to followers.
2. WAL (Write-Ahead Log) shipping:
Leader records all changes in a byte sequence (logs first, changes later).
3. Change Data Capture (CDC):
Instead of byte streams, CDC records row-level changes — more detailed and easier for external apps to understand.
4. Trigger-based replication:
Database uses triggers (small pieces of custom code) to track changes and replicate manually.
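Below is a small sketch contrasting what a statement-based log entry and a row-level (CDC-style) change
event might look like; the dictionary fields are illustrative assumptions, not the log format of any
particular database.

```python
# Hypothetical shapes of replication log entries; all field names are illustrative.

# 1. Statement-based: the leader ships the SQL text it executed.
statement_entry = {
    "type": "statement",
    "sql": "UPDATE accounts SET balance = balance - 100 WHERE id = 42",
}

# 3. Change Data Capture / row-based: the leader ships the row-level effect,
#    which followers or external consumers can apply without re-running the SQL.
row_change_entry = {
    "type": "row_change",
    "table": "accounts",
    "key": {"id": 42},
    "before": {"balance": 500},
    "after": {"balance": 400},
}
```

A statement containing something non-deterministic (for example a call to the current time) may evaluate
differently on each follower, which is one reason row-level change events are often preferred.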

Replication: Problems and How to Fix Them

Problem 1: Follower Failure (Catch-up Recovery)
Each follower keeps a local log of what it received.
If a follower crashes or gets disconnected:
When it comes back, it knows what was the last transaction it saw.
It asks the Leader, “Hey, send me what I missed.”
After applying missed changes, it catches up.

Problem 2: Leader Failure (Failover)


If the Leader dies, we need a new Leader.
Followers must promote one of them to become the new Leader.
Clients must start sending writes to the new Leader.

Two ways to handle failover:

1. Manual failover: An admin does it manually.


2. Automatic failover (a minimal timeout-based sketch follows these steps):
Steps:
1. Detect that Leader failed.
2. Choose a new Leader.
3. Reconfigure system to point to new Leader.
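Below is a minimal sketch of those three automatic-failover steps using a heartbeat timeout. The node and
cluster objects, with their last_heartbeat and log_position attributes, are assumptions of this sketch;
real systems rely on consensus protocols and are far more careful.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a heartbeat before the leader is suspected dead

def leader_has_failed(leader, now=None):
    # Step 1: detect that the leader failed (no heartbeat for too long).
    now = time.time() if now is None else now
    return (now - leader.last_heartbeat) > HEARTBEAT_TIMEOUT

def elect_new_leader(followers):
    # Step 2: choose a new leader, here the follower with the most up-to-date log.
    return max(followers, key=lambda f: f.log_position)

def failover(cluster):
    # Step 3: reconfigure the system so clients send writes to the new leader.
    # cluster is a hypothetical object holding .leader and .followers.
    if leader_has_failed(cluster.leader):
        new_leader = elect_new_leader(cluster.followers)
        cluster.followers = [f for f in cluster.followers if f is not new_leader]
        cluster.leader = new_leader
```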

Replication Lag
Writes happen on the Leader.
Reads can happen anywhere (Leader or Followers).

For systems with lots of reading, it’s great to have many Followers and let them handle read queries.
But problem:

In asynchronous replication, followers might be out of date compared to the Leader.


If you read from a follower while it’s lagging behind, you might get old data → inconsistency.

Replication Lag (Continued)


Normally, the lag is very small (fractions of a second).

But if the lag grows big, it becomes a real problem — users might see wrong information.
However, if no more new data is written, followers will eventually catch up (this is called eventual consistency).

How to Handle Replication Lag


Solutions:

Rule: Always read important (critical) data from the Leader.


Monitor lag: Check followers; if a follower lags too much, don't use it for reads.
Client remembers: After a write, client can remember the write’s timestamp and request reads that are at least that fresh.
Monotonic Reads: Make sure the user’s queries are always served by the same follower.
Consistent Prefix Reads: If multiple changes happen in order, anyone reading them should also see them in that order.

Summary
Replication:

Makes copies of data for faster reads and high availability.


Leader-based replication is the most common method.

Big challenges are keeping replicas up-to-date and handling failures.
Solutions like synchronous replication, failover, and replication lag handling help keep the system reliable and fast.


Multi-Leader Replication
Recap: Replication is a means of keeping a copy of the same data on multiple machines that are
connected via a network. We discussed that there were three popular algorithms for replicating
changes between nodes: single-leader, multi-leader, and leaderless replication.
▪ We discussed Leader Based Replication earlier.
▪ Leader-based replication has a single bottleneck in the leader
▪ All writes must go through it. If there is a network interruption between the user and the
leader, then no writes are allowed.
▪ An alternate approach to consider is: what if we have more than one leader through which
you can do the writes?
This natural extension of the leader-based replication model, allowing more than one node to accept writes,
leads to
• Each node that processes a write must forward that data change to all the other nodes
• This approach is also known as master/master or active/active replication
• Each leader simultaneously acts as a follower to the other leaders.
This would have different use cases like:
• Multi-datacenter operation can have a leader in each datacenter. Each datacenter has a regular
leader–follower replication internally, and between datacenters, each datacenter's leader replicates
its changes to the leaders in the other datacenters.
• Clients with offline operation: every device has a local database that acts as a leader
• Collaborative editing

Multi-Leader Replication
CLOUD COMPUTING
Multi-Leader Replication Challenges
CLOUD COMPUTING
A problem with multi-leader replication is that it can lead to write conflicts. Common ways of handling
them are listed below (a minimal last-write-wins sketch follows this list):
• Conflict avoidance - ensure that all writes for a particular record go through the same leader
• Converging toward a consistent state - give each write a unique ID (e.g., a timestamp, a long
random number, a UUID, or a hash of the key and value), pick the write with the highest ID
as the winner and throw away the other writes
• Custom conflict resolution logic - write conflict resolution logic in application code; that code
may be executed on write or on read
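Below is a minimal last-write-wins sketch of the "converging toward a consistent state" approach; the
(timestamp, UUID) write IDs and the helper names are assumptions made purely for illustration.

```python
import time
import uuid

def new_write(value):
    # A unique ID per write: the timestamp orders writes, the random UUID breaks ties.
    return ((time.time(), uuid.uuid4().hex), value)

def resolve(conflicting_writes):
    # Pick the write with the highest ID as the winner; the other writes are discarded.
    _, winner_value = max(conflicting_writes, key=lambda write: write[0])
    return winner_value

# Two leaders accepted different values for the same record:
w1 = new_write("title = Draft A")
w2 = new_write("title = Draft B")
print(resolve([w1, w2]))   # the value of whichever write carries the higher ID
```

Note that last write wins silently throws away the losing writes, so it is only appropriate when lost
updates can be tolerated.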
CLOUD COMPUTING
Single-Leader vs. Multi-Leader Replication
Performance
• In a single-leader configuration, every write must go over the internet to the datacenter with the
leader. This can add significant latency to writes and might defeat the purpose of having multiple
datacenters in the first place.
• In a multi-leader configuration, every write can be processed in the local datacenter and is replicated
asynchronously to the other datacenters. Thus, the inter-datacenter network delay is hidden from
users, which means the perceived performance may be better.
Tolerance of datacenter outages
• In a single-leader configuration, if the datacenter with the leader fails, failover can promote a
follower in another datacenter to be leader.
• In a multi-leader configuration, each datacenter can continue operating independently of the others,
and replication catches up when the failed datacenter comes back online.
CLOUD COMPUTING
Single-Leader vs. Multi-Leader Replication
Tolerance of network problems
• Traffic between datacenters usually goes over the public internet, which may be less reliable than
the local network within a datacenter.
• A single-leader configuration is very sensitive to problems in this inter-datacenter link, because

writes are made synchronously over this link.
• A multi-leader configuration with asynchronous replication can usually tolerate network problems
better: a temporary network interruption does not prevent writes being processed.
CLOUD COMPUTING
Multi-Leader Replication Topologies
• A replication topology describes the communication paths along which writes are propagated from
one node to another.
• Circular topology- each node receives writes from one node and forwards those writes (plus any
writes of its own) to one other node
• Star topology - one designated root node forwards writes to all of the other nodes.
• All-to-all topology - allows messages to travel along different paths, avoiding a single point of failure


🌟 Multi-Leader Replication (Explained for Newbie)


Recap: What is Replication?
Replication is a way of keeping a copy of the same data on multiple machines that are connected using a network.
We already discussed that there are three popular algorithms for replicating changes between nodes:

Single-leader replication

Multi-leader replication
Leaderless replication

Earlier, we learned about Leader-Based Replication (also called Single-Leader Replication).


In Single-Leader Replication:

Only one leader handles all the writes.
All changes (writes) must go through that one leader.

But there are problems with that!

Problem with Single-Leader Replication


There is a single bottleneck: All writes must go to that one leader.

If there is a network interruption between the user and the leader, no writes are allowed.
(Meaning: users cannot save or update data until the connection is fixed.)

Solution: What if we have MORE than one leader?


What if more than one node is allowed to accept writes?
This idea leads to Multi-Leader Replication.
In Multi-Leader Replication:

Each node that processes a write must forward that data change to all the other nodes.
Each leader also acts like a follower to the other leaders.

This type of replication is also called:

Master/Master Replication
Active/Active Replication

📚 Use Cases for Multi-Leader Replication


Where can we use it?

1. Multi-Datacenter Operations

Each datacenter (location) has its own leader.
Inside the datacenter: Normal leader-follower replication.
Between datacenters: Leaders share updates with each other.
2. Clients with Offline Operation
Every device (like a mobile app) has its own local database.
It acts like a leader.
When the device connects to the internet again, it shares its updates.
3. Collaborative Editing

Example: Google Docs where multiple users edit at the same time.
Each person’s changes are written locally and then synced with others.

🌩️ Multi-Leader Replication Challenges


There are problems too!
Main Problem: Write Conflicts

What if two leaders change the same record at the same time?
How do we know which change is correct?

Ways to solve this:

1. Conflict Avoidance
Make sure that all writes for a particular record go through the same leader.
2. Converging Toward a Consistent State
Give each write a unique ID (e.g., a timestamp, random number, UUID, etc.)
Pick the highest ID as the winner.
Throw away the others.

3. Custom Conflict Resolution Logic
Write your own logic inside your application.
You can decide how to handle conflicts either when writing the data or when reading the data.

🥊 Single-Leader vs. Multi-Leader Replication


Performance

Single-Leader:
Every write has to travel across the internet to the leader’s datacenter.
This can make writes slow because of internet delay.
Multi-Leader:
Writes can be processed locally in the nearest datacenter.
Changes are replicated asynchronously to other datacenters.

Users see faster performance because they don’t wait for the internet.

Tolerance of Datacenter Outages

Single-Leader:

If the datacenter with the leader fails, you must promote a new leader in another datacenter.
Multi-Leader:
Each datacenter can continue working independently even if others fail.
Replication will catch up later when the failed datacenter comes back.

Tolerance of Network Problems

Traffic between datacenters usually goes over the public internet, which is less reliable.

Single-Leader:
Very sensitive to network problems because writes are made synchronously.
Multi-Leader:
Better tolerance: Even if the network goes down temporarily, local writes can still happen.
Users won't be stuck waiting.

🌐 Multi-Leader Replication Topologies


A topology means the way nodes (computers) are connected and how they send updates to each other.
Different types:

1. Circular Topology:
Each node receives writes from one node.
It forwards those writes (plus its own) to another node.
Like a ring.
2. Star Topology:
One root node forwards all writes to all other nodes.

Like a hub with spokes.


3. All-to-All Topology:
Every node talks to every other node.
Messages can travel along different paths, avoiding a single point of failure.

✨ Summary for Newbie


Replication = copying data across machines.
Multi-Leader Replication = many leaders can accept writes.

Good for: offline work, multiple datacenters, collaborative apps.
Challenges = handling conflicts when two leaders change the same data.
Topologies = how nodes are connected to forward changes (Ring, Star, All-to-All).

Would you like me to also draw a simple diagram to show Circular, Star, and All-to-All topology? 📈
It'll make it even easier to understand visually!🌟

Replication (Recap)
• We looked at Replication as a means of keeping a copy of the same data on multiple machines that are
connected via a network.
• The data and the metadata are replicated for reasons like:
1. To keep data geographically close to the users (and thus reduce latency)
2. To allow the system to continue working even if some of its parts have failed (and thus increase
availability)
3. To scale out the number of machines that can serve read queries (and thus increase read throughput)
• We discussed that every write needs to be processed by every replica; otherwise, the nodes will not hold the
same data.
• We discussed on the two of the three popular algorithms for replicating changes between nodes:
• Leader based or single-leader based replication
• Synchronous Replication
• Asynchronous Replication
• Multi-leader
• Leaderless replication
CLOUD COMPUTING
Replication (Recap – 2)
▪ As part of the Leader based replication, we discussed one of the replicas would be designated as a leader
and typically all Users/clients wanting to write data to storage, would send their requests to the leader,
which first writes the new data to its local storage. The leader then sends the data change to all of the other
replicas known as followers using some kind of replication log. These followers are also called read replicas.
▪ We also discussed potential issues like leader failure, follower failure and replication lag, which is
the lag between the write happening on the leader and getting replicated on the follower; the followers
eventually get updated, leading to eventual consistency across the nodes. We also discussed simple
techniques which were used to address the same.
▪ We also saw that a single leader was a bottleneck for the environment and looked at having multiple
leaders. We saw the advantages it would provide in scenarios like multiple datacenters, the challenges it
would bring in (like synchronization), and a few simple approaches which could be used to address the same.
▪ We also contrasted the single-leader and multi-leader replications in terms of performance, tolerance to
datacenter failures and network failures.
CLOUD COMPUTING
Leaderless Replication
• In contrast to the Single Leader and Multileader replication strategies, Leaderless replication adopts a
philosophy that Nodes in the leaderless setting are considered peers
• All the nodes accept writes & reads from the client.
• Without the bottleneck of a leader that handles all write requests, leaderless replication offers better
availability.
• Leaderless replication is an architecture where every write must be sent to every replica.
• A write is considered successful when it is acknowledged by (a quorum of) at least k out of n
replicas, and a read is considered successful in reading a particular value when (a quorum of) at least k
out of n replicas agree on that value
• It has the advantage of parallel writes
CLOUD COMPUTING
Leaderless Replication
Writes with Leaderless Replication
• Upon a write, the client broadcasts the request to all replicas instead of a special node (the leader)
and waits for a certain number of ACKs.
• The write is completed and considered successful when it is acknowledged by at least k out of n
replicas (e.g. in the figure, when two of the three replicas send an ACK to the client).
CLOUD COMPUTING
Leaderless Replication
Reads with Leaderless Replication
• Upon read, the client contacts all replicas & waits for some number of responses.
• In order to be considered successful in reading a particular value, at least k out of n reads must agree on a
value. E.g read request is completed when two out of the three replicas returns the latest version.
• This approach of the client waiting for many (a quorum of) responses is known as a quorum
read/write, and the value k is known as the quorum.
• The k value when you are reading is known as the read quorum and the k value when you
are writing is known as the write quorum
• This quorum can be configured in a
way to provide consistency to the
copies.
CLOUD COMPUTING
Leaderless Replication
Reads with Leaderless Replication (another example)
CLOUD COMPUTING
Leaderless Replication : Potential Pitfalls - 1
Writing to the Database When a Node Is Down
• Consider three replicas with one of the replicas being currently unavailable.
• The client sends the write to all three replicas in parallel, and the two available replicas accept the write but the
unavailable replica misses it.
• We consider the write to be successful since two out of three replicas have acknowledged the write
• The client simply ignores the fact that one of the replicas missed the write
• Suppose the unavailable node comes back online and clients start reading from it. Any writes that happened while the
node was down will be missing from that node. Thus, stale (outdated) values may be read from that node as response.
An approach to address the same
• To solve this problem, read requests are also sent to several nodes in parallel.
• The client may get different responses from different nodes; i.e., the up-to-date value from one node and a stale value
from another.
• Version numbers can be used to determine which value is newer

CLOUD COMPUTING
Leaderless Replication – Pitfalls - 2
How does a node catch up on the writes that it missed?
Read repair
• When a client makes a read from several nodes in parallel, it
can detect any stale responses. User 2345 gets a version 6 value
from replica 3 and a version 7 value from replicas 1 and 2.
• The client sees that replica 3 has a stale value and writes the
newer value back to that replica.
Anti-entropy process
• Some datastores have a background process that constantly looks for differences in the data between
replicas and copies any missing data from one replica to another
• There are approaches where agents running on the replicas periodically compare their states and
actively propagate the missing data (using a gossip-style approach). A minimal read-repair sketch follows.
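Below is a minimal read-repair sketch, assuming hypothetical replica objects whose get() returns a
(version, value) pair and whose put() overwrites it; the version numbers decide which response is newest,
as described above.

```python
def quorum_read_with_repair(replicas, key):
    """Read from all replicas, return the newest value, and repair stale replicas."""
    # replicas are hypothetical objects exposing get(key) -> (version, value) and
    # put(key, version, value); this is an illustration, not a real client library.
    responses = {replica: replica.get(key) for replica in replicas}
    latest_version, latest_value = max(responses.values(), key=lambda vv: vv[0])

    # Repair any replica that is behind, e.g. a replica holding version 6
    # while the others already hold version 7.
    for replica, (version, _) in responses.items():
        if version < latest_version:
            replica.put(key, latest_version, latest_value)

    return latest_value
```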
CLOUD COMPUTING
Leaderless Replication : Pitfalls - 3
Detecting Concurrent Writes
• Databases allow several clients to concurrently write to the same key, which means that conflicts will occur
even if strict quorums are used
• Events may arrive in a different order at different nodes, due to variable network delays and partial failures
• If each node simply overwrote the value for a key whenever it received a write request from a client, the
nodes would become permanently inconsistent
• In order to become eventually consistent, the
replicas should converge toward the same value
Approaches to Address this
• Timestamps can be used to resolve these write conflicts. This could lead to additional challenges
as clocks may not be synchronized.
CLOUD COMPUTING
Leaderless Replication : Pitfalls - 3
Detecting Concurrent Writes (Cont.)
• Last write wins (discarding concurrent writes) - One approach for achieving eventual convergence is to declare
that each replica need only store the most “recent” value and allow “older” values to be overwritten and
discarded.
• The “happens-before” relationship and concurrency - How do we decide whether two operations are

concurrent or not?
• An operation A happens before another operation B if B knows about A, or depends on A, or builds upon A
in some way. Whether one operation happens before another operation is the key to defining what
concurrency means.
• Server can determine whether two operations are concurrent by looking at the version numbers
• In case of multiple replicas, we need to use a version number per replica as well as per key stored in a
version vector.
(cont.)
CLOUD COMPUTING
Leaderless Replication : Pitfalls - 3
Algorithm for determining whether two operations are concurrent by looking at the version numbers (a
minimal single-node sketch follows these steps)
• The server maintains a version number for every key, increments the version number every time that key is
written, and stores the new version number along with the value written.
• When a client reads a key, the server returns all values that have not been overwritten, as well as the latest
version number. A client must read a key before writing.
• When a client writes a key, it must include the version number from the prior read, and it must merge together
all values that it received in the prior read. (The response from a write request can be like a read, returning all
current values, which allows to chain several writes.)
• When the server receives a write with a particular version number, it can overwrite all values with that version
number or below (since it knows that they have been merged into the new value), but it must keep all values
with a higher version number (because those values are concurrent with the incoming write).
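Below is a minimal single-node sketch of the version-number algorithm above; the class and method names
are illustrative assumptions, not a real datastore API, and concurrent sibling values are kept per version
exactly as the steps describe.

```python
class VersionedStore:
    """Per-key version numbers with concurrent sibling values (illustrative sketch)."""

    def __init__(self):
        # key -> (current_version, {version_written: value})
        self.data = {}

    def read(self, key):
        version, siblings = self.data.get(key, (0, {}))
        # Return every value that has not been overwritten, plus the latest version.
        return version, list(siblings.values())

    def write(self, key, prior_version, merged_value):
        version, siblings = self.data.get(key, (0, {}))
        new_version = version + 1
        # Values written at or below prior_version were merged by the client and can
        # be overwritten; values with a higher version are concurrent and are kept.
        siblings = {v: val for v, val in siblings.items() if v > prior_version}
        siblings[new_version] = merged_value
        self.data[key] = (new_version, siblings)
        return new_version

# Usage: a client must read before writing and pass along the version it saw.
store = VersionedStore()
version, siblings = store.read("cart")              # (0, [])
version = store.write("cart", version, ["milk"])    # first write -> version 1
version, siblings = store.read("cart")              # (1, [["milk"]])
merged = [item for value in siblings for item in value]
store.write("cart", version, merged + ["eggs"])     # merges and overwrites version 1
```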
CLOUD COMPUTING
Leaderless Replication : Quorum Consistency
A common choice is to make n an odd number (typically 3 or 5) and to set w = r = (n + 1) / 2, so that
w + r > n and every read quorum overlaps every write quorum in at least one replica (a minimal sketch follows).
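A minimal sketch of this quorum rule, showing that w = r = (n + 1) / 2 satisfies w + r > n; the helper
function is illustrative, n, w and r are just configuration values.

```python
def quorum_config(n):
    """Return (write quorum, read quorum) for n replicas using the common choice above."""
    assert n % 2 == 1, "n is typically an odd number (3 or 5)"
    w = r = (n + 1) // 2
    # With w + r > n, every read quorum overlaps every write quorum in at least one
    # replica, so a read sees at least one copy of the latest acknowledged write.
    assert w + r > n
    return w, r

print(quorum_config(3))   # (2, 2)
print(quorum_config(5))   # (3, 3)
```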
CLOUD COMPUTING
Leaderless Replication
Monitoring staleness
• For leader-based replication, the database typically exposes metrics for the replication lag. By
subtracting a follower’s current position from the leader’s current position, you can measure the
amount of replication lag.
• For leaderless replication, there is no fixed order in which writes are applied, which makes monitoring
more difficult
CLOUD COMPUTING
Leaderless Replication
Multi-datacenter operation

• Leaderless replication is also suitable for multi-datacenter operation, since it is designed to tolerate conflicting
concurrent writes, network interruptions and latency spikes.
• The number of replicas n includes nodes in all datacenters, and the number of replicas you want to have in
each datacenter can be configured.
• Each write from a client is sent to all replicas, regardless of datacenter, but the client usually only waits for
acknowledgment from a quorum of nodes within its local datacenter so that it is unaffected by delays and
interruptions on the cross-datacenter link.
• The higher-latency writes to other datacenters are often configured to happen asynchronously.


🧠 Replication (Recap)
Replication is copying the same data to multiple machines connected over a network.
Why do we replicate data?
Three reasons:

1. Reduce latency ➔ Keep data closer to users so they get faster responses.
2. Increase availability ➔ If one machine fails, others still have the data and system keeps working.
3. Increase read throughput ➔ More machines can answer read queries at the same time.

Important thing: Every write (change in data) must be processed by every replica, or else the copies will not match.

📘 Popular Algorithms for Replication


There are three types:

1. Single-Leader (Leader-based) Replication

2. Multi-Leader Replication
3. Leaderless Replication

Let's go step-by-step.

🔹 Leader-based (Single Leader) Replication


One machine is chosen as the Leader.
All clients/users send their write requests to the leader.
The leader first saves the data on its own storage.
Then, it sends changes to followers (also called read replicas).

👉 Problems that can happen:


Leader failure (what if leader crashes?).
Follower failure (what if a follower can't get updates?).

Replication lag (time delay between leader writing and followers copying).

👉 We discussed some simple solutions for these problems.


Biggest issue:

Single leader becomes a bottleneck (too much load on one machine).

🔹 Multi-Leader Replication
More than one machine acts as a leader.
Multiple leaders can accept writes at the same time.
Leaders share their changes with other leaders.

👉 Use cases:
Multi-Datacenter Operation ➔ Each datacenter has a local leader.
Clients with offline operation ➔ Devices store data locally and sync later.
Collaborative editing ➔ Many people edit a document together.

👉 Challenges:
Write Conflicts ➔ Two leaders might get different updates for the same record.

How to solve conflicts?

Conflict avoidance ➔ Make sure all writes for one record go to the same leader.
Converging to a consistent state ➔ Use unique IDs (timestamps, UUIDs, etc.) and pick the "winner" write.
Custom conflict resolution ➔ Write custom code to handle conflicts.

🔹 Leaderless Replication
Here, no leader at all!

All nodes are equal (peers).


All nodes can accept writes and reads directly from clients.
No single point that could become a bottleneck.

✍ Writes in Leaderless Replication


Client sends write to all replicas.
Client waits for k replicas (quorum) to send ACK (acknowledgement).
Example:
3 replicas (n=3).
Need 2 out of 3 ACKs (k=2).

📖 Reads in Leaderless Replication
Client contacts all replicas to read.
Waits for k replicas to agree on the value.
Quorum = the minimum number of responses needed.

👉 Write quorum = k value during writes.


👉 Read quorum = k value during reads.
⚡ Leaderless Replication: Challenges (Pitfalls)
1. Node is Down During Write
If one replica is down, it misses the write.
When it comes back, it has old data (stale data).

Solution:

During reads, contact multiple replicas and use version numbers to find the newest data.

2. How does a Node Catch Up?


Two solutions:

Read Repair ➔ During a read, if a node has old data, the client fixes it by writing the latest data back.
Anti-Entropy Process ➔ Background processes compare and fix differences automatically.

(Example: like "gossip" where nodes share updates.)

3. Detecting Concurrent Writes

Concurrent writes can happen when two clients update the same data at the same time.
Different nodes might see updates in a different order.
If nodes just overwrite without thinking, data would become inconsistent.

Solution:

Use timestamps to decide which write is newer.


Handle happens-before relationships using version numbers.
Version vectors (track version per replica) can help understand which writes are concurrent.

✅ Leaderless Replication: Quorum Consistency


Choose an odd number of replicas (example: 3 or 5).
Set quorum sizes so that:
Write quorum (w) + Read quorum (r) > number of replicas (n).

Example:

n=3
w=2
r=2

Thus, w + r = 4 > 3 ➔ Good consistency.

📈 Monitoring Staleness
In leader-based:

Measure replication lag easily.

In leaderless:

No fixed order of writes.


Harder to monitor.

🌍 Multi-Datacenter Operation (Leaderless)


Leaderless works well even across different datacenters.
Writes are sent to all replicas, even in different locations.
Client only waits for quorum from its local datacenter.
Writes to far-away datacenters are asynchronous (don't block the client).

🌟 Summary for Newbie


Type            Who handles writes?    Advantage           Challenge
Single Leader   One leader             Simple              Bottleneck
Multi-Leader    Multiple leaders       Fast, distributed   Write conflicts
Leaderless      Every node             Highly available    Harder consistency


CLOUD COMPUTING
Consistency and Eventual Consistency
▪ The term consistency refers to the consistency of the values in different copies of the same data item
in a replicated distributed system.
▪ This consistency can be lost when there is a network issue or there is a difference in the time taken
to write into different copies of the same data.
▪ A consistency model is a contract between a distributed data store and processes, in which the
processes agree to obey certain rules and, in return, the store promises to work correctly.
▪ There are applications and environments which need strong consistency i.e. values across copies to
be the same, which will need to use different mechanisms to achieve the same. These mechanisms
can lead to transactions slowing down and hence having an impact on the performance.
▪ A consistency model basically refers to the degree of consistency that should be maintained for the
shared memory data
CLOUD COMPUTING
Consistency Models – Consistency Guarantees
▪ Eventual consistency is designing systems that eventually guarantee the copies of data will become
consistent once all the current operations have been processed, even though the copies do not always
have to be identical at every moment. This provides improved performance.
• Most replicated databases provide eventual consistency, which means that if you stop
writing to the database and wait for some unspecified length of time, then eventually all
read requests will return the same value. That is, all replicas will eventually converge to the
same value
• This is a very weak guarantee as it doesn’t say anything about when the replicas will
converge
• The edge cases of eventual consistency only become apparent when there is a fault in the
system or at high concurrency.
CLOUD COMPUTING
Consistency Models – Different Types
Sequential Consistency (Lamport)
▪ A shared-memory system is said to support the sequential consistency model if all processes see the
same order of all memory access operations on the shared memory.
▪ The exact order in which the memory access operations are interleaved does not matter. If one process
sees one of the orderings of the three operations and another process sees a different one, the memory
is not a sequentially consistent memory.
(Figure: valid and invalid interleavings of Instr 1, Instr 2 and Instr 3 as seen by different processes.)
▪ Conceptually there is one global memory and a switch that connects an arbitrary processor
to the memory at any time. Each processor issues memory operations in program order and
the switch provides the global serialization among all the processors.

▪ Therefore: Sequential consistency is not deterministic because multiple executions of the
distributed program might lead to a different order of operations.
CLOUD COMPUTING
Consistency Models – Different Types
Causal Consistency
▪ Two events are causally related if one can influence the other
▪ Relaxes the requirement of the sequential model for better concurrency. Unlike the
sequential consistency model, in the causal consistency model, all processes see only those
memory reference operations in the same (correct) order that are potentially causally
related.
▪ Memory reference operations that are not potentially causally related may be seen by
different processes in different orders.
PRAM (Pipelined Random-Access Memory) consistency
▪ It ensures that all write operations performed by a single process are seen by all other
processes in the order in which they were performed as if all the write operations
performed by a single process are in a pipeline.
▪ Write operations performed by different processes may be seen by different processes in
different orders
CLOUD COMPUTING
Consistency Models – Different Types
Strict Consistency (also called Strong consistency or Linearizability)
▪ A shared-memory system is said to support the strict consistency model if the value returned
by a read operation on a memory address is always the same as the value written by the
most recent write operation to that address, irrespective of the locations of the processes
performing the read and write operations
▪ Like sequential consistency, but the execution order of programs between processors must
be the order in which those operations were issued.
▪ Therefore: If each program in each processor is deterministic, then the distributed program
is deterministic.
CLOUD COMPUTING
Linearizability
• Basic idea is to make a system appear as if there were only one copy of the data, and all
operations on it are atomic.
• With this guarantee, even though there may be multiple replicas in reality, the application
does not need to worry about them.

• Also known as atomic consistency, strong consistency, immediate consistency or external
consistency
• Linearizability is a recency guarantee: a read is guaranteed to see the latest value written.
• In a linearizable system, as soon as one client successfully completes a write, all clients
reading from the database must be able to see the value just written.
• Maintaining the illusion of a single copy of the data means guaranteeing that the value read
is the most recent, up-to-date value and doesn’t come from a stale cache or replica
CLOUD COMPUTING
Linearizability
• If a read request is concurrent
with a write request, it may
return either the old or the
new value
• After any one read has returned the new value, all following reads (on the same or other
clients) must also return the
new value.
CLOUD COMPUTING
Linearizability
Compare and Set (cas)
• Add a third type of operation besides read and write
• cas(x, vold, vnew) ⇒ r means the client requested an atomic compare-and-set operation. If
the current value of the register x equals vold, it should be atomically set to vnew. If x ≠ vold
then the operation should leave the register unchanged and return an error. r is the
database’s response (ok or error). A minimal sketch of such an operation follows.
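Below is a minimal in-memory sketch of such a compare-and-set register; a local lock stands in for the
atomicity that a real linearizable datastore would have to provide across its replicas.

```python
import threading

class Register:
    """Single-process compare-and-set register (illustrative sketch, not a datastore)."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()   # makes the compare-and-set atomic in this sketch

    def cas(self, x, v_old, v_new):
        with self._lock:
            if self._values.get(x) == v_old:
                self._values[x] = v_new
                return "ok"
            return "error"              # x != v_old: leave the register unchanged

reg = Register()
print(reg.cas("x", None, 1))   # ok    (x was unset)
print(reg.cas("x", 1, 2))      # ok    (current value matched vold)
print(reg.cas("x", 1, 3))      # error (current value is 2, not 1; register unchanged)
```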
CLOUD COMPUTING
Linearizability
Visualizing the points in time at which the reads and writes appear to have taken effect
• The final read by B is not linearizable
• It is possible to test whether a system’s behavior is linearizable by recording the timings of all
requests and responses and checking whether they can be arranged into a valid sequential order
(a brute-force sketch of this idea follows)
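Below is a brute-force sketch of that idea for a single register: each operation is recorded as a
(start, end, kind, value) tuple and we search for an ordering that respects both real time and register
semantics. The tuple format is an assumption of this sketch; real linearizability checkers use much
smarter algorithms than trying every permutation.

```python
from itertools import permutations

def is_linearizable(history, initial=None):
    """history: list of (start, end, kind, value) operations on one register."""
    for order in permutations(history):
        # Real-time constraint: reject orderings where a later-placed operation
        # actually finished before an earlier-placed one started.
        if any(order[j][1] < order[i][0]
               for i in range(len(order)) for j in range(i + 1, len(order))):
            continue
        # Register semantics: every read must return the latest written value.
        value, valid = initial, True
        for _, _, kind, v in order:
            if kind == "write":
                value = v
            elif v != value:
                valid = False
                break
        if valid:
            return True
    return False

# Client A writes 1 while client B performs two reads; B's second read returns the
# old value after its first read already saw the new one -> not linearizable.
history = [
    (0, 5, "write", 1),
    (1, 2, "read", 1),
    (3, 4, "read", 0),
]
print(is_linearizable(history, initial=0))   # False
```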
CLOUD COMPUTING
Linearizability

Single-leader replication (potentially linearizable)
• If you make reads from the leader or from synchronously updated followers, they have the
potential to be linearizable
Consensus algorithms (linearizable)
• Consensus protocols contain measures to prevent split brain and stale replicas.
• Consensus algorithms can implement linearizable storage safely
• Example: ZooKeeper
Multi-leader replication (not linearizable)
Leaderless replication (probably not linearizable)


☁️ CLOUD COMPUTING
Consistency and Eventual Consistency (Simple Explanation)

🧠 What is Consistency?
Consistency means:
➔ When you have multiple copies of the same data (saved on different machines), all copies should have the same value.
Example:
➔ Imagine you have a file saved in Mumbai, London, and New York.
➔ If you change the file in Mumbai, the copy in London and New York should also soon have the same update.
➔ This matching between copies is called consistency.

❌ How Can Consistency Break?


If there is a network problem or

If it takes different amounts of time to write to different copies,


Then the copies may not match immediately.

📜 What is a Consistency Model?


A consistency model is like a promise between: ➔ the database (or storage system)
➔ and the apps or users using it.
It defines rules about: ➔ How and when the copies should match.
➔ How apps should behave to make things smooth.

🔥 Strong Consistency vs Performance


Some apps need strong consistency.
➔ Example: Banking apps, stock trading apps.
Strong consistency is hard to achieve and can slow down performance.
➔ Because every write must reach everywhere immediately before you can continue.

☁️ Consistency Models – Consistency Guarantees


🌟 What is Eventual Consistency?
Eventual Consistency means:
➔ Right after a write, the copies may not match.
➔ But after some time, once all the updates are done, the copies will match.
It improves performance because it doesn't wait for every copy to update before continuing.

Important about Eventual Consistency:

Most databases today use eventual consistency.


When will all the copies match?
➔ No one knows exactly. It’s "eventual" (later).
Problems mostly show up when: ➔ There is a system fault (server crash, network issues)
➔ There is a lot of concurrency (many users updating at once).

☁️ Consistency Models – Different Types


1️⃣ Sequential Consistency (Lamport)
Idea:
➔ All users see memory operations (reads/writes) in the same order.
Example:
➔ Imagine instructions 1, 2, 3 happen.
➔ All users must see some order like 1-2-3, or maybe 2-3-1,
but everyone sees the SAME order.
Key Point:
➔ The order doesn't have to match the real-time order.
➔ But all users must agree on whatever order they see.

2️⃣ Causal Consistency


Idea:
➔ If one action can affect another, they are causally related.
Causal Consistency says: ➔ You must keep the order only for causally related events. ➔ Other independent events can be seen in different orders.

Better concurrency because it relaxes strict rules compared to sequential consistency.

3️⃣ PRAM Consistency (Pipelined RAM)


Idea:
➔ All writes by one process must appear in order.
➔ Different processes’ writes can appear in different orders.
Example:
➔ If I (Process A) wrote X then Y, everyone must see X then Y. ➔ But Process B's writes can be seen differently by others.

4️⃣ Strict Consistency (also called Strong Consistency)


Idea:
➔ As soon as a write happens, all readers must see that new value immediately.
No delays allowed.
Very strict, but also very slow.
Perfect for banking systems where you can't afford any mismatch.

☁️ Linearizability
🧩 What is Linearizability?
Linearizability is like an even stricter version of consistency.
It makes everything look like there is only ONE copy of the data.
Even if there are many copies behind the scenes,
➔ Apps feel like there's only one, and it’s always correct and up to date.

Other names:
➔ Atomic consistency,
➔ Strong consistency,
➔ Immediate consistency,
➔ External consistency.

📚 In simple words:
After a write succeeds,
➔ All clients who read after that must see the new value.
You won't accidentally read old data.

🔀 Special Case: If a Read and Write Happen at Same Time?


It’s allowed that the read sees: ➔ Either the old value or
➔ The new value.
But once any read sees the new value,
➔ All future reads must also see the new value.

☁️ Compare and Set (cas)


cas(x, vold, vnew) means:
➔ "If the value of x is vold (old value), then change it to vnew (new value).
➔ If not, do nothing and return an error."
It is atomic. That means it happens fully or not at all.
Very important for avoiding conflicts when many users write at the same time.

☁️ Visualizing Linearizability
You can test if a system is linearizable by: ➔ Recording the time when requests and responses happen. ➔ Checking if the events can be arranged into
a perfect sequence without conflicts.

☁️ How Different Replication Models Support Linearizability


Replication Type                         Is it Linearizable?
Single-leader replication                Potentially yes (if you read from the leader)
Consensus algorithms (like ZooKeeper)    Yes, linearizable
Multi-leader replication                 No
Leaderless replication                   Probably not

🎯 Final Quick Recap


Term                    Meaning
Consistency             Keeping copies same
Eventual Consistency    Copies become same eventually
Sequential Consistency  Same order for all users
Causal Consistency      Order matters only if events are related
PRAM Consistency        Same order for single process writes
Strict Consistency      Always show latest write immediately
Linearizability         Looks like only one true copy exists

CAP Theorem
• The CAP theorem, originally introduced as the CAP principle, can be used to explain some of the
competing requirements in a distributed system with replication.
• It is a tool used to make system designers aware of the trade-offs while
designing “network shared-data systems” or A distributed system that stores
data on more than one node (physical/virtual machines) at the same time
• The three letters in CAP refer to three desirable properties of distributed
systems with replicated data:
• C -Consistency (among replicated copies)
• A -Availability (of the system for read and write operations)
• P-Partition tolerance (in the face of the nodes in the system being
partitioned by a network fault).
• The CAP theorem states that it is not possible to guarantee all three of the
desirable properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication. We can strongly support only two of the three properties
CAP Theorem Example
CLOUD COMPUTING
• Imagine a distributed system consisting of two nodes:
• The distributed system acts as a plain register with the value of
variable X.
• There’s a network failure that results in a network partition
between the two nodes in the system.
• An end-user performs a write request, and then a read request.
• Let’s examine a case where a different node of the system
processes each request. In this case, our system has two options:
• It can fail at one of the requests, breaking the system’s availability
• It can execute both requests, returning a stale value from the read request and breaking the system’s
consistency
• The system can’t process both requests successfully while also ensuring that the read returns the latest
value written by the write.
• This is because the results of the write operation can’t be propagated from node A to node B because of
the network partition.

1. Consistency (C): Consistency means that the nodes will have the same copies of a replicated data item visible for various
transactions. It’s the guarantee that every node in a distributed cluster returns the same, most recently updated value of
a successful write at any logical time.
In CAP, the term consistency refers to the consistency of the values in different copies of the same data item in a
replicated distributed system. This can be verified if all reads initiated after a successful write return the same and latest
value at any given logical time.
▪ Performing a read operation will return the value of
the most recent write operation causing all nodes
to return the same data.
▪ A system has consistency if a transaction starts with the
system in a consistent state, and ends with the system in a
consistent state (may have an inconsistent state during a transaction,
but the entire transaction gets rolled back if there is an error).
▪ In the image, we have 2 different records (“Bulbasaur” and “Pikachu”) at
different timestamps. The output on the third partition is “Pikachu”,
the latest input although it will need time to update and be Available.
CAP Theorem
CLOUD COMPUTING
CAP Theorem
CLOUD COMPUTING
2. Availability (A): Availability means that each read or write request for a data item from a client to a
node will either be processed successfully or will receive a message that the operation cannot be
completed (if the node is not in a failed state).
▪ All working nodes in the distributed
system return a valid response for any
request, without exception
▪ Achieving availability in a distributed system requires that the system remains operational
100% of the time, which may require having “x” extra servers beyond the “n” servers serving our
application
CLOUD COMPUTING
CAP Theorem
3. Partition Tolerance (P): Partition tolerance requires that a system be able to re-route a
communication when there are temporary breaks or failures in the network (network partitions).

▪ It means that the system continues to function and upholds its consistency guarantees in spite of
network partitions.
▪ Distributed systems guaranteeing partition tolerance can gracefully recover from partitions once
the partition heals.
▪ This condition states that the system continues to run, despite the number
of messages being delayed by the network between nodes.
▪ A system that is partition-tolerant can sustain
any amount of network failure that doesn’t
result in a failure of the entire network.
▪ Data records are sufficiently replicated across
combinations of nodes and networks to keep the
system up through intermittent outages.
CLOUD COMPUTING
CAP Theorem : Tradeoffs between Requirements (Illustrations)
1. Availability and Partition-Tolerant (Compromised Consistency): Say you have two nodes and the link
between the two is severed. Since both nodes are up, you can design the system to accept requests on
each of the nodes, which will make the system available despite the network being partitioned.
However, each node will issue its own results, so by providing high availability and partition tolerance
you’ll compromise consistency.
2. Consistent and Partition-Tolerant (Compromised Availability): Say you have three nodes and one
node loses its link with the other two. You can create a rule that, a result will be returned only when a
majority of nodes agree. So In-spite of having a partition, the system will return a consistent result, but
since the separated node won’t be able to reach consensus it won’t be available even though it’s up.
3. Consistent and Available (Compromised Partition-Tolerance): Although a system can be
both consistent and available, it may have to block during a partition.
CLOUD COMPUTING
CAP Theorem : Illustrative Examples of products/Application categories to choose
CLOUD COMPUTING
CAP Theorem
• The “Pick Two” expression of CAP opened the minds of designers to a wider range of
systems and tradeoffs
• CAP is NOT a choice “at all times” as to “which one of the three guarantees to abandon”.
In fact, the choice is between consistency and availability only when a network partition or
failure happens. When there is no network failure, both availability and consistency can be
satisfied

• Database products like Cassandra support AP but provide eventual consistency by allowing
clients to write to any nodes at any time and reconciling inconsistencies as quickly as
possible
• Reference: https://www.youtube.com/watch?v=9uCP3qHNbWw
CLOUD COMPUTING
CAP Theorem and some of the criticisms
• Although the Theorem doesn't specify an upper bound on response time for availability, in
practice there typically exists a timeout. CAP Theorem ignores latency, which is an important
consideration in practice. Timeouts are often implemented in services. During a partition, if we
cancel a request, we maintain consistency but forfeit availability. In fact, latency can be seen as
another word for availability.
• In NoSQL distributed databases, CAP Theorem has led to the belief that eventual consistency
provides better availability than strong consistency. It could be considered as an outdated notion.
It's better to factor in sensitivity to network delays.
• CAP Theorem suggests a binary decision. In reality, it's a continuum. It’s more of a trade-off. There
are different degrees of consistency implemented via "read your writes, monotonic reads and
causal consistency".
CLOUD COMPUTING
CAP Theorem and how do you use them when making decisions
• CAP Theorem can feel quite abstract, but both from a technical and business perspective the
trade-offs will lead to some very important questions which have practical, real-world consequences.
• These questions would be
• Is it important to avoid throwing up errors to the client?
• Or are we willing to sacrifice the visible user experience to ensure consistency?
• Is consistency an actually important part of the user’s experience
• Or can we actually do what we want with a relational database and avoid the need for partition
tolerance altogether?
• All of these are ultimately leading to user experience questions.
• This will need understanding of the overall goals of the project, and context in which your database
solution is operating. (E.g. Is it powering an internal analytics dashboard? Or is it supporting a
widely used external-facing website or application?)
CLOUD COMPUTING
CAP Theorem and its relationship to Cloud Computing
• Microservices architecture for applications is popular and prevalent on both cloud servers and on
premises data centers

• We have also seen that microservices are loosely coupled, independently deployable application
components that incorporate their own stack, including their own database and database model,
and communicate with each other over a network.
• Understanding the CAP theorem can help you choose the best database when designing a
microservices-based application running from multiple locations.
E.g. If the ability to quickly iterate the data model and scale horizontally is essential to your
application, but you can tolerate eventual (as opposed to strict) consistency, a database like
Cassandra or Apache CouchDB (supporting AP - Availability and Partitioning) can meet your
requirements and simplify your deployment. On the other hand, if your application depends
heavily on data consistency—as in an eCommerce application or a payment service—you might opt
for a relational database like PostgreSQL.
CLOUD COMPUTING
CAP Theorem Conclusion
• Distributed systems allow us to achieve a level of computing power and availability that were
simply not available in the past.
• Our systems have higher performance, lower latency, and near 100% up-time in data centers that
span the entire globe.
• These systems of today are run on commodity hardware that is easily obtainable and configurable
at affordable costs.
• The disadvantage, though, is that distributed systems are more complex than their single-network
counterparts.
• Understanding the complexity incurred in distributed systems, making the appropriate trade-offs
for the task at hand (CAP), and selecting the right tool for the job is necessary with horizontal
scaling

The CAP Theorem helps explain the challenges in designing distributed systems, especially in those with replicated data. It outlines a key tradeoff between
three properties that are crucial for distributed systems:

1. Consistency (C)

Consistency means that all nodes (servers) in the system have the same data at any given time. When a write operation occurs, every node must reflect
the same value immediately.
For example, imagine you are reading from one node while a write operation happens on another node; you will always see the most up-to-date value.

However, achieving Consistency can come at the expense of Availability or Partition Tolerance in certain cases.

2. Availability (A)

Availability ensures that every request (either read or write) will receive a response, even if one of the nodes is down.
This means that the system is always operational. If one node is unavailable, other nodes can still handle the request, ensuring that the system doesn't
stop functioning.
The downside is that if the system prioritizes Availability, it might not always provide the most up-to-date information, leading to inconsistent data at
different nodes.

3. Partition Tolerance (P)

Partition tolerance means that the system can continue functioning even if network partitions occur, meaning nodes are temporarily unable to
communicate with each other.

This is crucial because in a distributed system, network failures can happen, and the system needs to be able to handle these situations without failing.
A system that supports Partition Tolerance ensures that it can keep functioning even when part of the network is unavailable, but this could affect
Consistency or Availability.

CAP Theorem: The Core Idea

The theorem states that it’s impossible for a distributed system to guarantee all three of these properties (Consistency, Availability, and Partition
Tolerance) at the same time.
This means, during a network partition (failure), you can choose at most two of the following:
Consistency and Availability (but without Partition Tolerance)
Consistency and Partition Tolerance (but compromised Availability)
Availability and Partition Tolerance (but compromised Consistency)

Example to Illustrate CAP:

Imagine two nodes (A and B) in a distributed system with a network partition between them.
Option 1 (Failing at a Request): The system can stop processing one of the requests (either the write or read), breaking availability.
Option 2 (Compromising Consistency): The system could process both the write and the read, but the read could return outdated data from node
B since the partition prevented synchronization, leading to inconsistent data.

Trade-Offs in CAP:

Availability and Partition Tolerance (compromising Consistency): This choice might mean that the system keeps operating, even when the data on
the nodes is not synchronized. An example of this is Cassandra, which prioritizes Availability and Partition Tolerance but allows for eventual
consistency.
Consistency and Partition Tolerance (compromising Availability): This choice prioritizes consistent data even when parts of the network are down.
For example, some systems ensure that data across all nodes is consistent but might not be available during network failures.
Consistency and Availability (compromising Partition Tolerance): This scenario is less common in distributed systems, but it can happen when
network partitions aren't a major concern.

Real-World Use Cases:

Cassandra (AP system): This NoSQL database focuses on Availability and Partition Tolerance, meaning it may not always return the most up-to-date
data, but it will always be available.
PostgreSQL (CP system): For systems like online payment processing, where Consistency and Partition Tolerance are prioritized, a relational database
like PostgreSQL might be used. It guarantees strong consistency, but in cases of network partition, it may sacrifice availability.

Criticisms of CAP Theorem:

The CAP theorem oversimplifies reality. For example, it doesn’t account for latency, which is often a significant factor in system performance.
It suggests that you must always choose two guarantees, but in practice, there are shades of gray. Systems can implement different levels of
consistency (like eventual consistency, monotonic reads, or read-your-writes).
Eventual consistency (as seen in NoSQL databases like Cassandra) doesn't always mean weak consistency; it may allow for a balance between
performance and correctness.

Key Takeaways for System Designers:

Microservices Architecture: When designing microservices, it’s crucial to understand CAP because microservices often involve distributed systems. A
system's data model will impact how databases are chosen.

Business Decisions: From a business perspective, choosing which guarantees to prioritize depends on your application. For example, an e-commerce
platform requires Consistency, while a social media platform may tolerate eventual consistency for the sake of availability.

Conclusion:

The CAP Theorem teaches system designers to understand the trade-offs when building distributed systems. It forces a decision between Consistency,
Availability, and Partition Tolerance, and these decisions should be made based on the specific needs of the application. Ultimately, no distributed system
can guarantee all three, so you must prioritize based on your requirements.

Transactions
• Transaction is an operation composed of a number of discrete steps
• All the steps must be completed for the transaction to be committed and its results made
permanent; otherwise the transaction is aborted and the state of the system reverts to what it was before
the transaction started. E.g. buying a house
• Basic Operations
• Transaction primitives:
• Begin transaction: mark the start of a transaction
• End transaction: mark the end of a transaction; try to commit
• Abort transaction: kill the transaction, restore old values
• Read/write data from files (or object stores): data will have to be restored if the transaction is
aborted.
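As an illustration of these primitives, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and the transfer rule are made up for the example, but the begin/commit/abort pattern is the same on any transactional store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()                                   # end transaction: make results permanent

def transfer(amount):
    try:
        # begin transaction (sqlite3 opens one implicitly on the first write)
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
        (alice_bal,) = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()
        if alice_bal < 0:
            raise ValueError("insufficient funds")
        conn.commit()                           # commit: both updates become visible together
    except Exception:
        conn.rollback()                         # abort: restore the old values

transfer(30)    # commits; balances become 70 / 80 (the total of 150 is preserved)
transfer(500)   # aborts; balances stay unchanged
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
```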
CLOUD COMPUTING
Transactions : Properties of Transactions (ACID)
ACID :
▪ Atomic
• The transaction happens as a single indivisible action. Others do not see intermediate results. All
or nothing.
▪ Consistent
• If the system has invariants, they must hold after the transaction. E.g., total amount of money in
all accounts must be the same before and after a “transfer funds” transaction.
▪ Isolated (Serializable)
• If transactions run at the same time, the final result must be the same as if they executed in some

serial order.
▪ Durable
• Once a transaction commits, the results are made permanent. No failures after a commit will
cause the results to revert.
CLOUD COMPUTING
Transactions and different types of transactions :
Nested Transactions
• Nested Transaction is a top-level transaction which may create sub-transactions
• Problem:
• Sub-transactions may commit (results are durable) but the parent transaction may abort.
• One solution : private workspace
• Each sub-transaction is given a private copy of every object it manipulates. On commit, the private copy
displaces the parent’s copy (which may also be a private copy of the parent’s parent)
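A toy illustration of the private-workspace idea, in plain Python (real systems track dirty objects and locks far more carefully): each sub-transaction copies the data it touches, a commit makes the copy displace the parent's copy, and an abort simply discards it.

```python
# Minimal sketch of the private-workspace approach for nested transactions.

class Workspace:
    def __init__(self, parent=None):
        self.parent = parent
        self.data = dict(parent.data) if parent else {}   # private copy of the parent's objects

    def begin_sub(self):
        return Workspace(parent=self)

    def commit(self):
        if self.parent is not None:
            self.parent.data = self.data      # private copy displaces the parent's copy

    def abort(self):
        pass                                  # private copy is simply thrown away

top = Workspace()
top.data["house"] = "reserved"

sub = top.begin_sub()                         # sub-transaction: check financing
sub.data["loan"] = "approved"
sub.commit()                                  # parent now sees the loan approval

sub2 = top.begin_sub()                        # another sub-transaction
sub2.data["survey"] = "failed"
sub2.abort()                                  # parent never sees 'survey'

print(top.data)                               # {'house': 'reserved', 'loan': 'approved'}
```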
Distributed Transactions
• A distributed transaction is a set of operations on data that is
performed across two or more data repositories or resources
across different systems
• Challenge: handle machine, software, & network failures
while preserving transaction integrity.
• There are two possible outcomes: all operations successfully
complete, or none of the operations are performed at all due to a failure somewhere in the system
• In the second outcome, if some work was completed prior to the failure, that work will be reversed to ensure no
net work was done. This type of operation is in compliance with the “ACID” principles to ensure data integrity
CLOUD COMPUTING
Distributed Transaction System Architecture
▪ A system architecture supporting distributed transactions has multiple data repositories hosted on different nodes connected by a network.
▪ A transaction may access data at several nodes/sites.
▪ Each site has a local transaction manager responsible for:
▪ Maintaining a log for recovery purposes
▪ Participating in coordinating the concurrent execution of the transactions executing at that site.
▪ The local transaction manager is responsible for the sub-transactions on that system and performs prepare, commit and abort calls for them
▪ Each sub-transaction must agree to commit changes before the transaction can complete
▪ A transaction coordinator coordinates activities across the data repositories
▪ Periodically a local transaction manager is nominated as a local coordinator, responsible for:
▪ Starting the execution of transactions that originate at the site.
▪ Distributing sub-transactions to appropriate sites for execution.
▪ Coordinating the termination of each transaction that originates at the site, which may result in the transaction being committed at all sites or aborted at all sites.
(Figure: an application talks to the TR co-ordinator, which coordinates the TR managers on each node.)
CLOUD COMPUTING
Concurrency Control
• All concurrency mechanisms must preserve data consistency and complete each atomic
action in finite time
• Important capabilities are
a) Be resilient to site and communication link failures.
b) Allow parallelism to enhance performance requirements.
c) Incur optimal cost and optimize communication delays
d) Place constraints on atomic action.
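The slide lists goals rather than mechanisms; as a minimal, non-distributed illustration of "preserve data consistency and complete each atomic action in finite time", the sketch below (plain Python threading, hypothetical balances) serializes updates with a lock and uses a timeout so a caller never blocks forever.

```python
import threading

balance = 100
lock = threading.Lock()

def withdraw(amount):
    global balance
    # acquire with a timeout: the atomic action completes (or gives up) in finite time
    if not lock.acquire(timeout=1.0):
        return "could not get the lock: give up rather than block forever"
    try:
        if balance >= amount:
            balance -= amount            # critical section: kept consistent by the lock
            return f"withdrew {amount}, balance now {balance}"
        return "insufficient funds"
    finally:
        lock.release()

threads = [threading.Thread(target=withdraw, args=(30,)) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(balance)    # 10: the three withdrawals were serialized, never interleaved mid-update
```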
CLOUD COMPUTING
Commit Protocols
• Commit protocols are used to ensure atomicity across sites
• A transaction which executes at multiple sites must either be committed at all the
sites, or aborted at all the sites.
• Not acceptable to have a transaction committed at one site and aborted at another.
• The two-phase commit (2 PC) protocol is widely used.
CLOUD COMPUTING
Two Phase Commit Protocol - Phase 1
Let's consider a transaction named "T"
1. The coordinator places a log record prepare T on the log at its site.
2. The coordinator sends the message prepare T to each component's site.
3. Each site receiving the message prepares its component of the transaction "T".
4. If a site wants to commit its component, it must enter a state called pre-committed and reply with a Ready T message. Once in the pre-committed state, the site cannot abort its component of T without a directive to do so from the coordinator.
So, after prepare T is received, the site performs whatever steps are necessary to be sure the local component of T will not have to abort.
If everything is fine and it is ready to commit, it places the record Ready T on the local log, flushes the log to disk, and sends a Ready T message to the coordinator.
If the site wants to abort its component of T, it logs the record Don't commit T and sends the message Don't commit T to the coordinator.
Two Phase Commit : Phase 1
Co-Ordinator
▪ Write prepare to commit T to the log at its site
▪ Send the prepare to commit message
▪ Wait for a reply from each related node
Related Nodes (Sites)
▪ Receive the prepare message
▪ Work on the components towards T
▪ If ready to commit, get into the pre-committed state for T, place Ready T in the local log and send the Ready T message to the coordinator (holds locks..)
▪ If not ready to commit, place Don't commit T in the local log and send the Don't commit T message to the coordinator
▪ Wait for the message from the coordinator
Two Phase Commit Protocol -Phase 2
CLOUD COMPUTING
1. If the coordinator has received Ready T from all components of T, then it decides to commit T. The coordinator logs Commit T at its site and then sends the message commit T to all sites involved in T.
2. If the coordinator has received Don't commit T from one or more sites, it logs Abort T at its site and then sends abort T messages to all sites involved in T.
3. If a site receives a commit T message, it commits the component of T at that site, releases the locks .., logging Commit T as it does.
4. If a site receives the message abort T, it aborts T, releases locks and writes the log record Abort T.
Two Phase Commit : Phase 2
Co-Ordinator
▪ If Ready T has been received from all nodes, then write Commit T to the local log and send the commit T message
▪ If Don't commit T is received from any of the related nodes, then write Abort T to the local log and send Abort T to all related nodes (sites)
▪ Wait for the Done message and clear up all states
Related Nodes (Sites)
(If in the pre-committed state, continue to hold the locks)
▪ Receive the Commit T or Abort T message
▪ If Commit T is received, commit the component of T at the site, release locks .., place Commit T in the local log and send the Done message to the co-ordinator
▪ If Abort T is received, roll back all changes, release locks .., place Abort T in the local log and send the Done message to the co-ordinator
CLOUD COMPUTING
Two Phase Commit Protocol
A successful execution of the two-phase commit protocol
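The protocol can also be summarized in a short, single-process Python sketch; the Participant and Coordinator classes below are illustrative stand-ins (no real networking, durable logging, or crash recovery), but the two phases and the all-or-nothing decision follow the steps above.

```python
# Minimal single-process sketch of two-phase commit (2PC);
# messages and logs are simulated with plain method calls and lists.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.log = []

    def prepare(self, txn):                        # Phase 1 at each site
        if self.can_commit:
            self.log.append(f"Ready {txn}")        # pre-committed: holds locks, may no longer abort unilaterally
            return "READY"
        self.log.append(f"Don't commit {txn}")
        return "DONT_COMMIT"

    def finish(self, txn, decision):               # Phase 2 at each site
        self.log.append(f"{decision} {txn}")       # Commit T or Abort T, then release locks
        return "DONE"

class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.log = []

    def run(self, txn):
        self.log.append(f"prepare {txn}")
        votes = [p.prepare(txn) for p in self.participants]          # Phase 1: collect votes
        decision = "Commit" if all(v == "READY" for v in votes) else "Abort"
        self.log.append(f"{decision} {txn}")
        acks = [p.finish(txn, decision) for p in self.participants]  # Phase 2: collect Done acks
        return decision

sites = [Participant("node1"), Participant("node2", can_commit=False)]
print(Coordinator(sites).run("T"))   # 'Abort' — one Don't-commit vote aborts T everywhere
```

With both participants voting READY the coordinator decides Commit; flipping one participant's can_commit flag, as in the example, makes the whole transaction abort at every site.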

Explain the above in depth for a newbie, using the above words, letters and sentences

Understanding Transactions and ACID Properties for a Newbie

What is a Transaction?

A transaction is essentially a sequence of operations or steps that work together to complete a specific task. For example, if you want to buy a house, there
are several steps like checking the house, confirming the price, transferring money, and getting the legal papers ready. All these steps together make up a
single transaction.
The key point is that either all the steps must be completed, or none of them should be completed. If something goes wrong in the middle, the system
will abort the transaction and return everything to how it was before the transaction started.

Transaction Primitives:

Begin transaction: Marks the start of the transaction.


End transaction: Marks the end, where the system tries to commit the transaction.

Abort transaction: Cancels the transaction and restores everything to its previous state.
Read/write data: When a transaction reads or writes data, any changes will be canceled if the transaction is aborted.

ACID Properties:

The ACID properties ensure that the transaction is reliable, consistent, and can be trusted even if something goes wrong.
1. Atomicity (A):
This means that the transaction is treated as a single unit of work.
Either all steps are completed (success), or none are. There are no partial transactions.
Think of it like buying a house. You can’t just pay the money without signing the papers. If one step fails (like not having enough money), the entire
transaction is canceled.
2. Consistency (C):
A transaction must leave the system in a valid state.
For example, when transferring money between two accounts, the total amount of money before and after the transaction should be the same. If
one account gains money, the other should lose it equally.
In simpler terms, the system should follow its rules after the transaction, meaning no errors or inconsistencies should happen.
3. Isolation (I):
If multiple transactions are running simultaneously, they should not interfere with each other.

This means if two transactions are happening at the same time (like two people trying to buy the same house), the result should be as if they
happened one after another (serially).
This ensures that one transaction’s changes do not affect another transaction in an unintended way.
4. Durability (D):
Once a transaction is committed (completed), its changes are permanent and will survive even if there is a system failure (like a power outage).
For example, once you buy a house, the agreement cannot be erased by a power failure.

Different Types of Transactions:

1. Nested Transactions:
A nested transaction is like a transaction within a transaction.
Example: You’re doing a house-buying transaction, and during this process, you may have other smaller sub-transactions (like checking if your
bank account has enough balance).

If something goes wrong in a sub-transaction, you can roll back just that sub-transaction, but the main transaction can continue or be rolled back
entirely.
A private workspace is used here where each sub-transaction works with its own copy of the data.
2. Distributed Transactions:
Distributed transactions happen when a transaction involves multiple systems or databases across different locations.
Example: You are buying a house, and this requires checking the property database, verifying bank information from another database, and
confirming the legal paperwork from yet another database.
The challenge here is that all these systems must either succeed together or fail together to maintain consistency.

Distributed Transaction System Architecture:

In distributed systems, multiple data repositories (databases or servers) are connected over a network.
Each site (server) involved in the transaction has a local transaction manager that keeps logs and coordinates the transaction's completion.

Roles in Distributed Transactions:

Transaction Coordinator: This is the entity that coordinates and manages the whole transaction across different sites. It decides when to commit or
abort the transaction.
Transaction Manager: Each site’s local manager checks if the transaction can be committed. It sends messages to the coordinator and ensures that all
local transactions are in sync.

Concurrency Control in Distributed Systems:

Concurrency control is important to ensure that multiple transactions can occur at the same time without causing issues (like two people trying to buy the
same house). It ensures data consistency and atomicity while also improving performance.

Some goals of concurrency control:

Make sure that failures do not affect the transaction process.


Allow multiple operations to happen at the same time without data issues.

Ensure that transactions can be completed in finite time.

Commit Protocols:

Commit protocols are mechanisms that help ensure transactions in distributed systems are committed correctly.
The Two-Phase Commit (2PC) protocol is widely used in distributed transactions. It guarantees that a transaction is either committed at all sites or aborted
at all sites, ensuring consistency across the system.

Phase 1:

1. Coordinator logs the prepare command: It sends a “prepare to commit” message to all participating sites (servers).
2. Sites prepare: Each site prepares the transaction and enters a state called pre-committed.
If a site is ready to commit, it sends a “Ready” message to the coordinator.
If a site is not ready, it sends a “Don’t Commit” message.

Phase 2:

1. Coordinator receives responses: If all sites are ready (received “Ready” messages), the coordinator sends a “commit” message to all sites.
2. Commit or Abort:
If all sites are ready, they commit the transaction and make the changes permanent.
If any site reports that it can’t commit, the coordinator sends an abort message, and all sites roll back the transaction.

Summary of the Two-Phase Commit Process:

1. Phase 1:
Coordinator sends a “prepare” message to each site.
Sites either agree to commit or decide not to commit, sending responses back.
2. Phase 2:

Coordinator sends commit or abort based on responses.
If all sites agree to commit, the transaction is finalized and committed.
If any site disagrees, the coordinator sends an abort message, and the transaction is canceled.

In Conclusion:

Transactions, especially in distributed systems, are crucial for ensuring consistency, reliability, and data integrity. Understanding the ACID properties and
protocols like Two-Phase Commit ensures that even complex, multi-system transactions can complete without leaving data in an inconsistent state. Whether
working with single systems or distributed systems, the goal is to ensure that each operation is treated as atomic (all or nothing), consistent (leaves the
system in a valid state), isolated (doesn’t interfere with others), and durable (persists even after a crash).

CAP Theorem
• The CAP theorem, originally introduced as the CAP principle, can be used to explain some of the
competing requirements in a distributed system with replication.
• It is a tool used to make system designers aware of the trade-offs while designing "network shared-data systems", i.e. distributed systems that store data on more than one node (physical/virtual machines) at the same time
• The three letters in CAP refer to three desirable properties of distributed
systems with replicated data:
• C -Consistency (among replicated copies)
• A -Availability (of the system for read and write operations)
• P-Partition tolerance (in the face of the nodes in the system being
partitioned by a network fault).
• The CAP theorem states that it is not possible to guarantee all three of the
desirable properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication. We can strongly support only two of the three properties
CAP Theorem Example
CLOUD COMPUTING
• Imagine a distributed system consisting of two nodes:
• The distributed system acts as a plain register with the value of
variable X.
• There’s a network failure that results in a network partition
between the two nodes in the system.
• An end-user performs a write request, and then a read request.
• Let’s examine a case where a different node of the system
processes each request. In this case, our system has two options:
• It can fail at one of the requests, breaking the system’s availability
• It can execute both requests, returning a stale value from the read request and breaking the system’s
consistency
• The system can’t process both requests successfully while also ensuring that the read returns the latest
value written by the write.
• This is because the results of the write operation can’t be propagated from node A to node B because of
the network partition.
1. Consistency (C)Consistency means that the nodes will have the same copies of a replicated data item visible for various
transactions. It’s the guarantee that every node in a distributed cluster returns the same, most recently updated value of
a successful write at any logical time.
In CAP, the term consistency refers to the consistency of the values in different copies of the same data item in a
replicated distributed system. This can be verified if all reads initiated after a successful write return the same and latest
value at any given logical time.
▪ Performing a read operation will return the value of
the most recent write operation causing all nodes
to return the same data.
▪ A system has consistency if a transaction starts with the
system in a consistent state, and ends with the system in a
consistent state (may have an inconsistent state during a transaction,
but the entire transaction gets rolled back if there is an error).
▪ In the image, we have 2 different records (“Bulbasaur” and “Pikachu”) at
different timestamps. The output on the third partition is “Pikachu”,
the latest input although it will need time to update and be Available.
CAP Theorem
CLOUD COMPUTING
CAP Theorem
CLOUD COMPUTING
2. Availability (A)Availability means that each read or write request for a data item from a client to a
node will either be processed successfully or will receive a message that the operation cannot be
completed (if not in failed state).

▪ All working nodes in the distributed
system return a valid response for any
request, without exception
▪ Achieving availability in a distributed system requires that the system remains operational 100% of the time, which may mean provisioning "x" additional servers beyond the "n" servers serving our application
CLOUD COMPUTING
CAP Theorem
3. Partition Tolerance (P) Partition tolerance requires that a system be able to re-route a
communication when there are temporary breaks or failures in the network (network partitions).
▪ It means that the system continues to function and upholds its consistency guarantees in spite of
network partitions.
▪ Distributed systems guaranteeing partition tolerance can gracefully recover from partitions once
the partition heals.
▪ This condition states that the system continues to run, despite the number
of messages being delayed by the network between nodes.
▪ A system that is partition-tolerant can sustain
any amount of network failure that doesn’t
result in a failure of the entire network.
▪ Data records are sufficiently replicated across
combinations of nodes and networks to keep the
system up through intermittent outages.
CLOUD COMPUTING
CAP Theorem : Tradeoffs between Requirements (Illustrations)
1. Availability and Partition-Tolerant (Compromised Consistency): Say you have two nodes and the link
between the two is severed. Since both nodes are up, you can design the system to accept requests on
each of the nodes, which will make the system available despite the network being partitioned.
However, each node will issue its own results, so by providing high availability and partition tolerance
you’ll compromise consistency.
2. Consistent and Partition-Tolerant (Compromised Availability): Say you have three nodes and one
node loses its link with the other two. You can create a rule that a result will be returned only when a
majority of nodes agree (a quorum; see the sketch after this list). So in spite of having a partition, the system will return a consistent result, but
since the separated node won't be able to reach consensus it won't be available even though it's up.

3. Consistent and Available (Compromised Partition-Tolerance): A system can be
both consistent and available, but it may have to block on a partition.
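A minimal sketch of the majority rule from illustration 2 (hypothetical replica replies, no real networking): the majority side of the partition keeps answering consistently, while the isolated node refuses to answer rather than risk returning stale data.

```python
# A value is returned only when a majority of reachable replicas agree,
# so a partitioned minority stays consistent but becomes unavailable.

from collections import Counter

def majority_read(replies, total_nodes):
    """replies: values obtained from the replicas we could reach during the partition."""
    quorum = total_nodes // 2 + 1
    value, count = Counter(replies).most_common(1)[0]
    if count >= quorum:
        return value
    raise TimeoutError("no majority reachable: refuse to answer (availability sacrificed)")

print(majority_read(["v2", "v2"], total_nodes=3))   # majority side: answers 'v2'
try:
    majority_read(["v1"], total_nodes=3)            # isolated node: cannot form a quorum
except TimeoutError as e:
    print(e)
```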
CLOUD COMPUTING
CAP Theorem : Illustrative Examples of products/Application categories to choose
CLOUD COMPUTING
CAP Theorem
• The “Pick Two” expression of CAP opened the minds of designers to a wider range of
systems and tradeoffs
• CAP is NOT a choice “at all times” as to “which one of the three guarantees to abandon”.
In fact, the choice is between consistency and availability only when a network partition or
failure happens. When there is no network failure, both availability and consistency can be
satisfied
• Database products like Cassandra support AP but provide eventual consistency by allowing
clients to write to any node at any time and reconciling inconsistencies as quickly as
possible (a simple reconciliation rule is sketched below)
• Reference: https://www.youtube.com/watch?v=9uCP3qHNbWw
CLOUD COMPUTING
CAP Theorem and some of the criticisms
• Although the Theorem doesn't specify an upper bound on response time for availability, in
practice there exists a timeout. The CAP Theorem ignores latency, which is an important
consideration in practice. Timeouts are often implemented in services. During a partition, if we
cancel a request, we maintain consistency but forfeit availability. In fact, latency can be seen as
another word for availability.
• In NoSQL distributed databases, CAP Theorem has led to the belief that eventual consistency
provides better availability than strong consistency. It could be considered as an outdated notion.
It's better to factor in sensitivity to network delays.
• CAP Theorem suggests a binary decision. In reality, it's a continuum. It’s more of a trade-off. There
are different degrees of consistency implemented via "read your writes, monotonic reads and
causal consistency".
CLOUD COMPUTING
CAP Theorem and how do you use them when making decisions
• The CAP Theorem can feel quite abstract, but both from a technical and business perspective the trade-offs
will lead to some very important questions which have practical, real-world consequences.
• These questions would be
• Is it important to avoid throwing up errors to the client?

• Or are we willing to sacrifice the visible user experience to ensure consistency?
• Is consistency actually an important part of the user's experience?
• Or can we actually do what we want with a relational database and avoid the need for partition
tolerance altogether?
• All of these are ultimately leading to user experience questions.
• This will need understanding of the overall goals of the project, and context in which your database
solution is operating. (E.g. Is it powering an internal analytics dashboard? Or is it supporting a
widely used external-facing website or application?)
CLOUD COMPUTING
CAP Theorem and its relationship to Cloud Computing
• Microservices architecture for applications is popular and prevalent on both cloud servers and on
premises data centers
• We have also seen that microservices are loosely coupled, independently deployable application
components that incorporate their own stack, including their own database and database model,
and communicate with each other over a network.
• Understanding the CAP theorem can help you choose the best database when designing a
microservices-based application running from multiple locations.
E.g. If the ability to quickly iterate the data model and scale horizontally is essential to your
application, but you can tolerate eventual (as opposed to strict) consistency, a database like
Cassandra or Apache CouchDB (supporting AP - Availability and Partitioning) can meet your
requirements and simplify your deployment. On the other hand, if your application depends
heavily on data consistency—as in an eCommerce application or a payment service—you might opt
for a relational database like PostgreSQL.
CLOUD COMPUTING
CAP Theorem Conclusion
• Distributed systems allow us to achieve a level of computing power and availability that were
simply not available in the past.
• Our systems have higher performance, lower latency, and near 100% up-time in data centers that
span the entire globe.
• These systems of today are run on commodity hardware that is easily obtainable and configurable
at affordable costs.
• The disadvantage to it though is Distributed systems are more complex than their single-network
counterparts.
• Understanding the complexity incurred in distributed systems, making the appropriate trade-offs
for the task at hand (CAP), and selecting the right tool for the job is necessary with horizontal scaling

Explain the above in depth for a newbie, using the above words, letters and sentences

Understanding the CAP Theorem in Simple Terms

The CAP Theorem is a concept in distributed systems, which are systems where data is stored across multiple machines or locations, often for reasons like
improving speed, redundancy, or fault tolerance. In these systems, we often face trade-offs between three important goals when designing the system:
Consistency, Availability, and Partition Tolerance. These are the three properties represented by the letters C, A, and P in the CAP theorem.

The Three Key Properties of the CAP Theorem:

1. Consistency (C):
Consistency means that all the nodes (servers or locations) in a distributed system will have the same copy of data at any given time.

In simple terms, when one node updates data, all other nodes will immediately reflect this update.
Example: Think of a shared document where multiple users can edit it. Once one person saves a change, everyone else should see the updated
version immediately.
2. Availability (A):
Availability means that the system is always available for reading and writing data.
Even if some part of the system fails, the system should still allow requests (either read or write) to be processed.
Example: A website should always be available, even if one of its backend systems fails.
3. Partition Tolerance (P):
Partition tolerance means the system can handle "partitions," or temporary communication breakdowns between nodes. Even if some nodes can't
communicate with others, the system should continue to work.
Example: If two parts of a system (say two data centers) lose connection with each other, the system should still work, and once the connection is
restored, it should sync up without losing any data.

The Core Idea of the CAP Theorem:

The CAP Theorem says that in a distributed system, you can choose to guarantee only two of these three properties at the same time:
1. Consistency + Availability, but not Partition Tolerance
2. Consistency + Partition Tolerance, but not Availability
3. Availability + Partition Tolerance, but not Consistency

Let’s explore this with examples.

CAP Theorem Example: A Distributed System with Two Nodes

Let’s imagine a simple distributed system with two nodes (servers or locations). This system is like a shared register that keeps the value of a variable (let’s
call it X).

Scenario 1: Network Partition

Suppose there's a network failure that separates the two nodes. They can no longer communicate with each other.
Now, if someone makes a write request (changes the value of X), and then a read request comes in, here’s what can happen:
Option 1: The system might fail the read or write request because the system is partitioned, meaning one node cannot communicate with the
other. This would break the system’s availability.
Option 2: The system might process both requests but return an outdated or stale value (because the change hasn't reached both nodes), thus
breaking consistency.

This shows how network partitions affect the system’s ability to maintain consistency and availability.

Let’s Break Down the Properties:

1. Consistency (C):
Imagine there are two copies of the same data (like two copies of the same document on different servers). If one copy is updated, the system
must ensure all copies are updated immediately to reflect that change.
A system is consistent if every read request after a write returns the most recent write, ensuring all nodes are in sync.
2. Availability (A):
If you make a request (read or write) to a distributed system, it should always respond, even if some parts of the system are down.

A system is available if it does not fail to respond, even if part of the system is unreachable.
3. Partition Tolerance (P):
Even if there’s a network failure (communication breakdown) between some parts of the system, the system should continue operating without
crashing.
A system is partition-tolerant if it can still function, even with temporary communication failures.

Trade-offs and Why You Can’t Have All Three:

The CAP Theorem teaches us that when designing distributed systems, you must trade-off between two properties when there’s a network partition. You
can’t have all three (Consistency, Availability, and Partition Tolerance) at once.

1. Availability + Partition Tolerance (compromising Consistency):

Imagine two nodes are separated by a network failure. To maintain availability (the system must continue processing requests), each node might
respond with different data. This can break consistency since each node might have different values of X.
Trade-off: You ensure the system is still available, but it might return outdated or inconsistent data.

2. Consistency + Partition Tolerance (compromising Availability):

Let’s say one node gets disconnected from others. To maintain consistency, the system will only return a result when the majority of nodes agree
(and the system has enough communication to sync up). This may mean that one node can’t serve requests because it can’t get consensus with others.
Trade-off: You ensure the system remains consistent, but it may not always be available for requests when nodes can’t communicate.

3. Consistency + Availability (compromising Partition Tolerance):

In this case, the system will maintain consistent data and always be available, but it might not handle network partitions well. If the system is
partitioned, some nodes might be unavailable until the partition is healed.
Trade-off: The system is both consistent and always available, but it can’t handle network partitions properly.

Real-World Examples of CAP in Action:

1. Cassandra (NoSQL Database):

Supports Availability and Partition Tolerance (AP): Cassandra is designed to stay available even if some nodes can’t communicate with others. It
offers eventual consistency, meaning it will eventually sync all nodes, but doesn’t guarantee immediate consistency across all nodes.
2. Relational Databases (like PostgreSQL):
These databases generally focus on Consistency and Availability. If the system goes down, it prioritizes maintaining data accuracy over
availability, but it may not handle network partitions as well.

Criticisms of CAP Theorem:

Timeouts and Latency: The theorem doesn’t take into account response time (latency). If the system takes too long to respond during a partition, it
might sacrifice availability, even if it tries to maintain consistency.
Eventual Consistency: Some databases (like NoSQL databases) focus on eventual consistency instead of strong consistency, meaning they may return
slightly outdated data but will sync once the partition is resolved.
It's not a strict binary: In practice, CAP isn’t about an all-or-nothing decision. Systems can have varying levels of consistency, availability, and
partition tolerance, and often adjust based on the situation.

How Does CAP Theorem Relate to Cloud Computing?

Cloud computing and microservices architectures often rely on distributed systems. Understanding the CAP theorem helps in choosing the right database or
storage solution, depending on whether you need:

Availability (e.g., applications that need to be always online),


Consistency (e.g., applications that need strict data correctness),
or Partition Tolerance (e.g., systems that need to continue functioning even when some nodes are down).

Conclusion:

The CAP Theorem helps you understand the trade-offs when designing distributed systems. Depending on the needs of your application (like speed,
consistency, or fault tolerance), you might prioritize one or two of the three properties at the expense of the third. It's all about balancing these trade-offs to
ensure that your system functions in the way you need it to.

Storage Virtualization Case Studies in Different Contexts
This presentation showcases
Five contexts where storage virtualization is used in recent years
We discuss
The industry background
Problem being addressed
Solution approach
Outcome
CLOUD COMPUTING
Consolidated Communications Holdings
Industry Background:
This 125-year-old company physically spans over 50,000 route miles, 20+ states, and
1.8 million connections.
The decision to go for a new storage solution came up after facing an email outage that
leaked its residential email clients' information.
Industry: Communications
Storage virtualization provider: DataCore
CLOUD COMPUTING
Solution
▪ The company employed
○ a SAN solution with attention to flexibility and performance
○ To keep up with the firm’s requirements for caching.
▪ The firm uses DataCore's storage solution in two forms:
○ a traditional storage product consisting of hard drives and solid-state drives (SSDs)
○ a hyperconverged model with greater storage through virtualized storage
controllers.
CLOUD COMPUTING
Outcomes
● Consistent 100% uptime delivered over a period of 10 years
● Reduced storage-related spend
● Reduced time, effort and cost spent on preventive maintenance
CLOUD COMPUTING
Krishna Institute Of Medical Sciences

With a network of over 13 hospitals and 4,000 beds, Krishna Institute of Medical
Sciences (KIMS) provides comprehensive healthcare services across cardiac
sciences, oncology, neurosciences, organ transplantation, and 20 other specialities.
Industry: Health care
Storage virtualization provider: VMware and VMware vSAN
CLOUD COMPUTING
Problem Context
As KIMS Hospitals expanded, the workload increased threefold.
Due to insufficient information on the performance of the existing storage
infrastructure, there was an increase in the storage options they considered (onsite, ad
hoc, and pay-on-demand cloud storage options)
With each added storage option, the management and maintenance of the storage infrastructure
became ever more complicated.
Some of the applications were running on outdated software,
which made them vulnerable to ransomware attacks.
KIMS Hospitals needed a robust and swift IT infrastructure that could manage big
data within complex systems and numerous applications
CLOUD COMPUTING
Solution
KIMS Hospitals deployed the VMware vSAN solution with Dell hardware to meet its
existing needs for redundancy, scalability and cost efficiency.
VMware’s virtualized storage, combined with VMware vRealize Operations optimized
data exchange between networks and applications.
The modernized hyperconverged infrastructure (HCI) allowed IT administrators at
KIMS Hospitals to have a virtual data plane with real-time analytics on virtual machines
(VMs) storage attributes, including performance, capacity, and availability.
Better control and knowledge of workload and performance
CLOUD COMPUTING
Outcomes:
● 80% of workloads moved from AWS to HCI, saving thousands in billings
● 98% VMware vSAN issues resolved in the pre-development phase
● No downtime encountered during maintenance activities
CLOUD COMPUTING
Seneca Family of Agencies
Founded in 1985, the community caters to over 18,000 young people and
families across California and Washington State
It helps children and families see through the hardest times in their lives and
offers permanency, mental health services, education, and juvenile justice services.
Industry: Mental health services
Storage virtualization provider: StarWind and StarWind HCA
CLOUD COMPUTING
Problem Context
They were using a mix of dated VMware hosts and a Hyper-V cluster for their
virtual environment.
The high incidence of VMware cluster renewals and Hyper-V's slow performance
compelled the organization to move toward hyperconvergence.
The existing IT infrastructure was limited by storage problems and required
extensive patching times done off-hours.
CLOUD COMPUTING
Solution
They implemented the StarWind HCI Appliance (HCA) and ran high availability
with just two nodes
Outcomes:
● Patching in cluster now done during business hours
● Quick migration across hosts
● Increased storage capacity
CLOUD COMPUTING
Grupo Alcione
With over 30 years of experience, Grupo Alcione specializes in electrical equipment
and supplies.
Today, the company has expanded to 22 branches, 800 employees, and 11 Mexican
states.
• Industry: Manufacturing
• Storage virtualization provider: Nutanix and Nutanix AHV
Electronics is a competitive industry; to realize better customer conversions:
CLOUD COMPUTING
Problem
• Grupo Alcione wanted to quickly access inventory in real-time
○ To update customers about pricing and budgeting information in real-time
without delays
• They needed a high-performance, multicloud system with a lower-cost
investment
CLOUD COMPUTING
Solution
• Nutanix AHV built Alcione's infrastructure on nine nodes and two clusters.
• The first cluster, having six nodes, works as a production environment.
• The second cluster, with three nodes, serves the disaster recovery plan (DRP)
○ Here important operational virtual servers replicate every hour
■ ensuring critical service availability.
CLOUD COMPUTING
Outcomes:
● 30% savings on data center expenses
● Quick, real-time access to inventory
● Higher availability of critical services and improved reliability
● Increased operational efficiency
● Reduction in frequent customer support calls
CLOUD COMPUTING
Daegu Metropolitan City
This is not a strict storage virtualization context. It’s more like cloud adoption
The third largest city in the Republic of Korea, Daegu, has a population of 2.5
million residents with eight administrative districts and 13,000+ public employees.
In 2015, the government decided to work on a three-phase project to create the
Daegu City Cloud (D-Cloud)
Industry: Government
Storage virtualization provider: Red Hat and Red Hat Virtualization
CLOUD COMPUTING
Project City Cloud (D-Cloud)
An ambitious project to offer important services like public notices, healthcare, and
financial assistance in a timely manner.
The city government needed to update its decade-old IT infrastructure along with
hardware sourced from different places.
Daegu dealt with higher operational costs, primarily because of the multiplicity of
hardware and software solutions.
CLOUD COMPUTING
Solution Approach

The city initially deployed Red Hat Enterprise Linux, Red Hat JBoss Enterprise
Application Platform (EAP), and Red Hat Virtualization to create its new cloud
based service environment.
Red Hat Enterprise Linux created the foundation for Daegu to start building the
hybrid cloud infrastructure.
Once the initial framework was in place, the virtualization software provided a
central infrastructure for virtualized and cloud-native workloads.
Red Hat JBoss EAP then helped Daegu to handle the management layer, ensure
user security, and attain high performance at scale.
CLOUD COMPUTING
Outcomes:
● Reduced operating costs by 36%
● Transformation of 50% of the legacy IT systems into cloud-based
environment
● Simplified infrastructure operations and management
● Created foundation for easier resident access to information and services
THANK YOU
Prafullata Kiran Auradkar
Department of Computer Science and Engineering
[email protected]
CLOUD COMPUTING
Review slides on Unit-3: Distributed Storage
Prof. P Kokila
Department of Computer Science and Engineering
CLOUD COMPUTING
Introduction to Storage
•Storage : Understanding Data Storage
oFile Storage
oBlock Storage
oObject Storage
•Storage Architectures
oDirectly Attached Storage (DAS)
oNetwork Attached Storage (NAS)
oStorage Area Network (SAN)
•Typical Architecture of a Storage Device/Subsystem

oRAID – Redundant Array of Independent Disks
oMagnetic Disk Structure and Disk Latency
CLOUD COMPUTING
Cloud Storage and Enablers for Storage Virtualization
• What is Cloud Storage
o Complex service, simple storage
o Analogy to cloud storage
o Cloud Storage Service
o Architecture of a cloud storage platform
• What Constitutes Cloud Storage
• Cloud Storage Enablers
o Enablers for Storage Virtualization – File Systems
o Enablers for Storage Virtualization– Logical Volume Manager (LVM)
• Thin Provisioning or Virtual Provisioning
CLOUD COMPUTING
Storage Virtualization
• What is Storage – Virtualization
• Categories of Storage Virtualization
o File level Virtualization
⮚ Distributed File System
⮚ Lustre
⮚ GlusterFS
o Block Virtualization
⮚ Host-Based
⮚ Storage Level
⮚ Network level
CLOUD COMPUTING
Object Storage
• What is Object Storage, Example of Object storage
• Illustration – Amazon Simple Storage Service (S3)
o Organizing Data In S3
⮚ Buckets, Objects and Keys
⮚ Security
⮚ Data Protection
⮚ Large Objects and Multi-part Uploads

• Another Illustration of Object Storage
o Open Stack Swift
o NoSQL Database - DynamoDB
CLOUD COMPUTING
Partitioning and Consistent Hashing
• What is Partitioning
• Goals of Partitioning
• Different Approaches of Partitioning
o Vertical Partitioning
o Workload Driven Partitioning
o Partitioning by Random Assignment
o Horizontal Partitioning
⮚ Partitioning by Key Range
⮚ Partitioning using the Schema
⮚ Partitioning using Graph Partitioning
⮚ Partitioning using Hashing
✔ Partitioning by Hash of Key
✔ Distributed Hashing
✔ Distributed Hashing – Rehashing Problem
• Consistent Hashing
• Partitioning and Secondary Indexes
o Document-based partitioning
o Term-based partitioning
CLOUD COMPUTING
Rebalancing Partitions and Request Routing
• Skewed Workloads and Relieving Hot Spots
• Rebalancing Partitions
• Strategies for Rebalancing Partitions
o Hash based rebalancing : hash mod N
o Fixed number of partitions
o Dynamic Partitioning
o Partitioning proportionally to nodes
• Request Routing
o Approach 1 - Round-robin load balancer
o Approach 2 - Partition-aware load balancer.

o Approach 3 - without any intermediary
o Approach 4 : ZooKeeper
CLOUD COMPUTING
Replication
• What is Replication
• Popular algorithms for replicating changes between nodes
o Leader based or single-leader based replication
⮚ Synchronous Versus Asynchronous Replication
⮚ Implementation of Replication Logs
⮚ Potential Issues and their Handling
✔ Follower Failure: Catch-up recovery
✔ Leader Failure - Failover
✔ Replication Lag
• Identification of Inconsistencies due to Replication Lag
• Replication Lag – Potential Solutions
o Multi-Leader Replication
⮚Multi-Leader Replication Challenges
⮚Single-Leader vs. Multi-Leader Replication
⮚Multi-Leader Replication Topologies
CLOUD COMPUTING
Replication
• Popular algorithms for replicating changes between nodes (Cont.)
o Leaderless replication
⮚ Writes with Leaderless Replication
⮚ Reads with Leaderless Replication
⮚ Potential Pitfalls – 1 : Writing to the Database When a Node Is Down
⮚ Pitfalls – 2 : Read repair
⮚ Pitfalls – 3 : Detecting Concurrent Writes
⮚ Quorum Consistency
⮚ Monitoring staleness
⮚ Multi-datacenter operation
CLOUD COMPUTING
Consistency Models
• What is Consistency and Consistency Guarantee
• Consistency Models – Different Types

o Sequential Consistency (Lamport)
o Causal Consistency
o PRAM (Pipelined Random-Access Memory) consistency
o Strict Consistency (also called Strong consistency or Linearizability)
⮚ Linearizability
CLOUD COMPUTING
CAP Theorem
• Definition of CAP Theorem
• CAP Theorem Example
• CAP refer to three desirable properties
o C - Consistency (among replicated copies)
o A -Availability (of the system for read and write operations)
o P - Partition tolerance (in the face of the nodes in the system being partitioned by a
network fault).
• CAP Theorem : Tradeoffs between Requirements (Illustrations)
• CAP Theorem and some of the criticisms
• CAP Theorem and how do you use them when making decisions
• CAP Theorem and its relationship to Cloud Computing
• CAP Theorem Conclusion
CLOUD COMPUTING
Distributed Transactions
• What is meant as Transactions
• Properties of Transactions (ACID)
• Different types of transactions :
o Nested Transactions
o Distributed Transactions
⮚ Distributed Transaction System Architecture
• Concurrency Control
• Commit Protocols
• Two Phase Commit Protocol

Explain each and every one of the above for a newbie. Explain the above in depth for a newbie, using the above words, letters and sentences

Sure! I'll explain each part in detail, breaking it down for a complete beginner. Let's go step by step.

1. Storage Virtualization:

Storage virtualization is the process of combining multiple physical storage devices (like hard drives) into one virtualized storage pool. This pool can then be
accessed by applications and users without worrying about where the data is physically stored. The purpose is to simplify data management and improve
storage efficiency.

Example: Think of it like having a bunch of drawers for your clothes, but instead of sorting clothes into different drawers, you have one big drawer that
automatically organizes and stores everything inside. You don't need to know which drawer has which shirt; you just access your clothes through that one big
drawer.

2. Case Studies in Storage Virtualization:

Let's now look at different real-world examples of how storage virtualization is used:

Case 1: Consolidated Communications Holdings

Industry Background: This company is 125 years old and provides communications services across 20+ states in the USA.
Problem: They faced a major email outage, leaking residential email client information. This incident showed their storage infrastructure was not
reliable enough.

Solution: They adopted a SAN solution (Storage Area Network) with flexibility and performance in mind, using DataCore’s storage products,
combining both hard drives and solid-state drives (SSDs).
Outcome: The company achieved 100% uptime for 10 years, reduced storage-related spending, and cut down on preventive maintenance costs.

Case 2: Krishna Institute of Medical Sciences (KIMS)

Industry Background: KIMS is a large hospital network offering healthcare services across multiple specialties.
Problem: Their workload increased as they expanded, leading to slow performance due to old storage systems, which also left them vulnerable to
ransomware.

Solution: KIMS implemented VMware vSAN (Virtual Storage Area Network) combined with Dell hardware, modernizing their IT infrastructure.
Outcome: They moved 80% of workloads from AWS to their own hyper-converged infrastructure (HCI), saving on cloud costs and resolving 98% of
VMware vSAN issues before development.

Case 3: Seneca Family of Agencies

Industry Background: Provides mental health services across California and Washington State.
Problem: The organization was using outdated VMware and Hyper-V systems, which led to performance problems and required extensive downtime for
maintenance.
Solution: They switched to StarWind HCI (Hyper-Converged Infrastructure) to run high-availability systems with just two nodes.

Outcome: Maintenance was done during business hours, patching was faster, and the organization had more storage capacity.

Case 4: Grupo Alcione

Industry Background: A company specializing in electrical equipment, operating in Mexico.

Problem: They needed quick access to inventory data in real-time for pricing and budgeting.

Solution: They used Nutanix AHV to create a highly available, multicloud system with a lower cost.
Outcome: They saved 30% on data center expenses and improved operational efficiency and reliability.

Case 5: Daegu Metropolitan City

Industry Background: Daegu is the third-largest city in South Korea, with 2.5 million residents.

Problem: The city needed to update their old IT infrastructure and reduce operational costs.
Solution: They adopted Red Hat Virtualization to build a hybrid cloud environment.

Outcome: They reduced operating costs by 36% and transitioned 50% of their legacy systems to the cloud.

3. Understanding Storage Types:

File Storage:

File storage organizes data in a hierarchical structure, like folders and files on your computer. It's simple and easy to use.

Block Storage:

This breaks data into fixed-size blocks and stores them across disks. It's more efficient for databases and systems that require fast access to data.

Object Storage:

This is a system that stores data as objects (like photos, videos, or large files) instead of files or blocks. It's scalable and is used for large amounts of
unstructured data, like in cloud storage services (e.g., Amazon S3).
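As a hedged illustration of the flat key/object model, here is a minimal sketch using the AWS boto3 SDK; the bucket name is a placeholder and the call assumes credentials are already configured, so treat it as a sketch rather than a ready-to-run deployment.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"   # hypothetical bucket; must already exist in your account

# Store an object: a flat key plus the data and optional metadata, no real folder hierarchy.
s3.put_object(
    Bucket=BUCKET,
    Key="notes/2024/hello.txt",        # the '/' is just part of the key, not a directory
    Body=b"hello object storage",
    Metadata={"author": "demo"},
)

# Retrieve it by the same key.
obj = s3.get_object(Bucket=BUCKET, Key="notes/2024/hello.txt")
print(obj["Body"].read(), obj["Metadata"])
```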

Storage Architectures:

DAS (Direct Attached Storage): Storage that is directly attached to a server without a network in between.
NAS (Network Attached Storage): Storage connected to a network that multiple users or computers can access.

SAN (Storage Area Network): A specialized network that provides high-speed access to consolidated, block-level data storage.

4. Cloud Storage & Virtualization:

Cloud storage lets you store data on remote servers accessed over the internet. The cloud provider takes care of the physical hardware.
Enablers for Storage Virtualization include File Systems (like NTFS) and Logical Volume Managers (LVM), which manage the storage across physical
devices.

5. Partitioning & Hashing:

Partitioning is dividing data into smaller chunks to improve performance and scalability. Different types include:

Vertical Partitioning: Dividing by columns (data types).

Horizontal Partitioning: Dividing by rows (records).

Consistent Hashing is a technique used to evenly distribute data across partitions to avoid "hot spots" (overloaded nodes).
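A small self-contained sketch of consistent hashing in Python (one position per node, no virtual nodes): keys and nodes are hashed onto the same ring, and adding a node moves only the keys that fall into its segment, unlike the hash(key) % N scheme.

```python
import bisect
import hashlib

def h(s):
    # Stable hash onto a 32-bit ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        positions = [pos for pos, _ in self.ring]
        i = bisect.bisect_right(positions, h(key)) % len(self.ring)  # first node clockwise
        return self.ring[i][1]

    def add(self, node):
        bisect.insort(self.ring, (h(node), node))

ring = HashRing(["node1", "node2", "node3"])
keys = [f"user{i}" for i in range(8)]
before = {k: ring.node_for(k) for k in keys}
ring.add("node4")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print("keys that moved after adding node4:", moved)   # typically only a few of the 8
```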

6. Replication & Consistency Models:

Replication is the process of copying data across multiple nodes to ensure high availability and fault tolerance. Common replication types are:

Leader-based Replication: One node is the leader, and others follow its updates.

Leaderless Replication: All nodes can accept write requests, but care must be taken to resolve conflicts.
Consistency Models: These ensure that users always see the most recent version of the data.
Strong Consistency: Guarantees that every read sees the most recent write.

Eventual Consistency: Guarantees that, eventually, all copies of data will become consistent, but there might be a temporary mismatch.
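A minimal sketch of the single-leader idea in plain Python (no networking or failover): the leader appends each write to its log and followers apply it, synchronously or asynchronously, which is where replication lag comes from.

```python
# Single-leader replication in miniature: writes go to the leader, which appends
# them to a log; followers apply the log entries in order.

class Node:
    def __init__(self):
        self.log = []        # ordered list of writes
        self.data = {}

    def apply(self, entry):
        key, value = entry
        self.log.append(entry)
        self.data[key] = value

class Leader(Node):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def write(self, key, value, synchronous=True):
        self.apply((key, value))
        if synchronous:
            for f in self.followers:          # wait until every follower has applied it
                f.apply((key, value))

followers = [Node(), Node()]
leader = Leader(followers)
leader.write("x", 1)                          # synchronous: all copies agree
leader.write("y", 2, synchronous=False)       # asynchronous: followers lag behind
print(leader.data, followers[0].data)         # {'x': 1, 'y': 2} {'x': 1}
```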

7. CAP Theorem (Consistency, Availability, Partition Tolerance):

The CAP Theorem states that in a distributed system, you can only guarantee two out of the following three:

Consistency (C): All nodes see the same data at the same time.
Availability (A): Every request will receive a response, even if it’s stale.

Partition Tolerance (P): The system can continue working even if there are network issues.

Example: When a network failure (partition) forces a choice:

CP: The system sacrifices availability to maintain consistency.

AP: The system sacrifices consistency to remain available.

CA: Only meaningful when partitions are not a concern; the system stays consistent and available but cannot tolerate a partition.

8. Distributed Transactions & ACID:

Transactions are operations that ensure data consistency and correctness.

ACID stands for:

Atomicity: The transaction is all or nothing.


Consistency: The database is always in a consistent state.

Isolation: Transactions do not interfere with each other.


Durability: Once a transaction is committed, its results are permanent and survive failures such as crashes or power loss.

Conclusion:

This is an overview of key concepts in cloud storage, virtualization, and distributed systems. These systems allow businesses to handle vast amounts of data
efficiently, but require careful design to balance performance, consistency, and cost.
