DISTRIBUTED TRANSACTIONS IN DISTRIBUTED DATABASE SYSTEMS
Introduction
In earlier computer systems, data was stored and processed at a single central computer. As
technology advanced and organizations grew larger, the need for storing large amounts of
data and providing fast access to users at different locations increased. To meet these
requirements, databases were distributed across multiple computers connected by a network.
Such systems are called Distributed Database Systems.
In a distributed database system, data is not stored in one location but is spread across several
sites. Each site has its own local database system that manages data independently while
cooperating with other sites. Users can access data from any site as if it were stored locally,
even though it may actually be located in another place.
A transaction is a logical unit of work that performs one or more operations such as
insertion, deletion, or update on data. When a transaction involves data from only one
database, it is called a local transaction. However, when a transaction needs to access or
modify data stored at multiple sites, it becomes a distributed transaction.
A distributed transaction may involve reading data from one site, updating data at another
site, and confirming the result at a third site. This makes management of transactions more
difficult because each participating site must coordinate with others to ensure that all
operations are successfully completed.
The biggest challenge in a distributed transaction system is maintaining data consistency. It
must be ensured that either all databases involved reflect the changes made by the transaction
or none of them do. Partial execution may lead to data inconsistency, which can cause serious
problems such as incorrect balances, lost data, or system failure.
Another important issue is fault tolerance. Failures can occur due to:
System crash
Network failure
Disk errors
Power failure
Coordinator failure
When a failure happens at any site, the transaction must either be restarted or rolled back
completely at all sites. This requires strong failure-handling mechanisms.
Distributed transactions also raise issues related to concurrency control because multiple
users may access the same data at the same time from different locations. Without proper
control, this can lead to incorrect results. Therefore, distributed databases use locking
techniques and timestamp ordering to avoid conflicts.
To solve these challenges, distributed transaction systems use:
Commit protocols
Logging techniques
Recovery algorithms
Transaction coordinators
Failure detection mechanisms
The aim of distributed transactions is to make a group of separate computers behave like a
single unified system. From the user’s point of view, the transaction should appear as if it
were executed on a single system, even though it is actually running on multiple servers
located at different places.
Thus, distributed transactions allow organizations to:
Process large amounts of data efficiently
Share resources across locations
Improve availability and reliability
Support global business operations
Achieve better performance and scalability
Definition of Distributed Transaction
A Distributed Transaction is a type of transaction that performs operations on data stored at
more than one geographic location in a distributed database system. In such a transaction,
databases located on different computers, servers, or networks take part in a single logical
unit of work.
Formal Definition:
A distributed transaction is a transaction that accesses and updates data on more than one site
in a distributed database system and guarantees ACID properties across all participating
locations.
In simple terms, when one transaction performs insert, update, or delete operations on
multiple databases situated at different sites, it is called a distributed transaction.
For example, in a banking system, when money is transferred from one account to another,
one database may deduct the amount while another database credits it. Since more than one
database is involved, the transaction becomes distributed.
The most important feature of a distributed transaction is atomic execution. This means the
transaction must execute completely at all sites or not execute at all. Partial execution is not
allowed. If one site fails while performing an operation, all other completed operations at
other sites must be undone through a rollback process.
Distributed transactions require special software called a Transaction Coordinator, which
controls the execution process and ensures that all sites agree on the final result. The
coordinator sends instructions, collects responses, and makes the final commit or rollback
decision.
Another important objective of distributed transactions is maintaining consistency. All
databases must reflect the same result. If some databases commit while others fail, it may
lead to data corruption and system failure. Therefore, strict coordination is necessary.
Distributed transactions also ensure isolation, so that multiple transactions running at the
same time do not interfere with each other. Locks or timestamps are used to control
concurrent execution.
Finally, durability ensures that once a distributed transaction is successfully completed, the
changes remain saved permanently, even if a system crash occurs.
Characteristics of Distributed Transactions
Distributed transactions have the following features:
1. Multiple databases are involved
2. Transaction is executed at different locations
3. Coordinator controls the transaction
4. Communication between sites is required
5. Atomicity must be maintained
6. Recovery mechanism is required
7. All sites must agree before commit
8. Locking is used to avoid conflicts
9. Failure at one site affects all
10. Decision must be common at all nodes
Types of Transactions
In a distributed database environment, transactions are classified based on the number of
database sites they access. Mainly, there are two types of transactions:
(a) Local Transaction
A Local Transaction is a transaction that accesses and updates data stored at only one site in
a distributed database system. All operations of the transaction such as insert, update, delete,
and read are performed within a single database.
In this type of transaction, no other database participates, and there is no communication with
other systems over the network. The entire transaction is completed locally using the database
management system of that particular site.
Characteristics of Local Transactions:
Executes at only one database site
No coordination with other sites required
Faster execution
Less communication cost
Easier recovery
Lower complexity
Example of Local Transaction:
Updating the salary of an employee stored in a local database system is a local transaction
because it does not involve any other database.
(b) Distributed Transaction
A Distributed Transaction is a transaction that accesses and updates data located on more
than one site in a distributed database system. It involves coordination among multiple
database servers to ensure correctness and consistency.
In a distributed transaction, each participating site executes part of the transaction and
cooperates with other sites through a transaction coordinator. All sites must successfully
complete their part for the transaction to commit successfully.
If any site fails while executing its operation, all other sites must roll back their changes to
maintain consistency.
Characteristics of Distributed Transactions:
Involves multiple sites
Requires data communication across network
Coordinator controls execution
Commit decision must be common
More complex than local transaction
Needs commit protocols
High reliability mechanisms required
Example of Distributed Transaction:
In an online banking system, when money is transferred from one account to another across
different banks, deducting money from one account and crediting it to another happens at two
different sites. This is a distributed transaction.
Commit Protocols in Distributed Transactions
Meaning of Commit Protocol
In a distributed database system, a single transaction may involve several databases located at
different sites. Each site performs a part of the transaction independently. The major
challenge is to ensure that all participating databases reach the same final decision about the
transaction.
The final decision can only be one of two options:
Commit the transaction
Rollback the transaction
A Commit Protocol is a technique or procedure used to coordinate all the participating sites
and make sure that every site executes the same final action at the end of the transaction.
Definition of Commit Protocol
A commit protocol is a structured set of rules that coordinates multiple database sites to
determine whether a distributed transaction should be committed or rolled back, while
ensuring atomicity and consistency.
Purpose of Commit Protocol
The main goals of a commit protocol are:
1. To guarantee that all sites come to the same decision
2. To ensure that partial execution never occurs
3. To maintain data consistency
4. To preserve atomicity
5. To detect failures and take recovery actions
6. To handle network breakdowns
7. To prevent data corruption
8. To ensure reliable transaction processing
Why Commit Protocol is Required
In distributed systems, failures can occur at many levels:
Node failure
Message loss
Network failure
Crash of coordinator
Disk failure
Without a commit protocol, one site may commit while another may rollback. This would
result in a database inconsistency problem which can seriously damage the integrity of the
system.
Commit protocols prevent such problems and ensure safe and reliable transaction execution.
Key Requirements of Commit Protocols
A good commit protocol must satisfy the following:
All sites must participate equally
Final decision must be unanimous
Communication must be reliable
Logs must be properly maintained
Reasonable performance
Fault tolerance must exist
Must support recovery
Role of Transaction Coordinator
A commit protocol uses a special component known as the Transaction Coordinator.
The coordinator:
Sends messages to all sites
Collects votes
Makes final decision
Broadcasts outcome
Handles failure situations
Commit protocols are essential in a distributed database system. Without them, it is
impossible to ensure that databases located at multiple sites maintain consistency.
Commit protocols make distributed systems reliable and trustworthy by enforcing a single
final decision on all sites and ensuring correct transaction execution even when failures
occur.
Types of Commit Protocols
Mainly two commit protocols are used:
1. Two Phase Commit Protocol (2PC)
2. Three Phase Commit Protocol (3PC)
6.1 Two Phase Commit Protocol (2PC)
The Two Phase Commit Protocol is the most widely used commit protocol in distributed
databases.
There are two main roles:
Coordinator – controls the transaction
Participants – database sites that execute the transaction
Working of Two Phase Commit Protocol
It consists of two stages:
Phase 1: Prepare Phase (Voting Phase)
In this phase, the coordinator sends a message to all participants asking:
"Are you ready to commit?"
Each site:
Executes its operations
Stores updates in log
Checks for errors
Sends a response:
   YES → ready to commit
   NO → cannot commit
Phase 2: Commit Phase (Decision Phase)
If all sites reply YES:
Coordinator sends COMMIT command
All sites commit changes
If any site replies NO:
Coordinator sends ROLLBACK command
All sites undo changes
2PC Diagram
               COORDINATOR
                    |
      -------------------------
      |           |           |
   Site A      Site B      Site C

Phase 1 → Vote (YES / NO)
Phase 2 → Commit / Rollback
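The two phases above can be sketched as a small simulation. The names Participant and two_phase_commit are illustrative, not part of any real DBMS API:

```python
# Sketch of Two Phase Commit (illustrative names, not a real DBMS API).

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # simulates local success/failure
        self.state = "ACTIVE"

    def prepare(self):
        # Phase 1: the site writes its updates to the log, then votes.
        if self.can_commit:
            self.state = "READY"
            return "YES"
        self.state = "ABORTED"
        return "NO"

    def commit(self):
        self.state = "COMMITTED"

    def rollback(self):
        self.state = "ABORTED"

def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator collects a vote from every site.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only if every vote is YES.
    if all(v == "YES" for v in votes):
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        p.rollback()
    return "ROLLBACK"

# All sites ready -> global commit.
print(two_phase_commit([Participant("A"), Participant("B"), Participant("C")]))  # COMMIT

# One site cannot commit -> every site rolls back (atomicity preserved).
print(two_phase_commit([Participant("A"), Participant("B", can_commit=False)]))  # ROLLBACK
```

Note how a single NO vote forces a global ROLLBACK, which is exactly the "all or nothing" rule described above.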
Advantages of Two Phase Commit
Ensures data consistency
Guarantees atomicity
Simple logic
Widely implemented
Disadvantages of Two Phase Commit
Coordinator failure blocks system
Slow process
Network overhead
Participants remain locked
Blocking protocol
6.2 Three Phase Commit Protocol (3PC)
To overcome the blocking problem of 2PC, Three Phase Commit was introduced.
Working of 3PC
It consists of three phases:
Phase 1: Can Commit Phase
Coordinator asks:
"Can you commit?"
Participants reply YES or NO.
Phase 2: Pre-Commit Phase
Coordinator tells all sites to get ready to commit.
Participants:
Save data in stable storage
Prepare for final commit
Phase 3: Do Commit Phase
Coordinator sends final commit instruction.
All participants commit the transaction.
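The three global states can be sketched in a few lines, assuming the coordinator has already gathered the can-commit votes (function and state names are illustrative):

```python
# Sketch of the Three Phase Commit state sequence (illustrative names).
# The extra PRE_COMMIT state is what lets surviving sites finish the
# protocol on their own if the coordinator crashes after phase 2.

def three_phase_commit(votes):
    """votes: list of 'YES'/'NO' replies from the can-commit phase.
    Returns the sequence of global states the protocol moves through."""
    states = ["CAN_COMMIT"]            # Phase 1: ask "can you commit?"
    if all(v == "YES" for v in votes):
        states.append("PRE_COMMIT")    # Phase 2: every site prepares
        states.append("DO_COMMIT")     # Phase 3: final commit
    else:
        states.append("ABORT")
    return states

print(three_phase_commit(["YES", "YES", "YES"]))
# ['CAN_COMMIT', 'PRE_COMMIT', 'DO_COMMIT']
print(three_phase_commit(["YES", "NO"]))
# ['CAN_COMMIT', 'ABORT']
```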
Advantages of Three Phase Commit
Reduces the blocking problem (non-blocking as long as there is no network partition)
Better failure handling
Coordinator failure does not stop the system
Disadvantages of Three Phase Commit
More messages
More time
Complex design
High communication cost
CONCURRENCY CONTROL IN DDBMS
Meaning of Concurrency Control
Concurrency control in a Distributed Database Management System (DDBMS) is a technique
used to control the simultaneous execution of transactions so that the database remains
correct and consistent.
In distributed systems, many users may access the same data from different locations at the
same time. If concurrency is not properly managed, this may lead to incorrect results, data
loss, or system failure.
Concurrency control ensures that even though transactions run in parallel, the final result is
the same as if they were executed one by one.
Definition
Concurrency Control is the process of managing multiple transactions in a distributed
database system in such a way that data remains accurate, consistent, and reliable.
Need for Concurrency Control in DDBMS
Concurrency control is required because:
Many users may update the same data at the same time
Transactions are executed at different sites
Network delays may cause improper execution
System failures may occur
Resources are shared
Data must not become inconsistent
Isolation property must be maintained
Database integrity must be protected
Problems Without Concurrency Control (With Real-World Examples)
If concurrency control is not applied in a Distributed Database Management System
(DDBMS), multiple transactions may interfere with each other and produce incorrect results.
This can lead to data corruption, wrong reports, and system failure.
The major problems that occur without concurrency control are discussed below:
(a) Lost Update Problem
The lost update problem occurs when two transactions update the same data item at the same
time, and one update is overwritten by the other. As a result, one transaction’s result is lost.
Example (Bank System):
Suppose account balance = ₹5000
Two users withdraw money at the same time:
Transaction T1: Withdraw ₹1000
Transaction T2: Withdraw ₹2000
Both read initial balance = ₹5000.
T1 calculates: 5000 – 1000 = 4000
T2 calculates: 5000 – 2000 = 3000
Now:
T1 writes ₹4000
After that, T2 writes ₹3000
Final result becomes ₹3000.
Correct balance should be:
5000 – 1000 – 2000 = ₹2000
Here, T1 update is lost. This is called Lost Update Problem.
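The interleaving above can be replayed in a few lines of Python:

```python
# Replaying the lost update: both withdrawals read the same initial
# balance, so the later write silently overwrites the earlier one.

balance = 5000

t1_read = balance            # T1 reads 5000
t2_read = balance            # T2 reads 5000 (before T1 writes)

balance = t1_read - 1000     # T1 writes 4000
balance = t2_read - 2000     # T2 writes 3000 -> T1's update is lost

print(balance)               # 3000, but the correct balance is 2000
assert balance != 5000 - 1000 - 2000   # the anomaly in one line
```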
(b) Dirty Read Problem
Dirty read occurs when one transaction reads data written by another transaction that has not
yet been committed.
Example (Online Shopping System):
Transaction T1: Updates product price from ₹500 to ₹800 (not yet committed).
Transaction T2: Reads updated price as ₹800 and shows it to customer.
Later, T1 fails and rolls back. Price returns to ₹500.
But customer already saw ₹800.
This incorrect reading is called Dirty Read because uncommitted data was read.
(c) Inconsistent Retrieval Problem
Inconsistent retrieval happens when a transaction reads some values before another
transaction updates them and some values after the update, leading to incorrect results.
Example (Bank Report System):
Suppose:
Account A = ₹3000
Account B = ₹5000
Transaction T1: Transfer ₹1000 from A to B.
Transaction T2: Calculates total balance.
T2 reads A = ₹3000
Then T1 updates A = ₹2000 and B = ₹6000
Now T2 reads B = ₹6000.
Total seen by T2 = 3000 + 6000 = ₹9000
Actual balance = 2000 + 6000 = ₹8000
Report is incorrect due to inconsistent data.
(d) Phantom Read Problem
Phantom read occurs when a transaction gets different results for the same query executed
twice because another transaction inserts or deletes rows.
Example (University Database):
Transaction T1:
SELECT * FROM STUDENT WHERE MARKS > 70;
Result: 5 students.
Transaction T2 inserts another student with marks = 75.
T1 runs query again and gets 6 students.
The new row appears like a “ghost” result → Phantom Read.
Techniques of Concurrency Control in DDBMS
Concurrency control in a Distributed Database Management System (DDBMS) is required to
ensure that multiple transactions do not interfere with one another while accessing shared
data. Since data is stored at different sites, control is necessary to avoid inconsistency.
Distributed Locking Protocol
In this technique, before a transaction can access a data item, it must obtain a lock on that
item. Other transactions cannot change that data until the lock is released.
Types of Locks
(a) Shared Lock (Read Lock)
Allows only reading of data. Multiple transactions can hold this lock at the same time.
(b) Exclusive Lock (Write Lock)
Allows modification. Only one transaction is allowed.
Working
1. The transaction requests a lock.
2. If lock is available, it is granted.
3. If lock is busy, the transaction waits.
4. After completion, the lock is released.
Real-World Example (Bank System)
Customer A withdraws money.
Customer B deposits money at the same time.
The system locks the account.
Customer B must wait until A finishes.
Correct balance is maintained.
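The bank example can be sketched with Python's threading.Lock standing in for the DBMS lock manager. This models a single-site exclusive lock, not a real distributed lock:

```python
import threading

# One lock per account stands in for the lock manager (a sketch only).
account_lock = threading.Lock()
balance = 5000

def withdraw(amount):
    global balance
    with account_lock:               # acquire exclusive (write) lock
        current = balance            # read
        balance = current - amount   # write happens before lock release

def deposit(amount):
    global balance
    with account_lock:
        current = balance
        balance = current + amount

# Customer A withdraws while customer B deposits at the same time.
t1 = threading.Thread(target=withdraw, args=(1000,))
t2 = threading.Thread(target=deposit, args=(500,))
t1.start(); t2.start()
t1.join(); t2.join()

print(balance)   # always 4500: the lock serializes the two updates
```

Because each update holds the lock for its whole read-modify-write cycle, no interleaving can lose an update, regardless of which thread runs first.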
Advantages
Simple technique
Avoids inconsistency
Provides safety
Disadvantages
Deadlocks
Delays
Performance decrease
Timestamp Ordering Protocol
Each transaction is given a timestamp when it starts execution. Transactions are processed in
timestamp order to maintain consistency.
Rules
Conflicting operations must be executed in timestamp order, so the older transaction takes precedence.
A transaction whose operation would violate this order is aborted and restarted with a new timestamp.
Real-World Example (Ticket Booking System)
User A books ticket at 10:00 AM.
User B books ticket at 10:01 AM.
A is processed first.
If B conflicts, it is restarted.
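The restart rule can be sketched by tagging each data item with the timestamps of its last read and write; an operation that would violate timestamp order forces its transaction to restart (class and function names are illustrative):

```python
# Sketch of the basic timestamp-ordering write rule (illustrative names).

class DataItem:
    def __init__(self):
        self.read_ts = 0    # largest timestamp that has read this item
        self.write_ts = 0   # largest timestamp that has written this item

def try_write(item, txn_ts):
    """Return True if the write is allowed, False if the transaction
    must be aborted and restarted with a new timestamp."""
    if txn_ts < item.read_ts or txn_ts < item.write_ts:
        return False        # a younger transaction already used the item
    item.write_ts = txn_ts
    return True

ticket = DataItem()
print(try_write(ticket, txn_ts=2))   # True: first write succeeds
print(try_write(ticket, txn_ts=1))   # False: an older txn arrives too late
```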
Advantages
No deadlock
High parallel processing
Disadvantages
Starvation
Extra restarts
Optimistic Concurrency Control
This method assumes conflicts are rare. Transactions proceed without locking and are
checked only at commit time.
Phases
1. Read Phase – Data read into memory.
2. Validation Phase – Conflict checked.
3. Write Phase – Data saved if valid.
Real-World Example (Online Examination System)
Students attempt questions freely.
Final submission is validated before saving.
If conflict occurs, submission is rejected.
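The three phases can be sketched with a simple version check at commit time (data item and function names are illustrative):

```python
# Sketch of optimistic validation (illustrative names): a transaction
# remembers the version of each item it read; at commit time it is
# valid only if that version has not changed in the meantime.

versions = {"answers_form": 1}    # current version of each data item

def begin_read(item):
    return versions[item]         # read phase: remember the version seen

def commit(item, seen_version):
    # validation phase: conflict if someone else committed in between
    if versions[item] != seen_version:
        return "REJECTED"
    versions[item] += 1           # write phase
    return "COMMITTED"

v = begin_read("answers_form")
versions["answers_form"] += 1     # another transaction commits first
print(commit("answers_form", v))  # REJECTED: validation fails
```

No lock is ever held, which is why this approach is fast when conflicts are rare and wasteful (rollbacks) when they are common.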
Advantages
No locks required
High speed if conflicts are low
Disadvantages
Rollbacks
High validation cost
Quorum-Based Protocol
In this technique, a transaction can read or write data only after receiving permission from a
minimum number of database replicas.
Types
Read quorum
Write quorum
Real-World Example (Cloud Storage System)
File is updated only after approval from multiple servers.
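Quorum sizes are usually chosen so that R + W > N (every read quorum overlaps every write quorum, so a read always sees the latest write) and 2W > N (no two writes can both succeed). A small check of these standard conditions:

```python
# Standard quorum-intersection conditions for N replicas with
# read quorum R and write quorum W.

def quorum_ok(n, r, w):
    # R + W > N : any read set intersects any write set
    # 2W > N    : any two write sets intersect (no conflicting writes)
    return r + w > n and 2 * w > n

print(quorum_ok(n=5, r=3, w=3))   # True: a common majority choice
print(quorum_ok(n=5, r=2, w=3))   # False: a read can miss the latest write
```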
Advantage
Reduces network load
Disadvantage
Communication delays
DISTRIBUTED QUERY PROCESSING
Meaning of Distributed Query Processing
Distributed Query Processing is the method used to process SQL queries when data is stored
across multiple sites in a distributed database system.
When a user submits a query, the system:
Finds where the data is located
Divides the query into parts
Sends sub-queries to appropriate sites
Collects results
Combines them into a final answer
Definition
Distributed Query Processing is the process of analyzing, optimizing, and executing queries
in a distributed database system where data is stored at different network locations.
Objectives of Distributed Query Processing
The main objectives are:
Minimize data transfer cost
Reduce response time
Improve performance
Efficient use of network bandwidth
Locate data quickly
Execute queries in parallel
Generate correct results
Phases of Distributed Query Processing
Distributed query processing is completed in three main phases:
4.1 Query Decomposition
The query is broken into smaller parts and converted into internal form.
Steps:
Parsing the query
Syntax checking
Converting to relational algebra
Removing redundant predicates
4.2 Query Optimization
Best execution plan is selected.
Optimization decides:
Where to process query
How much data to transfer
Whether to send query or data
Which site should execute which part
Types of Optimization:
(a) Static Optimization
Plan decided before execution.
(b) Dynamic Optimization
Plan decided during execution.
4.3 Query Execution
After optimization, query is executed.
Steps:
Sub-queries sent to sites
Local queries executed
Results transferred
Final result assembled
Query Processing Techniques in Distributed Database
In a Distributed Database Management System (DDBMS), data is stored at different sites
(branches, servers, locations).
When a user fires a query, the system must decide:
Where to execute the query?
Which site will process which part?
Should we send data to query or query to data?
For this, different query processing techniques are used.
Data Localization
Meaning
Data Localization means converting a global query (written as if data is in one big database)
into local queries for each site where the data is actually stored.
The user writes:
SELECT * FROM STUDENT WHERE CITY = 'Amritsar';
User doesn’t care where data is stored.
The DDBMS internally:
Finds which sites have STUDENT data
Breaks query into smaller parts
Sends each part to appropriate site
Collects and merges the results
This conversion of one global query into multiple local queries is called data localization.
Steps in Data Localization
1. Identify data location
   Check which sites have the required relations or fragments.
2. Rewrite global query
   Convert global tables into fragments (horizontal / vertical).
3. Generate local sub-queries
   One query for each participating site.
4. Execute locally at each site
   Each site processes its own sub-query.
5. Combine all partial results
   Final result sent back to user.
Example (University Database)
Assume STUDENT table is horizontally fragmented:
Site A (Amritsar Campus): STUDENT where CITY = 'Amritsar'
Site B (Jalandhar Campus): STUDENT where CITY = 'Jalandhar'
Site C (Ludhiana Campus): STUDENT where CITY = 'Ludhiana'
Global Query:
SELECT NAME, CITY FROM STUDENT WHERE CITY = 'Amritsar' OR CITY =
'Jalandhar';
Data Localization:
At Site A:
SELECT NAME, CITY FROM STUDENT WHERE CITY = 'Amritsar';
At Site B:
SELECT NAME, CITY FROM STUDENT WHERE CITY = 'Jalandhar';
At Site C:
No query needed (Ludhiana campus not required).
The results from Site A and Site B are combined and shown to the user as one result.
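The rewriting step can be sketched as a routing function that emits a sub-query only for the sites whose fragments can contribute (site and fragment names follow the example above):

```python
# Sketch of data localization for the horizontally fragmented
# STUDENT table from the example above.

fragments = {                 # each site holds one CITY fragment
    "Site A": "Amritsar",
    "Site B": "Jalandhar",
    "Site C": "Ludhiana",
}

def localize(wanted_cities):
    """Rewrite the global query into one local sub-query per
    site whose fragment can actually match."""
    subqueries = {}
    for site, city in fragments.items():
        if city in wanted_cities:      # sites that cannot match are skipped
            subqueries[site] = (
                f"SELECT NAME, CITY FROM STUDENT WHERE CITY = '{city}';"
            )
    return subqueries

plans = localize({"Amritsar", "Jalandhar"})
for site, sql in sorted(plans.items()):
    print(site, "->", sql)
# Site C receives no sub-query at all, exactly as in the example.
```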
Advantages of Data Localization
Reduces amount of unnecessary data transfer
Uses local processing power
Keeps global query independent of physical distribution
Better performance
Disadvantages
Requires knowledge of fragmentation and data distribution
Query rewriting is complex
Centralized Processing
Meaning
In Centralized Processing, all required data is brought to one central site, and the entire
query is processed at that single site.
Here, we send data to query, not query to data.
How It Works
1. User sends query to central server (Coordinator site).
2. Central site identifies which sites contain required data.
3. Data from those sites is transferred to central site.
4. Central site executes full query.
5. Final result returned to user.
Example (Bank Head Office System)
Assume:
Branch A, B, C each store local ACCOUNT data.
Head office (Central site) wants:
SELECT BRANCH_NAME, SUM(BALANCE)
FROM ACCOUNT
GROUP BY BRANCH_NAME;
Centralized Approach:
Branch A sends its ACCOUNT data to Head Office
Branch B sends its ACCOUNT data to Head Office
Branch C sends its ACCOUNT data to Head Office
Head Office executes the GROUP BY and SUM
Head Office sends final summary report to manager
All processing of query is done at Head Office only.
Advantages of Centralized Processing
Implementation is simple
Head Office controls everything
No need for complex distributed algorithms
Easier to manage security and access control
Disadvantages
High network cost (large data movement)
Central site becomes a bottleneck
If central site fails → whole system stops
Poor scalability for big data
Distributed Processing
Meaning
In Distributed Processing, each site executes its own part of the query, and only the
necessary intermediate results are sent over the network.
Here we send query to data, not data to query.
Processing is shared among multiple sites → parallel execution.
How It Works
1. Global query is decomposed into several sub-queries.
2. Sub-queries are sent to different sites where required data is stored.
3. Each site executes its part locally (using its own DBMS).
4. Partial results are sent back to a coordinator site.
5. Coordinator combines these partial results to produce the final output.
Example (Online Shopping System)
Suppose tables:
CUSTOMER at Site A
ORDERS at Site B
PAYMENT at Site C
Query:
SELECT C.NAME, O.ORDER_ID, P.AMOUNT
FROM CUSTOMER C, ORDERS O, PAYMENT P
WHERE C.CUST_ID = O.CUST_ID AND O.ORDER_ID = P.ORDER_ID
AND C.CITY = 'Dinanagar';
(column names such as CUST_ID and AMOUNT are illustrative)
Distributed Processing:
At Site A (CUSTOMER):
Filter customers from Dinanagar
At Site B (ORDERS):
Get orders related to those customers
At Site C (PAYMENT):
Get payment details for those orders
Then partial results are joined and final output is prepared at coordinator site.
Advantages of Distributed Processing
Uses all sites’ CPU power (parallelism)
Reduces amount of data transferred
Faster response for large queries
Scalable and efficient
Disadvantages
More complex query planning and optimization
Requires good coordination between sites
Network failure can disturb execution
6. Query Decomposition Steps in Distributed Query Processing
Query Decomposition is the first and most important stage of distributed query processing. In
this stage, the user’s SQL query is converted into an internal form so that it can be executed
efficiently over distributed sites.
The main aim is to:
Check correctness
Reduce complexity
Improve performance
Generate an optimized query
6.1 Normalization
Meaning
Normalization in query decomposition means converting the SQL query into a standard
internal form, usually in the form of relational algebra or a query tree.
This allows the system to understand the query clearly and apply optimization techniques
easily.
Example
SQL Query:
SELECT * FROM STUDENT WHERE (MARKS > 60 AND CITY = 'Delhi') OR (CITY =
'Delhi' AND MARKS > 60);
Normalized Form:
SELECT * FROM STUDENT WHERE MARKS > 60 AND CITY = 'Delhi';
Redundant condition is removed and query is written in clear form.
Importance of Normalization
Removes duplicate conditions
Converts query into simple format
Makes query easy to optimize
Avoids repeated work
6.2 Analysis
Meaning
In this step, the syntax and logic of the query is checked.
The system ensures:
Table names are valid
Column names are correct
Data types match
Conditions make sense logically
No ambiguity is present
Example
Query:
SELECT NAME FROM STUDENT WHERE AGE = 'abc';
This query is incorrect because AGE is numeric.
During analysis, this error is detected and query is rejected.
Importance of Analysis
Finds errors early
Prevents crash
Ensures only valid queries are executed
Avoids incorrect results
6.3 Simplification
Meaning
In this stage, the query is simplified by:
Removing unnecessary conditions
Eliminating impossible conditions
Rewriting redundant operations
Example
Query:
SELECT * FROM EMPLOYEE
WHERE SALARY > 5000 AND SALARY > 3000;
Simplified Query:
SELECT * FROM EMPLOYEE
WHERE SALARY > 5000;
Second condition is useless and removed.
Another Example
SELECT * FROM STUDENT WHERE ROLLNO = 10 AND ROLLNO = 11;
This condition is impossible. The query is simplified to return no result.
Importance of Simplification
Reduces work
Improves performance
Avoids waste of resources
Makes query execution faster
6.4 Restructuring
Meaning
In restructuring, the query is rewritten into a form that is more efficient to execute.
This includes:
Reordering joins
Moving conditions closer to data
Selecting smaller result sets
Choosing better execution path
Example
Original Query:
SELECT *
FROM EMPLOYEE E, DEPARTMENT D
WHERE E.DEPT_ID = D.DEPT_ID
AND D.NAME = 'Sales';
Restructured Query:
SELECT *
FROM EMPLOYEE E,
(SELECT * FROM DEPARTMENT WHERE NAME = 'Sales') D
WHERE E.DEPT_ID = D.DEPT_ID;
(the join column DEPT_ID is illustrative)
Filtering DEPARTMENT first reduces result size and speeds up join.