
SAI VIDYA INSTITUTE OF TECHNOLOGY

RAJANUKUNTE, BENGALURU 560 064, KARNATAKA

STUDY MATERIAL
Module-2 NOSQL

Course Title: NOSQL Course Code: BCD515C


Faculty Name: LAKSHMI DURGA.N
Department: DS Credit: 03

MODULE 2

NOSQL DATABASE

DISTRIBUTION MODELS

The Rise of NoSQL: Scaling Out with Distribution Models


The primary driver behind the adoption of NoSQL databases has been their ability to
seamlessly operate on large clusters. As data volumes escalate, scaling up by
investing in more powerful servers becomes increasingly challenging and costly. In
contrast, scaling out by distributing the database across multiple servers offers a more
attractive solution.

Benefits of Distribution Models


The aggregate-oriented design of NoSQL databases lends itself naturally to
distribution. By employing distribution models, you can create a data store capable of:
Handling vast amounts of data
Processing high volumes of read and write traffic
Ensuring greater availability despite network slowdowns or outages
While these benefits are substantial, they come at the cost of added complexity.
Therefore, distributing data across a cluster should only be considered when the
advantages outweigh the complexity.
Data Distribution Techniques

There are two primary paths to data distribution:


Replication: Duplicate data across multiple nodes.
Sharding: Distribute distinct data across separate nodes.
Notably, replication and sharding are complementary techniques that can be used
independently or in conjunction.
Replication Forms
Replication takes two forms:
Master-Slave Replication: One primary node accepts writes and replicates data to
secondary nodes.
Peer-to-Peer Replication: All nodes accept writes and replicate data among each
other.
Exploring Distribution Techniques
We will delve into the following distribution techniques, progressing from simplest to
most complex:
Single-Server: A single server manages all data and traffic.
Master-Slave Replication: A master node replicates data to slave nodes.
Sharding: Data is divided and stored across multiple nodes.
Peer-to-Peer Replication: Nodes function as both masters and slaves, replicating
data among each other.
By grasping these distribution models and techniques, developers can select the
optimal approach for their specific use case, striking a balance between scalability,
performance, and complexity.

4.1. Single Server


The simplest and most recommended option for data distribution is no distribution at
all—running the database on a single machine that manages all the reads and writes.
This method is favored because it avoids the complexities that come with distributed
systems. With a single-server setup, it’s easier for both operations teams to maintain
and for developers to work with, as there's no need to worry about synchronizing data
across multiple nodes or handling network-related issues.
Although many NoSQL databases are designed for use in clusters, there are cases
where a single-server model still makes sense. This is especially true when the
NoSQL database's data model better suits the application. A notable example is graph
databases, which typically work more efficiently when run on a single server.
Similarly, if your application primarily involves working with aggregates (e.g.,
collections of related data that are retrieved and processed together), then using a
single-server document or key-value store can be a simpler and effective solution. It
reduces the complexity of managing distributed systems and eases the workload for
application developers.
Despite the fact that more complex distribution strategies will be discussed
throughout the chapter, it’s important to note that single-server setups are often
preferable. If the application’s requirements allow for it, avoiding data distribution is
ideal. Whenever feasible, sticking to a single-server approach is the first choice, as it
minimizes potential issues and keeps the system simpler.

4.2. Sharding
Sharding is a technique used to achieve horizontal scalability by distributing
different parts of a dataset across multiple servers (or nodes). This method can greatly
enhance performance, particularly for systems where different users are accessing
different parts of the data. In an ideal scenario, each user interacts with a single server
node, which handles both reading and writing. This balances the load evenly between
the servers, so if there are ten servers, each would handle approximately 10% of the
overall load.


Figure 4.1. Sharding puts different data on separate nodes, each of which does its own
reads and writes.
However, the ideal scenario is rare. To approach it, you need to ensure that data
frequently accessed together is stored on the same node. This is where aggregate
orientation becomes useful. Aggregates are collections of data designed to be
accessed together, making them a natural unit for distribution across different servers.
When deciding how to arrange data on nodes, several factors can improve
performance:
1. Physical location-based distribution: If data is primarily accessed from specific regions,
placing the data near the users’ locations can reduce access time. For example, orders
from Boston customers could be stored in a data center on the U.S. East Coast.
2. Even load distribution: Aggregates should be spread evenly across nodes to ensure that
no single node is overloaded. This distribution might need adjustments over time based on
changes in data access patterns.
3. Sequential reading optimization: If certain data is accessed sequentially, grouping it
together can improve efficiency. For instance, organizing web pages by reversed domain
names allows multiple pages to be processed together.
Historically, manual sharding was done by embedding logic into the application,
such as assigning customers with last names starting from A-D to one shard, and so
on. However, this approach complicates programming and requires significant
changes when data needs rebalancing across shards. To simplify this, many modern
NoSQL databases offer auto-sharding, where the database itself handles data
allocation to shards and directs queries to the appropriate shard.
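
To make the idea concrete, here is a minimal sketch of hash-based shard routing in Python. The ShardedStore class, the use of MD5, and the modulo placement are illustrative assumptions, not the scheme of any particular database.

```python
import hashlib

class ShardedStore:
    """A toy key-value store that routes each key to one of N shards."""

    def __init__(self, num_shards=4):
        # Each "shard" is just an in-memory dict standing in for a node.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # Hash the key so that keys spread roughly evenly across shards.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

store = ShardedStore()
store.put("customer:42", {"name": "Ann", "city": "Boston"})
print(store.get("customer:42"))  # routed to the same shard it was written to
```

Real auto-sharding implementations typically use consistent hashing or range partitioning rather than a plain modulo, so that adding or removing a node does not force most keys to move.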
Sharding offers significant performance benefits, especially for write-heavy
applications. While replication and caching can improve read performance, sharding
is one of the few ways to horizontally scale writes. However, sharding does not
inherently improve resilience. If a node containing a shard fails, that shard’s data
becomes unavailable, affecting all users who rely on that data. Although only a
portion of the database is affected in a sharded system, partial data loss is still
problematic. In fact, sharded clusters often use less reliable machines, making node
failure more likely than in single-server systems, thus potentially reducing resilience.
Because sharding can be complex, it should not be implemented hastily. Some
databases are designed to use sharding from the start, and these should be run on
clusters early in development. For other databases, sharding is used as a scaling
option, and it's best to begin with a single-server configuration and only move to
sharding when data demands exceed the single server's capacity. However, sharding
should not be left to the last minute, as enabling it in production without proper
planning can overwhelm the database and lead to downtime. The key takeaway is to
implement sharding before it becomes absolutely necessary, while there’s still
enough capacity to handle the transition smoothly.

4.3. Master-Slave Replication


Master-slave replication is a data distribution strategy where data is replicated across
multiple nodes in a network. One node, called the master (or primary), is responsible
for handling all data updates. The other nodes, known as slaves (or secondaries),
maintain copies of the data through a replication process that keeps them
synchronized with the master.
In this setup, the master node services all write operations, meaning that any changes
or updates to the data must go through the master. However, read operations can be
performed by either the master or the slaves. This replication model is particularly
effective for scaling read-heavy datasets, as adding more slave nodes can distribute
the read load, allowing for horizontal scalability. Essentially, as more slaves are added,
more read requests can be handled without overloading the master.

Figure 4.2. Data is replicated from master to slaves. The master services all writes;
reads may come from either master or slaves.


Despite its advantages for read-heavy workloads, master-slave replication has
limitations when it comes to scaling write-heavy datasets. Since all writes go through
the master, the master’s capacity to handle write requests becomes a bottleneck.
Offloading read requests to the slaves can help reduce the load on the master, but it
does not eliminate the master’s write processing limits.
Another advantage of master-slave replication is read resilience. If the master node
fails, the slave nodes can still handle read requests, allowing the system to continue
serving data, as long as the data is mostly being read. However, during this time, no
new updates (writes) can be made until either the master is restored or one of the slave
nodes is promoted to become the new master.
Promoting a slave to a master can happen in two ways:
Manual promotion: An administrator manually designates a slave to take over as the
new master in case of failure.
Automatic promotion: The system is configured to automatically elect a new master
from the available slaves if the current master fails. This ensures faster recovery and
less downtime.
Even in systems where scaling is not the primary concern, master-slave replication
can be used for backup purposes. In such cases, all traffic (reads and writes) can go
through the master, while a slave serves as a hot backup. This setup ensures quick
recovery in case of failure, as the slave can quickly take over as the master.
To take full advantage of the read resilience provided by master-slave replication, it’s
important to separate the read and write paths in the application, ensuring that
failures in the write path (i.e., the master) don’t disrupt read operations (handled by
the slaves). This often requires setting up separate database connections for reads and
writes, a feature that may not be supported by all database libraries.
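
As a hedged sketch of what separating the read and write paths might look like in application code, the fragment below sends all writes to the master connection and rotates reads across slave connections. The Node stub and the round-robin policy are assumptions for illustration.

```python
import itertools

class Node:
    """Stand-in for a database connection; real code would use a driver."""
    def __init__(self, name):
        self.name = name

    def execute(self, statement):
        return f"{self.name} ran: {statement}"

class ReplicatedConnection:
    """Sketch: route writes to the master, spread reads across slaves."""

    def __init__(self, master, slaves):
        self.master = master                        # all writes go here
        self._read_cycle = itertools.cycle(slaves)  # round-robin over read replicas

    def write(self, statement):
        return self.master.execute(statement)

    def read(self, query):
        # Reads keep working even if the master is down, as long as slaves respond.
        return next(self._read_cycle).execute(query)

conn = ReplicatedConnection(Node("master"), [Node("slave-1"), Node("slave-2")])
conn.write("update phone set number = '555-0100'")
print(conn.read("select number from phone"))  # served by slave-1
print(conn.read("select number from phone"))  # served by slave-2
```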
While master-slave replication offers significant benefits, it also introduces the risk of
data inconsistency. Because updates propagate from the master to the slaves, there’s
always a delay in synchronization, meaning different slaves could have slightly
different versions of the data at any given moment. This delay creates a risk where a
client reading from one slave might see outdated data or may not see an update it just
made via the master, leading to read-write inconsistency.
Inconsistency is a critical challenge in master-slave systems, especially when a master
fails. If updates haven't fully propagated to all slaves, data loss can occur, as the most
recent updates might not have been replicated. This is something to be aware of when
using replication, and later discussions on consistency can help address these
concerns.

4.4. Peer-to-Peer Replication


Peer-to-peer replication is a distribution model where all nodes in a cluster are equal.
Unlike master-slave replication, there is no single authoritative master node. Instead,
all nodes (or replicas) in the system can handle both reads and writes, and if one node
fails, the others can continue operating without losing access to data. This makes peer-
to-peer replication a highly resilient and scalable option.
In this model, the ability for any node to accept writes means there’s no single point
of failure, which is a significant advantage over the master-slave setup. Since each
node can handle write operations, the system is better suited for handling write-heavy
workloads, and adding more nodes can enhance both read and write performance.
Additionally, the loss of any single node does not affect the overall availability of the
system because the other nodes can continue to serve data.

Figure 4.3. Peer-to-peer replication has all nodes applying reads and writes to all the
data.
However, the primary challenge with peer-to-peer replication is ensuring consistency.
When multiple nodes are allowed to write to the same dataset, the risk of write-write
conflicts arises. This occurs when two users or processes attempt to update the same
record on different nodes at the same time. Inconsistent read operations—where
different nodes may show slightly different versions of data due to delays in
replication—are less concerning because they are usually temporary and can be
resolved as the system synchronizes. But inconsistent writes are more problematic,
as they can result in permanent data inconsistencies that need to be resolved manually
or through automated conflict resolution mechanisms.
To handle these inconsistencies, two broad strategies are typically
used:
Coordination of writes: In this approach, before a write is confirmed, the replicas
communicate with each other to ensure consistency. They can agree on the validity of
a write, preventing conflicts from happening in the first place. This coordination
process can be achieved by using a majority vote among the nodes (i.e., as long as a
majority of the replicas agree on a write, it is considered valid). While this strategy
guarantees consistency similar to that of a master-slave model, it comes at the cost of
increased network traffic and slower write performance due to the time required for
coordination.
Handling inconsistent writes: At the opposite end of the spectrum, the system may
allow inconsistent writes to occur and then rely on conflict resolution policies to
merge or resolve these inconsistencies later. This strategy provides the full
performance benefit of peer-to-peer replication, as there is no need to coordinate
between nodes before writing, but it requires a solid plan for dealing with conflicting
data. Certain applications or domains can tolerate these inconsistencies if appropriate
merge strategies are in place.
Ultimately, peer-to-peer replication involves a trade-off between consistency and
availability. You can opt for stronger consistency by coordinating writes, but this
sacrifices some of the system's availability and performance. Alternatively, you can
prioritize availability and performance by allowing writes to any replica, but this
increases the likelihood of inconsistencies that need to be resolved. The choice
depends on the specific needs of the application and the importance of data
consistency versus system availability.

4.5. Combining Sharding and Replication


Combining replication and sharding is a strategy that allows for both increased
scalability and resilience in distributed systems. Here's how it works:
Master-Slave Replication with Sharding:
In this approach, the data is first sharded across different nodes. Each shard holds a
different portion of the dataset, so a shard can have a separate master and its replicas
(slaves).
Each shard's master node is responsible for processing writes and synchronizing the
updates with its slave nodes.
This means that while each shard has its own master for data replication, there is no
single master for the entire dataset. The dataset is distributed across multiple master
nodes, and each master is responsible for different portions of the data.

Figure 4.4. Using master-slave replication together with sharding


A master-slave replication system ensures that each shard’s data is replicated across
the slaves. Depending on your configuration, a single node might act as a master for
one shard and as a slave for another.
Advantages: By combining sharding and replication, you benefit from horizontal
scalability (sharding distributes the data load across multiple servers) and read
scalability and resilience (master-slave replication allows for faster reads and better
fault tolerance, since slaves can continue serving read requests even if the master fails).

Peer-to-Peer Replication with Sharding:


In this model, sharding is used to divide the dataset among multiple nodes, but
instead of having a master-slave relationship, peer-to-peer replication is used. Here,
all nodes are treated equally, and each shard’s data is replicated across multiple nodes
without a single authoritative master.
A typical configuration might use a replication factor of 3, meaning each shard's data
is stored on three different nodes. This allows the system to withstand the failure of
one or two nodes without data loss because the remaining replicas can continue
serving data.
If a node fails, the system will automatically rebuild the shards that were on the failed
node onto other nodes in the cluster, ensuring that there are always three copies of the
data available.
Advantages: Peer-to-peer replication provides strong fault tolerance and high
availability since any node can accept read and write requests. Sharding ensures that
the data load is spread across multiple nodes, improving scalability for large datasets
and workloads.

Figure 4.5. Using peer-to-peer replication together with sharding


In summary, by combining sharding with replication, you get the best of both
worlds: sharding distributes the data across multiple nodes for scalability, while
replication ensures fault tolerance and availability by maintaining multiple copies of
the data across different nodes. Whether you use master-slave replication or peer-
to-peer replication depends on the specific needs of your system, such as the balance
between consistency, availability, and scalability.

4.6. Key Points


• There are two styles of distributing data:
• Sharding distributes different data across multiple servers, so each server acts as the
single source for a subset of data.
• Replication copies data across multiple servers, so each bit of data can be found in
multiple places. A system may use either or both techniques.
• Replication comes in two forms:
• Master-slave replication makes one node the authoritative copy that handles writes
while slaves synchronize with the master and may handle reads.
• Peer-to-peer replication allows writes to any node; the nodes coordinate to
synchronize their copies of the data. Master-slave replication reduces the chance of
update conflicts but peer-to-peer replication avoids loading all writes onto a single
point of failure.


CONSISTENCY
When transitioning from a centralized relational database to a cluster-based
NoSQL database, one of the most significant changes is how you approach
consistency. In relational databases, the goal is often to maintain strong consistency,
ensuring that inconsistencies are minimized or avoided altogether. However, in the
world of NoSQL, terms like the CAP theorem and eventual consistency become
more prominent, and you’re quickly faced with decisions about the level of
consistency your system requires.
Consistency in databases comes in different forms, and the term "consistency" can
refer to a wide range of potential issues that may arise. So, it's important to first
understand the different types of consistency models that exist. Then, we can explore
why you might choose to relax consistency or even durability under certain
conditions, especially in NoSQL systems.

5.1. Update Consistency


Let's consider a situation where two people, Martin and Pramod, both notice that a
company's phone number on the website is out of date. Coincidentally, they both try
to update it at the same time, but they use slightly different formats. This is known as
a write-write conflict, where two people are updating the same data simultaneously.
When these updates reach the server, the server has to serialize them, meaning it must
decide which update to apply first. For example, it might choose to apply Martin's
update before Pramod's based on alphabetical order. In this case, Martin's update
would be overwritten by Pramod's because there was no control over concurrency.
This is called a lost update. While this example may not seem like a big issue, lost
updates can be critical in many cases.
This situation is viewed as a failure in consistency, because Pramod's update was
based on the data before Martin's change, yet it was applied afterward, essentially
ignoring Martin's input.
There are two common approaches to maintaining consistency in these situations:
pessimistic and optimistic approaches.
• Pessimistic approach: This approach prevents conflicts from happening. The most common method is using write locks, where only one person (Martin, in this case) can obtain the lock to update the value at a time. Once Martin has the lock and updates the value, Pramod would see the result of Martin's update before attempting his own change.
• Optimistic approach: This approach allows conflicts to occur but detects them and resolves them afterward. For example, after Martin's update succeeds, Pramod's update would fail because the system would check whether the data had changed before Pramod's update was applied. In this case, Pramod would receive an error, prompting him to check the updated value and decide whether to retry his update (see the sketch below).
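
Below is a minimal sketch of the optimistic approach, assuming a store that keeps a version number alongside each value; the class and method names are illustrative, not a specific database API.

```python
class VersionConflict(Exception):
    pass

class OptimisticStore:
    """Each value carries a version; writes must name the version they read."""

    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        return self._data.get(key, (None, 0))  # returns (value, version)

    def write(self, key, value, expected_version):
        _, current = self._data.get(key, (None, 0))
        if current != expected_version:
            # Someone else updated the key since we read it: reject, don't overwrite.
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{current}")
        self._data[key] = (value, current + 1)

store = OptimisticStore()
_, v = store.read("phone")
store.write("phone", "555-0100", v)        # Martin's update succeeds
try:
    store.write("phone", "555 0100", v)    # Pramod reused the stale version
except VersionConflict as e:
    print("retry needed:", e)
```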
Both the pessimistic and optimistic approaches rely on consistent serialization of
updates. With a single server, this is easy since the server can only apply one update
at a time. However, in distributed systems with peer-to-peer replication, different
nodes might apply the updates in a different order, leading to inconsistent data across
nodes.
Another optimistic solution to handle write-write conflicts is to save both updates
and record the conflict. This is similar to how version control systems handle
conflicting commits. The user can then be asked to resolve the conflict manually (e.g.,
deciding which phone number format to keep), or the system might attempt to
automatically merge the updates (e.g., standardizing the phone number format).
Many people, when first encountering these types of conflicts, instinctively prefer
pessimistic concurrency to avoid them. However, there is always a trade-off
between safety (avoiding errors like lost updates) and liveness (quick system
responses). Pessimistic approaches can slow down a system significantly, especially
when multiple users are trying to update the same data. Additionally, pessimistic
concurrency can introduce other problems, such as deadlocks, which are difficult to
debug.
In distributed systems, replication increases the likelihood of write-write conflicts
because different copies of the same data can be independently updated. One way to
manage this is to have a single node handle all writes, which simplifies maintaining
update consistency. Many distribution models (except peer-to-peer replication)
follow this pattern to avoid conflicts.


5.2. Read Consistency


This section discusses two important aspects of database consistency, logical consistency and replication consistency, and how various scenarios in both relational and NoSQL databases can affect them. Let's break it down:

Figure 5.1. A read-write conflict in logical consistency

1. Logical Consistency and Read-Write Conflicts


The example starts with an order that has line items and a shipping charge. The
shipping charge depends on the total price of the line items. If Martin adds a line
item and updates the shipping charge afterward, but Pramod reads the line items and
shipping charge before the shipping charge update, then Pramod will see an
inconsistent view of the data. This is a read-write conflict, where the read happens
during an ongoing write operation, causing an inconsistent read. This is an example
of logical inconsistency—the shipping charge and line items don't "make sense"
together at the time Pramod reads them.

In relational databases, this issue is avoided using transactions, where multiple operations (like updating line items and the shipping charge) are treated as a single atomic unit. If Martin wraps his updates inside a transaction, Pramod will either see all the updates or none, maintaining logical consistency.
2. NoSQL and Aggregate-Oriented Databases
There's a common claim that NoSQL databases don't support transactions, leading to
inconsistencies. However, this isn’t entirely accurate. Some NoSQL databases,
especially aggregate-oriented ones (which group related data together in an
aggregate), support atomic updates—but only within a single aggregate. This means
that as long as all related data (line items, shipping charge) is part of a single
aggregate, logical consistency can still be achieved. However, updates affecting
multiple aggregates introduce a potential inconsistency window, where clients might
read inconsistent data for a short period.
For example, in Amazon SimpleDB, this inconsistency window is usually very short,
often less than a second.
3. Replication Consistency and Eventual Consistency
The introduction of replication (data stored across multiple nodes or servers) leads to
a new type of inconsistency called replication inconsistency. Consider the case
where Martin in London and Cindy in Boston are trying to book the last room in a
hotel, and Pramod in Mumbai books it first. The system updates that the room is
booked, but due to network delays, Cindy sees the booking immediately, while
Martin still sees the room as available because the update hasn’t reached his server
yet. This is an inconsistent read caused by the lag in updating replicas (different
copies of the data on different servers).


Figure 5.2. An example of replication inconsistency

This situation leads to eventual consistency, where updates will eventually reach all
replicas, but in the meantime, clients may see stale (outdated) data. The inconsistency
will be resolved after a short period, but there’s still a window during which
inconsistent data may be read.
4. Inconsistency Windows and Different Levels of Consistency
Replication consistency can worsen logical inconsistencies by extending the time of
the inconsistency window. For example, an inconsistency window on the master node
might be very short, but network delays can cause it to last much longer on slave
nodes.
Applications often allow developers to choose different levels of consistency for specific operations. Weak consistency can be acceptable most of the time, but for critical operations, strong consistency might be required.
Lakshmi Durga.N Dept of DS, SVIT 16


MODULE 2 NOSQL DATABASE

5. Read-Your-Writes and Session Consistency


In some cases, like posting a comment on a blog, inconsistency windows aren’t a
major issue, but users expect read-your-writes consistency—meaning that once they
post a comment, they should immediately see it in future reads.
To ensure this, you can use session consistency, where during a user's session, they
can always read their most recent updates. One method is using sticky sessions,
where a user is tied to the same server throughout their session, ensuring consistent
reads from that server. This, however, reduces the effectiveness of load balancing.
Another method is to use version stamps, where every interaction with the database
includes the latest version of the data that the session has seen. This way, the server
knows to apply any updates before responding.
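
A minimal sketch of version-stamp-based session consistency follows, under the assumption that each replica exposes a monotonically increasing version and the session remembers the highest version it has seen; the classes are illustrative only.

```python
class Replica:
    """Stand-in replica: holds data plus a monotonically increasing version."""
    def __init__(self):
        self.version = 0
        self.data = {}

class Session:
    """Read-your-writes: remember the newest version this session has seen."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.last_seen = 0

    def write(self, key, value, master):
        master.version += 1
        master.data[key] = value
        self.last_seen = master.version  # note the version stamp of our write

    def read(self, key):
        # Only accept answers from replicas at least as fresh as our last write.
        for r in self.replicas:
            if r.version >= self.last_seen:
                return r.data.get(key)
        raise RuntimeError("no replica is fresh enough yet; retry or go to master")

master = Replica()
lagging = Replica()
session = Session([lagging, master])
session.write("comment:1", "first!", master)
print(session.read("comment:1"))  # served by master; the lagging replica is skipped
```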
6. Application Design and Replication Consistency
The concept of replication consistency is not just limited to databases but is
important for overall application design. When users interact with data—reading and
then updating it—designing the system to handle these scenarios carefully is crucial.
It’s risky to keep transactions open during user interactions, as it may lead to conflicts
when the user tries to update the data. One way to handle this is using offline locks or
similar strategies to avoid conflicts.
• Logical consistency ensures that related data makes sense together and can be managed using transactions in relational databases or atomic updates within aggregates in NoSQL databases.
• Replication consistency deals with keeping data consistent across different replicas; eventual consistency means that all replicas will eventually sync, though there might be temporary inconsistencies.
• Techniques like read-your-writes consistency, sticky sessions, and version stamps help manage these inconsistencies in distributed systems.

The CAP Theorem


This section examines the CAP theorem, its implications for NoSQL databases, and the trade-offs between consistency, availability, and partition tolerance in distributed systems. Let's break down the key points for clarity:


Figure 5.3. With two breaks in the communication lines, the network partitions into two groups.

1. Understanding the CAP Theorem


Origins: Proposed by Eric Brewer in 2000 and formalized by Seth Gilbert and
Nancy Lynch a few years later, the CAP theorem states that in a distributed system,
you can only achieve two of the following three properties:

• Consistency: All nodes in the system see the same data at the same time.
• Availability: Every request to a non-failing node receives a response, either with the requested data or an error.
• Partition Tolerance: The system continues to operate despite network partitions, which separate nodes from each other.

2. Definitions of CAP Properties


Consistency: This aligns with the earlier definition; all nodes must return the same
data for a given request.
Availability: In the CAP context, it means if a node is operational, it will respond to every request. Importantly, a node can be non-operational without affecting the overall definition of availability.
Partition Tolerance: The system can handle communication failures, allowing it to
continue functioning even when nodes can’t communicate with each other (a scenario
known as "split brain").
3. Single-Server vs. Cluster Systems
A single-server system can guarantee consistency and availability but lacks partition
tolerance since it cannot be partitioned. This is typical of many relational databases.
While it’s theoretically possible to create a Consistency-Availability (CA) cluster,
doing so would mean that if a network partition occurs, all nodes would become
unavailable to maintain consistency. This is impractical and costly to implement.
4. Trade-offs in Distributed Systems
In practice, the CAP theorem highlights the need to trade off consistency for
availability when dealing with network partitions. Systems often operate under a
more nuanced approach, allowing for some level of inconsistency to maintain
availability.
It is not a strict binary decision; systems can adjust the level of consistency they offer
based on their needs.
5. Practical Example: Hotel Booking System
Consider a hotel booking system where both Martin in London and Pramod in
Mumbai attempt to book the last available room.

• If both nodes require communication to ensure consistency, a network failure would prevent either from booking, sacrificing availability.
• Alternatively, designating one node as the master for bookings allows that node to handle reservations independently, which improves availability but risks showing inconsistent data to users.

In scenarios where both nodes accept bookings, there’s a chance for overbooking
(e.g., both Martin and Pramod booking the same room). Some businesses accept this
risk due to their operational models.


6. Handling Inconsistent Writes


The concept of allowing inconsistent writes is exemplified by shopping carts in
systems like Dynamo. Users can add items to their carts regardless of potential
network issues, with the system merging multiple carts during the checkout process.
This approach emphasizes that developers can often accommodate inconsistent
answers, increasing availability and responsiveness.
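
As a hedged illustration of handling inconsistent writes, the sketch below merges two divergent shopping-cart replicas at read time, loosely in the spirit of the Dynamo example; the cart representation and the merge rule (keep every item, take the larger quantity) are assumptions.

```python
def merge_carts(*cart_versions):
    """Union-merge divergent cart replicas: keep every item, take the larger quantity.

    A union never loses an added item, though it can resurrect a deleted one;
    that is the trade-off Dynamo-style systems accept for availability.
    """
    merged = {}
    for cart in cart_versions:
        for item, qty in cart.items():
            merged[item] = max(merged.get(item, 0), qty)
    return merged

# Two replicas accepted writes independently during a network partition:
cart_on_node_a = {"book": 1, "lamp": 1}
cart_on_node_b = {"book": 2}  # quantity bumped on the other node
print(merge_carts(cart_on_node_a, cart_on_node_b))
# {'book': 2, 'lamp': 1}
```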
7. Read Consistency Considerations
The tolerance for stale reads varies by application:

• In financial systems, real-time data accuracy is crucial.
• In media websites, outdated pages may be acceptable for a short time.

Understanding the acceptable duration of inconsistency (or the inconsistency window) helps tailor system behavior.
8. ACID vs. BASE
NoSQL advocates often contrast the ACID properties of traditional relational
databases with the BASE properties (Basically Available, Soft state, Eventual
consistency). However, BASE lacks precise definitions and is viewed as less useful
than ACID.
Brewer originally intended the ACID-BASE tradeoff to be a spectrum rather than a
binary choice, suggesting that both consistency and availability exist along a
continuum.
9. Reframing the CAP Trade-offs
Instead of strictly framing the CAP theorem as a conflict between consistency and
availability, it is more productive to think of it in terms of consistency vs. latency.
As more nodes participate in a request, response times may increase. Availability can
be defined by the acceptable latency level; if response times exceed this threshold, the
data might be treated as unavailable.
The CAP theorem illustrates the inherent trade-offs in distributed systems,
particularly regarding consistency and availability. By understanding these concepts
and their implications, developers can design systems that meet specific needs while
tolerating certain inconsistencies. This approach can lead to more efficient,
responsive applications that are resilient in the face of network challenges.


5.4. Relaxing Durability


This section examines the trade-offs between durability and performance in database systems, particularly in the context of ACID properties. Let's break down the key concepts and examples for clarity:
1. Understanding Durability in ACID
Durability is one of the four ACID properties of database transactions (Atomicity,
Consistency, Isolation, Durability). It guarantees that once a transaction is committed,
it will persist even in the event of a system failure (e.g., a crash or power loss).
While durability is crucial for data integrity, there are scenarios where sacrificing
some degree of durability can lead to significant performance improvements.
2. Trading Off Durability for Performance
When a database operates mainly in memory and periodically flushes updates to disk,
it can provide faster response times. This approach, however, comes with the risk that
any changes made since the last flush could be lost if the server crashes.
Example Scenarios:
User-Session State:
In a high-traffic website, maintaining user-session data is critical for responsiveness.
This data can change frequently, leading to a high demand for quick read/write
operations.
Losing session data is typically not catastrophic; while it might annoy users, a slower
website would have a more significant negative impact. Therefore, it’s a good
candidate for nondurable writes, where updates are made in memory but not
immediately persisted to disk.
It’s often possible to specify durability needs on a call-by-call basis, allowing more
critical updates to force an immediate flush to disk.
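
A minimal sketch of call-by-call durability follows, assuming an in-memory store with an explicit durable flag that forces an immediate flush; the JSON file flush is purely illustrative.

```python
import json

class RelaxedStore:
    """Writes land in memory; only flagged writes are flushed to disk at once."""

    def __init__(self, path="store.json"):
        self.path = path
        self.memory = {}

    def put(self, key, value, durable=False):
        self.memory[key] = value
        if durable:
            self._flush()  # critical updates pay the disk cost immediately

    def _flush(self):
        # A periodic background flush would also call this.
        with open(self.path, "w") as f:
            json.dump(self.memory, f)

store = RelaxedStore()
store.put("session:9f2", {"cart": ["book"]})               # fast, nondurable
store.put("order:1001", {"status": "paid"}, durable=True)  # must survive a crash
```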
Capturing Telemetric Data:
In scenarios where data is collected from devices (like sensors), capturing data at high
rates is often prioritized over the durability of each individual data point.
If the server crashes, it may lose the latest updates, but the benefit of capturing data
quickly might outweigh the downsides of occasional data loss.


3. Replication Durability Trade-offs


Replicated Data: In systems that replicate data across multiple nodes for redundancy,
durability can be compromised if an update is processed by a master node but not yet
replicated to its slaves (backup nodes).
Example of Replication Failure:

• In a master-slave distribution model, if the master node fails before updates are sent to the slaves, those updates will be lost.
• When the master node comes back online, it could have conflicting updates with those made to the slaves while it was down. This scenario highlights a durability issue, since the master acknowledges the update to the client, leading the client to believe the update was successful when it actually wasn't fully replicated.

4. Strategies to Improve Replication Durability

One way to mitigate the risk of losing updates during replication is to require the
master node to wait for acknowledgments from some replicas before confirming an
update to the client. This strategy can improve durability but has trade-offs:

• Increased Latency: Waiting for acknowledgments slows down the update process.
• Reduced Availability: If slaves are down or slow to respond, the cluster may become unavailable for updates.

5. Flexible Durability Needs

As with general durability, it's advantageous to allow individual calls to specify their
desired level of durability. This flexibility enables developers to balance performance
and reliability based on the specific needs of each operation.

In summary, while durability is a critical property of reliable data storage, there are
cases where relaxing this requirement can lead to improved performance and
responsiveness in database systems. Understanding the trade-offs involved—especially
in high-traffic applications or when managing replicated data—enables developers to
make informed decisions about how best to handle data consistency and persistence.


This approach can enhance system efficiency while accommodating the potential for
data loss in less critical scenarios.

5.5. Quorums

This section examines the trade-offs between consistency, availability, and durability in distributed database systems, focusing on how to achieve strong consistency through careful management of node interactions. Let's break down the key concepts:

1. Trade-offs in Consistency and Durability

Not All or Nothing: When considering consistency and durability, it's important to
understand that you don't have to fully sacrifice one for the other. Instead, you can
adjust the number of nodes involved in operations to find a suitable balance.

2. Quorums for Writes and Reads

Replication Factor: When data is replicated across multiple nodes, the replication
factor (N) represents the total number of copies of the data in the system.

Write Quorum (W): To ensure strong consistency, a majority of nodes must acknowledge a write operation. For example, with a replication factor of 3, at least 2 nodes must confirm a write (W > N/2). This majority acknowledgment prevents conflicting writes, since only one write can achieve a majority.

Read Quorum (R): The read quorum determines how many nodes you must contact to
ensure you receive the most recent write. The complexity arises because the read
quorum depends on the write quorum:

• If W = 2 (two nodes must confirm a write), then R must also be at least 2 to guarantee that you read the latest data.
• If W = 1 (only one node confirms a write), you must contact all 3 nodes (R = 3) to ensure you retrieve the most current update. In this case, if a write quorum isn't met, conflicts can occur, but contacting enough nodes for reads helps detect any inconsistencies.


3. Inequalities for Strongly Consistent Reads

The relationship between reads, writes, and replication can be summarized with the
inequality: R + W > N. This means that the total number of nodes contacted for reads
and writes must exceed the total number of replicas to guarantee strong consistency in
reads.
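
These inequalities are straightforward to encode. The sketch below checks whether a given (R, W, N) configuration yields conflict-free writes and strongly consistent reads; the function names are illustrative.

```python
def writes_conflict_free(w, n):
    # W > N/2: at most one of two concurrent writes can reach a majority.
    return w > n / 2

def reads_strongly_consistent(r, w, n):
    # R + W > N: every read quorum overlaps every write quorum.
    return r + w > n

N = 3
for r, w in [(2, 2), (1, 3), (3, 1), (1, 1)]:
    print(f"R={r}, W={w}:",
          "strong reads" if reads_strongly_consistent(r, w, N) else "stale reads possible",
          "| conflict-free writes" if writes_conflict_free(w, N) else "| write conflicts possible")
```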

4. Peer-to-Peer vs. Master-Slave Distribution

• In a peer-to-peer distribution model, nodes are equivalent and can participate in both reads and writes. This requires managing quorums actively.
• In a master-slave model, only the master node handles writes, simplifying the process. Here, writes don't need to reach a quorum among all nodes, since only the master node needs to be updated.

5. Choosing the Right Quorums

• The number of nodes contacted during an operation can vary based on the operation's requirements. For instance:
  • A critical update may require a write quorum to ensure consistency.
  • A read operation might prioritize speed and can tolerate some staleness, requiring fewer nodes to respond.

6. Flexible Configuration of Writes and Reads

• It's possible to configure the system to favor different aspects depending on the context. For example:
  • For fast, strongly consistent reads, you might set a high write quorum (W = 3) and allow reads from just one node (R = 1). This configuration means every write must be confirmed by all nodes, making writes slower but allowing rapid reads.
  • In scenarios where you can tolerate some staleness or slower reads, you can adjust the number of nodes involved in writes and reads to optimize performance and availability.


The key takeaway is that there are multiple strategies and combinations of node
interactions that can be tailored to the specific needs of an application. This flexibility
stands in contrast to the simplistic view of a binary trade-off between consistency and
availability often discussed in NoSQL literature.

By understanding these dynamics and the implications of quorum configurations, developers can make more informed decisions about how to design systems that best meet their requirements for consistency, availability, and performance.

Key Points
• Write-write conflicts occur when two clients try to write the same data at the same time. Read-write conflicts occur when one client reads inconsistent data in the middle of another client's write.
• Pessimistic approaches lock data records to prevent conflicts. Optimistic approaches
detect conflicts and fix them.
• Distributed systems see read-write conflicts due to some nodes having received
updates while other nodes have not. Eventual consistency means that at some point
the system will become consistent once all the writes have propagated to all the nodes.
• Clients usually want read-your-writes consistency, which means a client can write
and then immediately read the new value. This can be difficult if the read and the
write happen on different nodes.
• To get good consistency, you need to involve many nodes in data operations, but
this increases latency. So you often have to trade off consistency versus latency.
• The CAP theorem states that if you get a network partition, you have to trade off
availability of data versus consistency.
• Durability can also be traded off against latency, particularly if you want to survive
failures with replicated data.
• You do not need to contact all replicas to preserve strong consistency with replication; you just need a large enough quorum.


Version Stamps

Critics of NoSQL databases often highlight their lack of support for transactions, which
are essential for maintaining data consistency. Transactions allow developers to bundle
multiple operations into a single, atomic unit, ensuring that either all operations
succeed or none do. However, many advocates of NoSQL argue that the absence of
traditional transactions is not as concerning as it seems. This is largely because
aggregate-oriented NoSQL databases provide atomic updates within aggregates, which
are designed to handle related data as a cohesive unit.

Nonetheless, it's crucial to consider transactional needs when selecting a database.


Transactions come with their own limitations. Even in systems that support transactions,
there are instances where updates require human input, making it impractical to keep a
transaction open for extended periods. In such cases, we can use version stamps to
manage these updates. Version stamps can help track changes and resolve conflicts,
making them particularly useful in distributed systems where data may be updated
simultaneously by multiple sources.

Business and System Transactions

To maintain update consistency without relying solely on transactions, systems often implement strategies that are effective even in transactional databases. When users
think about transactions, they typically consider business transactions, such as
browsing a product catalog, selecting items, entering payment information, and
confirming purchases. However, these processes usually do not occur within a single
database transaction, as that would require locking database elements while users
search for their credit card or attend to other interruptions.

Typically, applications initiate a system transaction only at the conclusion of user interactions to minimize the duration of lock holds. The challenge here is that data may
have changed during the interaction, affecting decisions and calculations. For example,
the price of a product might change, or a customer's address may be updated, leading to
different shipping costs.


To manage these challenges, various techniques can be employed, such as offline concurrency, which is also applicable in NoSQL contexts. A particularly effective
method is the Optimistic Offline Lock, which is a form of conditional update. This
technique involves the client re-reading any relevant data before finalizing the
transaction to ensure it hasn't changed since the original read. This is often
implemented using version stamps, which are fields that change whenever the
underlying data is modified. By noting the version stamp at the time of reading, the
system can verify whether the data has been altered before writing updates.

This approach is similar to updating resources in HTTP, where servers use etag headers.
An etag is a unique string representing the version of a resource. When updating a
resource, the client can supply the previous etag. If the resource has changed on the
server, the etags will not match, and the server will return a "412 Precondition Failed"
response, preventing outdated updates.

Some databases offer conditional update mechanisms to ensure that updates are not
based on stale data. While you can implement this check manually, it requires ensuring
that no other process can modify the resource between your read and update, a practice
known as compare-and-set (CAS) operations. In databases, this means comparing a
version stamp rather than a value.
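
Below is a hedged sketch of a compare-and-set update built on version stamps, mirroring the etag flow described above; the store, its lock, and the retry loop are assumptions for illustration, not a particular database's API.

```python
import threading

class CASStore:
    """Conditional update: the write only applies if the stamp still matches."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value, self._stamp = None, 0

    def read(self):
        return self._value, self._stamp

    def compare_and_set(self, new_value, expected_stamp):
        with self._lock:  # the store, not the client, guarantees atomicity
            if self._stamp != expected_stamp:
                return False  # stale stamp, like HTTP's 412 Precondition Failed
            self._value, self._stamp = new_value, self._stamp + 1
            return True

def update_with_retry(store, transform):
    # Standard optimistic loop: re-read and retry until the CAS succeeds.
    while True:
        value, stamp = store.read()
        if store.compare_and_set(transform(value), stamp):
            return

store = CASStore()
update_with_retry(store, lambda v: (v or 0) + 1)
print(store.read())  # (1, 1)
```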

Several methods can be used to construct version stamps:

Counter: A simple counter is incremented with each update, making it easy to determine which version is newer. However, it requires a centralized server to manage the counter and prevent duplicates.

GUID: A globally unique identifier can be generated by any system and is unlikely to
produce duplicates. While GUIDs are unique, they are large and cannot be directly
compared for recency.

Content Hash: This involves creating a hash of the resource's content. It provides
uniqueness and can be generated by any node. However, like GUIDs, hashes cannot be
compared for recency and may be lengthy.


Timestamps: Using the last update's timestamp allows for easy comparisons for
recency and requires no central management. However, synchronized clocks across
multiple machines are essential to prevent corruption and ensure uniqueness.

Combining these methods can yield a composite version stamp, leveraging the
strengths of each approach. For instance, CouchDB employs a combination of counters
and content hashes, enabling effective conflict detection during peer-to-peer replication.
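
As a hedged sketch of a composite version stamp in the spirit of the CouchDB approach mentioned above, the fragment below pairs a generation counter with a content hash; this is illustrative only, not CouchDB's actual revision algorithm.

```python
import hashlib
import json

def composite_stamp(generation, document):
    """Counter gives ordering within one lineage; hash identifies the content."""
    content = json.dumps(document, sort_keys=True).encode("utf-8")
    return f"{generation}-{hashlib.md5(content).hexdigest()}"

doc = {"phone": "555-0100"}
rev1 = composite_stamp(1, doc)
doc["phone"] = "555-0101"
rev2 = composite_stamp(2, doc)
print(rev1)  # "1-<hash of the first content>"
print(rev2)  # the generation says rev2 is later; differing hashes show changed content
```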

In addition to preventing update conflicts, version stamps facilitate session consistency, ensuring that users have a consistent view of data throughout their interactions. This is
particularly important in environments where multiple updates may occur
simultaneously, helping to maintain a coherent user experience while navigating
potential data changes.

6.2. Version Stamps on Multiple Nodes

Version Stamps in Data Consistency

In distributed systems, especially in peer-to-peer models, maintaining consistency and managing updates without a single authoritative source presents unique challenges.
Version stamps are critical in tracking the state of data across multiple nodes, but their
implementation must be adapted from traditional systems where a single master
controls the versioning.

The Challenge of Peer-to-Peer Distribution

In a traditional setup, such as a master-slave model, the master node generates and
manages version stamps, ensuring that all updates follow a consistent order. However,
in a peer-to-peer model, multiple nodes can independently update data, leading to
potential discrepancies. If two nodes return different version stamps when queried, it
indicates that they may have different data states.


Types of Version Stamps

Counters: The simplest form of version stamping involves using a counter that
increments with each update. For instance, if Node Blue has a version stamp of 4 and
Node Green has a stamp of 6, you can easily determine that Green's data is more
current. This method works well for single-master scenarios but does not suffice in
cases where multiple nodes can make updates independently.

Timestamps: While timestamps are another common approach, they can lead to
complications in ensuring all nodes maintain synchronized clocks. Clock drift can
result in inconsistencies, making it challenging to resolve conflicts effectively.
Timestamps also do not provide sufficient information to detect write-write conflicts,
which limits their utility to scenarios with a single master.

Vector Stamps: The most robust approach in peer-to-peer NoSQL systems is the use
of vector stamps. A vector stamp consists of a set of counters—one for each
participating node. For example, a vector stamp for three nodes (Blue, Green, Black)
might look like this: [blue: 43, green: 54, black: 12]. Each node updates its own counter
independently upon making a change, which allows for tracking the versioning of data
across the system.

Synchronization of Vector Stamps

Whenever two nodes communicate, they synchronize their vector stamps. This process
helps determine the state of data across the network. The comparison of vector stamps
provides insight into which version is newer:

• Newer Version: If all counters in one vector stamp are greater than or equal to those in the other, it can be considered the newer version.
• Conflict Detection: If each stamp has some counter greater than the corresponding counter in the other, a write-write conflict has occurred, indicating that both nodes made changes that need to be reconciled.


Handling Missing Values

When nodes are added or if a node has not been involved in recent updates, its counter
may be missing from the vector. Missing values are treated as zero, allowing the
system to accommodate new nodes without invalidating the existing version stamps.
For example, [blue: 6, black: 2] is interpreted as [blue: 6, green: 0, black: 2].
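
A minimal sketch of comparing two vector stamps follows, treating missing counters as zero as described above; representing stamps as Python dictionaries is an assumption.

```python
def compare_vector_stamps(a, b):
    """Return 'a_newer', 'b_newer', 'equal', or 'conflict' for two stamps.

    Stamps are dicts like {"blue": 43, "green": 54}; absent nodes count as 0.
    """
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"  # each side saw a write the other missed
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"

print(compare_vector_stamps({"blue": 6, "black": 2},
                            {"blue": 6, "green": 0, "black": 2}))  # equal
print(compare_vector_stamps({"blue": 7, "black": 2},
                            {"blue": 6, "black": 3}))              # conflict
```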

Conflict Resolution and Consistency Tradeoff

While vector stamps are effective for identifying inconsistencies, they do not resolve
conflicts. The resolution depends on the specific application domain and its
requirements. This highlights the consistency/latency tradeoff:

• Network Partitions: In the event of network partitions, you may face challenges in keeping the system both available and consistent.
• Conflict Management: You can either choose to handle inconsistencies when they arise or risk making the system unavailable during such partitions.

Vector stamps provide a sophisticated method for maintaining consistency in distributed databases. They allow systems to track updates across multiple nodes
effectively while facilitating the identification of inconsistencies. However, the actual
resolution of conflicts remains a challenge that must be addressed based on the specific
context and requirements of the application, reflecting the inherent tradeoffs between
consistency, availability, and latency in distributed systems.

Key Points
• Version stamps help you detect concurrency conflicts. When you read data, then
update it, you can check the version stamp to ensure nobody updated the data between
your read and write.
• Version stamps can be implemented using counters, GUIDs, content hashes,
timestamps, or a combination of these.
• With distributed systems, a vector of version stamps allows you to detect when
different nodes have conflicting updates.
