Unit IV Rdbms
Attack of the Clusters - The Emergence of NoSQL - Aggregate Data Models : Aggregates - Key-Value and Document
Data Models - Column-Family Stores - Summarizing Aggregate-Oriented Databases.
1. The value of Relational databases
Relational databases have become such an embedded part of our computing culture that it’s easy to take them for
granted. It’s therefore useful to revisit the benefits they provide.
1.1 Getting at persistent Data
A database’s main job is to safely store large amounts of information for a long time. Unlike main memory, which is fast
but temporary, databases use backing storage (like disks) that keeps data even when the computer shuts down. Unlike
simple files, databases make it easy to store huge amounts of data and quickly access just the parts you need.
1.2 Concurrency
In big business (enterprise) applications, many people often use the same data at the same time, and some may even try
to change it. Most of the time, they’re working on different parts of the data, but sometimes two people might try to
update the same piece, like booking the same hotel room. To prevent mistakes like double-bookings, we need a way to
carefully manage these interactions.
Handling this (called concurrency) is tricky and full of possible errors, even for skilled programmers. Since enterprise
systems can have many users and other systems accessing data all at once, the chance of problems is high.
Relational databases help with this by using transactions to control all access to the data. Transactions don’t solve every
problem (for example, when two people try to book the same room at the same time, one of them still has to be told that
the booking failed), but they make concurrency much easier to manage.
Transactions are also useful for handling errors. If something goes wrong while making changes, you can roll back the
transaction, which undoes the partial changes and keeps the data clean.
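The rollback behavior described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the rooms and charges tables, the book_room function, and the guest names are assumptions made up for the example, not part of the source.

```python
import sqlite3

# Hypothetical schema: one row per room, plus a table of charges.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms (room_no INTEGER PRIMARY KEY, guest TEXT)")
conn.execute("CREATE TABLE charges (guest TEXT, amount REAL)")
conn.execute("INSERT INTO rooms (room_no, guest) VALUES (101, NULL)")
conn.commit()


def book_room(conn, room_no, guest, amount):
    """Charge the guest and book the room as one transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO charges (guest, amount) VALUES (?, ?)", (guest, amount)
            )
            cur = conn.execute(
                "UPDATE rooms SET guest = ? WHERE room_no = ? AND guest IS NULL",
                (guest, room_no),
            )
            if cur.rowcount == 0:
                raise RuntimeError("room already booked")
        return True
    except RuntimeError:
        return False


first = book_room(conn, 101, "Alice", 120.0)
second = book_room(conn, 101, "Bob", 120.0)  # fails, and Bob's charge is rolled back
charges = conn.execute("SELECT guest FROM charges").fetchall()
```

The second booking still fails, but its partial change (the charge row) is undone, so the data stays clean.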
1.3 Integration
Big business systems often involve many different applications, built by different teams, that need to work together. This
teamwork can be tricky because the applications — and the teams behind them — must share information. If one
application updates some data, the others need to see that update too.
A common solution is called shared database integration. This means all the applications store and use data in the same
database. By doing this, each application can easily access the others’ data, and the database’s built-in concurrency
control makes sure everything stays consistent, even when many applications (or users) are using it at the same time.
Example: Imagine an online store. The website app, the inventory app, and the shipping app all need access to the same
product data. If someone buys the last laptop, the website should instantly show “out of stock,” and the shipping app
should prepare the delivery. A shared database ensures that all apps stay in sync.
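A minimal sketch of the online store example, assuming a single SQLite database stands in for the shared database. The products table and the two function names are hypothetical, and in a real deployment each application would hold its own connection to a database server.

```python
import sqlite3

# One shared database; table and function names are illustrative assumptions.
shared_db = sqlite3.connect(":memory:")
shared_db.execute("CREATE TABLE products (name TEXT PRIMARY KEY, stock INTEGER)")
shared_db.execute("INSERT INTO products VALUES ('laptop', 1)")
shared_db.commit()


def inventory_app_sell(db, name):
    """Inventory application: decrement stock if any is left."""
    with db:
        cur = db.execute(
            "UPDATE products SET stock = stock - 1 WHERE name = ? AND stock > 0",
            (name,),
        )
    return cur.rowcount == 1


def website_app_status(db, name):
    """Website application: read the same row the inventory application writes."""
    (stock,) = db.execute(
        "SELECT stock FROM products WHERE name = ?", (name,)
    ).fetchone()
    return "in stock" if stock > 0 else "out of stock"


before = website_app_status(shared_db, "laptop")
sold = inventory_app_sell(shared_db, "laptop")
after = website_app_status(shared_db, "laptop")
```

Because both functions read and write the same table, the website sees the sale as soon as the inventory update commits.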
1.4 A (Mostly) Standard Model
Relational databases became popular because they offer important benefits in a standardized way. This means that
once developers and database experts learn the basics of how relational databases work, they can apply that knowledge
to many different projects.
Even though different relational databases (like Oracle, MySQL, or PostgreSQL) have small differences, the main ideas
stay the same: their SQL languages are very similar, and transactions work in almost the same way across all of them.
2. Impedance Mismatch
Relational databases have many strengths, but they aren’t perfect. A big problem for developers is the “impedance
mismatch”—the gap between how data is stored in relational databases (as tables and rows) and how data is
represented in programming languages (as objects, lists, or nested structures).
Because of this, developers must translate between the two, which can be frustrating. In the 1990s, many thought
object-oriented databases (which stored data like programming languages do) would replace relational databases. But
while object-oriented programming became popular, object-oriented databases faded away. Relational databases stayed
strong because of their standard SQL language and their role in integrating different systems.
Tools like ORM frameworks (e.g., Hibernate, iBATIS) helped reduce the pain by automating much of the translation work,
but they can also cause performance issues if overused. By the 2000s, relational databases still dominated, though
cracks in their dominance began to appear.
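The translation work behind the impedance mismatch can be sketched as follows. The Order and OrderItem classes and the two mapping functions are assumptions for illustration; an ORM framework automates this kind of mapping, but this is not how any particular framework is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class OrderItem:
    product_id: int
    price: float


@dataclass
class Order:
    order_id: int
    customer_id: int
    items: list = field(default_factory=list)


def order_to_rows(order):
    """Flatten one nested Order object into rows for two hypothetical tables."""
    order_row = (order.order_id, order.customer_id)
    item_rows = [(order.order_id, i.product_id, i.price) for i in order.items]
    return order_row, item_rows


def rows_to_order(order_row, item_rows):
    """Rebuild the nested object from the flat rows."""
    order_id, customer_id = order_row
    items = [
        OrderItem(pid, price) for (oid, pid, price) in item_rows if oid == order_id
    ]
    return Order(order_id, customer_id, items)


original = Order(99, 1, [OrderItem(27, 32.45)])
order_row, item_rows = order_to_rows(original)
rebuilt = rows_to_order(order_row, item_rows)
```

In memory the order is one nested object; in the database it becomes one row in an orders table plus one row per item in a separate table.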
6.1 Aggregates
In the relational model, data is stored in tuples (rows). Tuples are simple: each row just holds values. You can’t nest one
row inside another or store lists inside them. This simplicity makes operations clear, since everything works on rows.
Aggregate orientation is different. It allows complex records that can include lists or nested data. Key-value, document,
and column-family databases all use this idea. We call such a complex record an aggregate.
The term comes from Domain-Driven Design, where an aggregate is a group of related objects treated as one unit.
Aggregates are updated together (atomically). Working in terms of aggregates makes it easier to:
manage consistency,
replicate or shard data across clusters,
and write application code, since developers often manipulate these structures directly.
6.1.1 Example of Relations and Aggregates
Suppose we need to build an e-commerce site to sell products online. We must store details about users, products,
orders, shipping, billing, and payments.
In a relational database, we would design a normalized model with multiple linked tables, ensuring no data is repeated
and relationships are maintained.
In a NoSQL (aggregate-oriented) approach, we would group related data together into larger units (aggregates), such as
keeping order details, shipping, and billing info inside the same record.
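As a rough sketch of the normalized relational design, assuming hypothetical table and column names (the notes do not give a schema), the following uses Python's sqlite3 module. Shipping, billing, and payment tables are left out to keep it short.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical normalized tables: each fact is stored once and linked by ID.
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY,
                     customer_id INTEGER REFERENCES customers(id));
CREATE TABLE order_items (order_id INTEGER REFERENCES orders(id),
                          product_id INTEGER REFERENCES products(id),
                          price REAL);
INSERT INTO customers VALUES (1, 'Martin');
INSERT INTO products VALUES (27, 'NoSQL Distilled');
INSERT INTO orders VALUES (99, 1);
INSERT INTO order_items VALUES (99, 27, 32.45);
""")

# Reassembling one order means joining the tables back together.
order_view = db.execute("""
    SELECT c.name, p.name, oi.price
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    JOIN order_items oi ON oi.order_id = o.id
    JOIN products p ON p.id = oi.product_id
    WHERE o.id = 99
""").fetchall()
```

No value is repeated across tables, but reading a complete order requires a multi-table join.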
Here is some sample data for the aggregate-oriented design, shown in JSON format because that is a common
representation for data in NoSQL land.
// in customers
{
"id": 1,
"name": "Martin",
"billingAddress": [{ "city": "Chicago" }]
}
// in orders
{
"id": 99,
"customerId": 1,
"orderItems": [
{
"productId": 27,
"price": 32.45,
"productName": "NoSQL Distilled"
}
],
"shippingAddress": [{ "city": "Chicago" }],
"orderPayment": [
{
"ccinfo": "1000-1000-1000-1000",
"txnId": "abelif879rft",
"billingAddress": { "city": "Chicago" }
}
]
}
In this model, we have two main aggregates:
Customer → contains billing addresses.
Order → contains order items, shipping address, and payments (each payment also has its own billing address).
Addresses are copied into each aggregate instead of using IDs. This avoids unwanted changes (e.g., an old shipping
address should stay the same for past orders).
Relationships between aggregates are kept separate—for example:
Customer ↔ Order link.
Order item ↔ Product link.
Sometimes product info (like name) is repeated inside the order to avoid extra lookups (denormalization).
The key idea: when designing aggregates, think about how the data will be accessed. For instance, we could also design
it so all customer orders are stored inside the customer aggregate.
Using the above data model, an example Customer and Order would look like this:
// in customers
{
"customer": {
"id": 1,
"name": "Martin",
"billingAddress": [{ "city": "Chicago" }],
"orders": [
{
"id": 99,
"customerId": 1,
"orderItems": [
{
"productId": 27,
"price": 32.45,
"productName": "NoSQL Distilled"
}
],
"orderPayment": [
{
"ccinfo": "1000-1000-1000-1000",
"txnId": "abelif879rft",
"billingAddress": { "city": "Chicago" }
}
]
}
]
}
}
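How the choice of aggregate boundary affects access can be sketched with plain Python dictionaries standing in for the database. The function names are assumptions, and the sample data is trimmed from the JSON above.

```python
# First design: customers and orders are separate aggregates, each with its own key.
customers = {1: {"id": 1, "name": "Martin"}}
orders = {
    99: {"id": 99, "customerId": 1,
         "orderItems": [{"productId": 27, "price": 32.45}]}
}

# Second design: orders live inside the customer aggregate.
customers_with_orders = {
    1: {"id": 1, "name": "Martin",
        "orders": [{"id": 99,
                    "orderItems": [{"productId": 27, "price": 32.45}]}]}
}


def find_order_separate(order_id):
    """One key lookup reaches the order directly."""
    return orders[order_id]


def find_order_embedded(order_id):
    """Without the customer key, every customer aggregate has to be scanned."""
    for customer in customers_with_orders.values():
        for order in customer["orders"]:
            if order["id"] == order_id:
                return order
    return None


def orders_for_customer_embedded(customer_id):
    """All of a customer's orders arrive with a single aggregate read."""
    return customers_with_orders[customer_id]["orders"]
```

The embedded design suits reads that always start from a customer, while the separate design suits reads that start from an order ID.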
6.1.2 Consequences of Aggregate Orientation
Relational databases store data well with tables and relationships, but they don’t recognize aggregates (like “an order =
items + shipping address + payment”). They treat all relationships the same, so the database can’t optimize how data is
grouped or stored. That’s why relational databases—and graph databases—are called aggregate-ignorant.
Being aggregate-ignorant isn’t always bad. It makes data flexible for many uses (e.g., analyzing product sales), but it can
be harder to work with when we want to treat related data as a single unit.
Aggregate-oriented databases (like many NoSQL types) explicitly group data (e.g., an order with all its details). This helps
in clusters, because the whole aggregate can live on one node, reducing cross-node queries.
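A minimal sketch of why aggregates help in clusters, assuming a simple hash-based placement scheme. The node names and the put/get helpers are hypothetical, and real databases use more elaborate partitioning.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def node_for(aggregate_key):
    """Pick a node from a stable hash of the aggregate's key."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


cluster = {name: {} for name in NODES}


def put(aggregate_key, aggregate):
    """Store the whole aggregate on the one node its key maps to."""
    cluster[node_for(aggregate_key)][aggregate_key] = aggregate


def get(aggregate_key):
    """Read the whole aggregate back from that single node."""
    return cluster[node_for(aggregate_key)][aggregate_key]


put("order:99", {
    "id": 99,
    "orderItems": [{"productId": 27}],
    "shippingAddress": [{"city": "Chicago"}],
})
holders = [name for name, data in cluster.items() if "order:99" in data]
```

Because the items and the shipping address travel with the order, reading the order touches exactly one node.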
For transactions, relational databases allow ACID operations across many rows/tables. NoSQL aggregate databases
usually support atomic updates only within a single aggregate. If multiple aggregates need updating together, the
application must handle it.
So:
Relational & graph DBs = aggregate-ignorant, flexible, ACID across tables.
Aggregate-oriented DBs = good for clusters, atomic per aggregate, simpler for many use cases.
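The transaction difference can be sketched with a toy in-memory store. The AggregateStore class and the place_order function are assumptions for illustration and do not model any specific product: each update is atomic for one aggregate, and the application sequences and compensates multi-aggregate changes itself.

```python
import copy
import threading


class AggregateStore:
    """Toy store: each update touches exactly one aggregate, under a lock."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, aggregate):
        with self._lock:
            self._data[key] = copy.deepcopy(aggregate)

    def get(self, key):
        with self._lock:
            return copy.deepcopy(self._data[key])

    def update(self, key, change):
        """Apply change() to a copy and swap it in only if change() succeeds."""
        with self._lock:
            draft = copy.deepcopy(self._data[key])
            change(draft)  # an exception here leaves the stored aggregate untouched
            self._data[key] = draft


store = AggregateStore()
store.put("order:99", {"status": "open", "items": [27]})
store.put("customer:1", {"orderCount": 0})


def bad_change(order):
    order["items"].clear()
    raise ValueError("validation failed after a partial edit")


try:
    store.update("order:99", bad_change)  # the partial edit is discarded
except ValueError:
    pass


def place_order():
    """Two aggregates: the application orders the updates and compensates by hand."""
    store.update("order:99", lambda o: o.update(status="placed"))
    try:
        store.update(
            "customer:1", lambda c: c.update(orderCount=c["orderCount"] + 1)
        )
    except Exception:
        store.update("order:99", lambda o: o.update(status="open"))  # compensate
        raise


place_order()
```

The compensation step in place_order is the kind of logic that a relational transaction spanning both rows would otherwise provide.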
7. Key-Value and Document Data Models
Key-value and document databases are both aggregate-oriented.
In a key-value database, each aggregate is just a “blob” of data linked to a key. The database doesn’t know
what’s inside; it only stores and retrieves it by key. This gives full freedom over what is stored, but the only way to reach the data is through its key.
In a document database, the aggregate has a defined structure (like JSON). The database can look inside it, run
queries on fields, return only parts of it, and build indexes.
Key differences:
Key-value: Simple, flexible, but only key lookups.
Document: Structured, supports queries, indexing, and partial retrieval.
In practice, the boundary blurs—key-value stores add features (like metadata, lists, or search), and document databases
still use IDs for key-based lookups. But generally:
Key-value = lookup by key.
Document = query by internal structure.
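The contrast can be sketched with two toy classes. KeyValueStore and DocumentStore, along with the find method and its project parameter, are hypothetical names for this illustration and not the API of any real database.

```python
import json


class KeyValueStore:
    """Stores opaque bytes; the only access path is the key."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, blob):
        self._blobs[key] = blob

    def get(self, key):
        return self._blobs[key]


class DocumentStore:
    """Stores parsed documents, so it can filter on fields and return parts."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, document):
        self._docs[doc_id] = document

    def get(self, doc_id):
        return self._docs[doc_id]

    def find(self, field, value, project=None):
        """Return documents whose top-level field equals value.

        If project is given, return only those fields of each match.
        """
        matches = [d for d in self._docs.values() if d.get(field) == value]
        if project is None:
            return matches
        return [{name: d[name] for name in project if name in d} for d in matches]


order = {"id": 99, "customerId": 1,
         "orderItems": [{"productId": 27, "price": 32.45}]}

kv = KeyValueStore()
kv.put("order:99", json.dumps(order).encode("utf-8"))
blob = kv.get("order:99")  # the application must decode the blob itself

docs = DocumentStore()
docs.put(99, order)
partial = docs.find("customerId", 1, project=["id"])
```

The key-value store hands back bytes it cannot interpret, while the document store can answer a query on customerId and return only the id field.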
8. Column-Family Stores
Google’s Bigtable was one of the first big NoSQL databases and inspired others like HBase and Cassandra. Although its
name suggests a table, it’s better to think of it as a two-level map.
Before NoSQL, column stores (like C-Store) still used SQL but stored data by columns instead of rows to make reading a
few columns across many rows faster.
Bigtable-style databases (called column-family databases) take a different approach:
Data is grouped into column families.
Each row (aggregate) is identified by a key.
Inside each row, data is stored as a map of columns.
You can fetch the whole row or just a specific column (e.g., get('1234', 'name')).
So, column-family databases mix the ideas of key-value stores and structured access, letting you read data either by row
or by column group.
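The row-then-column access described above can be sketched with nested dictionaries. The profile and orders family names, the placeholder values, and the helper functions are assumptions for illustration; the get signature follows the get('1234', 'name') example in the text.

```python
# Row key -> column family -> column name -> value (all values are placeholders).
table = {
    "1234": {
        "profile": {"name": "Martin", "billingAddress": "Chicago"},
        "orders": {"ODR1001": "order data", "ODR1002": "order data"},
    }
}


def get_row(row_key):
    """Fetch the whole row aggregate."""
    return table[row_key]


def get_family(row_key, family):
    """Fetch one column family of a row."""
    return table[row_key][family]


def get(row_key, column):
    """Fetch a single column, searching the row's column families."""
    for columns in table[row_key].values():
        if column in columns:
            return columns[column]
    raise KeyError(column)


name = get("1234", "name")
order_columns = sorted(get_family("1234", "orders"))
```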
In column-family databases, data can be seen in two ways:
Row-oriented: Each row is like an aggregate (e.g., a customer) with column families grouping related data
(profile, order history).
Column-oriented: Each column family defines a type of record (e.g., customer profiles), and rows are like a join
of these records.
Unlike relational tables, rows in column-family databases don’t have to share the same columns—you can add new
columns freely. But creating new column families is rare and more complex.
In Cassandra, a row belongs to only one column family, but that column family can contain supercolumns (columns that
hold nested columns). Supercolumns are the closest equivalent to Bigtable’s column families.
Column families can be:
Skinny: few columns, same structure across rows (like records).
Wide: many columns, often different per row, useful for modeling lists (e.g., order items).
Wide column families can also define a sort order for their columns, so you can fetch a range of columns (e.g., orders keyed by date + ID, like 20111027-1001).
So, column families give databases a flexible two-dimensional structure—part record-like, part list-like.
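A range query over sorted columns in a wide row can be sketched with Python's bisect module. The WideRow class, its slice method, and the sample orders are hypothetical; the column names follow the date + ID pattern mentioned above.

```python
import bisect


class WideRow:
    """Toy wide row that keeps its column names in sorted order."""

    def __init__(self):
        self._names = []   # sorted column names
        self._values = {}  # column name -> value

    def put(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name < end, in column order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_left(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


orders = WideRow()
orders.put("20111027-1001", "order 1001")
orders.put("20110915-0987", "order 0987")
orders.put("20111101-1003", "order 1003")
orders.put("20111027-1002", "order 1002")

# All orders placed in October 2011, regardless of insertion order.
october = orders.slice("20111001", "20111101")
```

Because the date comes first in the column name, a plain string comparison is enough to select a date range.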