Dod Unit4

Uploaded by

Madhu

UNIT 4

Data Modelling and Aggregation in MongoDB:


Data modeling in MongoDB is essential for designing databases that are
efficient and scalable, tailored to the specific needs of an application.
MongoDB's document-oriented structure differs from the traditional relational
database model. MongoDB stores data as BSON (Binary JSON), which supports
complex, nested data structures.
The key to efficient data modeling in MongoDB is understanding your use case,
access patterns, and balancing between embedded and normalized data
models. Aggregation is the process of performing operations on data like
filtering, grouping, and summarizing to gain insights.
1. Data Models in MongoDB
MongoDB supports flexible data modeling, and it offers two primary types of
data models:
1. Embedded Data Models (Denormalized Data)
2. Normalized Data Models (Referential Data)
The choice of which model to use depends on the structure of the data and
how it is accessed or queried. It’s crucial to model the data based on the
specific use cases to ensure the best performance.
a) Embedded Data Models (Denormalized Data)
In an embedded data model, related data is stored within the same document.
This avoids the need for complex joins, which can improve query performance.
Embedded data models suit one-to-one, one-to-few, and bounded one-to-many
relationships where related data is typically accessed together.
Example of an Embedded Data Model:
If you are building a blogging platform, you might embed the comments within
the blog post document:
{
"_id": 1,
"title": "Understanding MongoDB",
"author": "Jane Doe",
"content": "MongoDB is a document-based database...",
"comments": [
{
"author": "John Smith",
"comment": "Great article!",
"date": "2024-10-10"
},
{
"author": "Alice Brown",
"comment": "Very informative.",
"date": "2024-10-09"
}
]
}
In this example, the post's title, author, and content are stored along with the
comments inside a single document. This enables faster reads because the post
and its comments are retrieved in one query.
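A minimal sketch of that single-read access pattern, written in plain JavaScript with an in-memory Map standing in for the collection. The posts map and the findPostById helper are hypothetical names used only for illustration; they are not part of any MongoDB API.

```javascript
// In-memory stand-in for a "posts" collection, keyed by _id (illustrative only).
const posts = new Map([
  [1, {
    _id: 1,
    title: "Understanding MongoDB",
    author: "Jane Doe",
    comments: [
      { author: "John Smith", comment: "Great article!", date: "2024-10-10" },
      { author: "Alice Brown", comment: "Very informative.", date: "2024-10-09" }
    ]
  }]
]);

// Hypothetical helper: one lookup returns the post together with its comments.
function findPostById(id) {
  return posts.get(id) ?? null;
}

const post = findPostById(1);
const commentAuthors = post.comments.map((c) => c.author);
console.log(post.title, commentAuthors);
```

Against a real deployment, and assuming the collection is named posts, the same read would be a single db.posts.findOne({ _id: 1 }) call.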
Advantages of Embedded Data Models:
• Faster Reads: Since the data is denormalized, there is no need for joins,
which improves read performance.
• Atomic Operations: A write to a single document, including its embedded
data, is performed as one atomic operation.
• Convenience: All the data you need is in a single document, which is
easier to manage and retrieve for certain use cases.
Disadvantages of Embedded Data Models:
• Document Size Limitation: MongoDB has a 16 MB document size limit,
so very large embedded structures might not be suitable.
• Redundancy: Data redundancy may occur if embedded data is
duplicated across multiple documents, leading to increased storage
requirements and the potential for inconsistent data.
b) Normalized Data Models (Referential Data)
In a normalized data model, data is split across multiple collections. Documents
reference each other using unique identifiers (_id fields). This model is useful
when different parts of the data are frequently accessed independently or
updated separately.
Example of a Normalized Data Model:
If you have an e-commerce platform, customer and order data can be stored in
separate collections:
• Customer Collection:
{
"_id": 1,
"name": "Jane Doe",
"email": "[email protected]"
}
• Order Collection:
{
"_id": 1001,
"customer_id": 1,
"items": [
{ "product": "Laptop", "price": 1200 },
{ "product": "Mouse", "price": 50 }
],
"order_date": "2024-10-10"
}
Here, customer_id in the Order collection references the customer in the
Customer collection. This allows you to store related data separately and
combine it when needed, either manually in the application or with the
$lookup aggregation stage.
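The manual, application-side variant of that join can be sketched as follows. This is plain JavaScript over in-memory arrays, not driver code; the orderWithCustomer helper is a hypothetical name, and the arrays stand in for the two collections.

```javascript
// In-memory stand-ins for the Customer and Order collections (illustrative only).
const customers = [{ _id: 1, name: "Jane Doe" }];
const orders = [
  {
    _id: 1001,
    customer_id: 1,
    items: [
      { product: "Laptop", price: 1200 },
      { product: "Mouse", price: 50 }
    ],
    order_date: "2024-10-10"
  }
];

// Step 1: fetch the order. Step 2: follow customer_id to the customer.
function orderWithCustomer(orderId) {
  const order = orders.find((o) => o._id === orderId);
  if (!order) return null;
  const customer = customers.find((c) => c._id === order.customer_id) ?? null;
  return { ...order, customer };
}

const joined = orderWithCustomer(1001);
console.log(joined.customer.name);
```

Each find call here corresponds to one round trip to the database in a real application, which is where the extra latency of the normalized model comes from.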
Advantages of Normalized Data Models:
• Data Integrity: Each piece of data is stored once, so an update made in
one place is seen everywhere it is referenced.
• Flexibility: Data is modular and can be updated independently.
• Scalability: Normalized data models can handle complex many-to-many
relationships more easily.
Disadvantages of Normalized Data Models:
• More Complex Queries: Retrieving related data often requires multiple
queries or joins, which may increase query complexity.
• Increased Latency: Performing multiple queries to retrieve data from
different collections can lead to increased latency compared to
embedded models.

Aggregation in MongoDB:
Aggregation in MongoDB refers to the process of transforming, filtering, and
processing data stored in collections to produce computed results. It is a
powerful feature used for various operations such as grouping, filtering,
sorting, and summarizing data. MongoDB's aggregation framework allows for
complex data operations, similar to SQL’s GROUP BY and aggregate functions
like SUM(), COUNT(), and AVG(), but with more flexibility and scalability for
handling large datasets.
MongoDB’s aggregation framework uses a pipeline approach, where data
passes through multiple stages to be transformed or aggregated at each step.
The aggregation pipeline is a sequence of operations applied to the documents
of a collection, similar to how a manufacturing pipeline would work,
transforming the input at each stage.
1. Aggregation Pipeline
The aggregation pipeline in MongoDB is a multi-step process where each stage
in the pipeline receives documents from the previous stage, processes them,
and passes the results to the next stage. This pipeline is both flexible and
efficient, enabling developers to perform complex data transformations directly
within the database.
Key Characteristics:
• Modular: The aggregation pipeline breaks down complex data
operations into discrete stages, making it easier to construct and debug.
• Optimized: MongoDB's optimizer can reorder or merge stages (for
example, moving a $match ahead of a $sort) to process data efficiently.
• Streamlined: Documents flow from one stage to the next, although
blocking stages such as $group and $sort must consume all of their input
before they emit any output.
Example of Aggregation Pipeline Syntax:
Here is a basic structure of an aggregation pipeline:
db.collection.aggregate([
{ stage1 },
{ stage2 },
{ stage3 }
])
In this example, stage1, stage2, and stage3 represent different stages in the
aggregation pipeline, where each stage applies some operation to the data.
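To make the stage-by-stage flow concrete, here is a small sketch that models a pipeline as a list of functions over an in-memory array. It only imitates the idea; runPipeline and the three stage functions are hypothetical names, not MongoDB APIs.

```javascript
// Each stage is a function from an array of documents to a new array.
const matchActive = (docs) => docs.filter((d) => d.status === "active");
const sortByAmountDesc = (docs) => [...docs].sort((a, b) => b.amount - a.amount);
const limitTwo = (docs) => docs.slice(0, 2);

// Hypothetical runner: feeds the output of each stage into the next one.
function runPipeline(docs, stages) {
  return stages.reduce((current, stage) => stage(current), docs);
}

const input = [
  { _id: 1, status: "active", amount: 30 },
  { _id: 2, status: "inactive", amount: 90 },
  { _id: 3, status: "active", amount: 70 },
  { _id: 4, status: "active", amount: 10 }
];

const output = runPipeline(input, [matchActive, sortByAmountDesc, limitTwo]);
console.log(output.map((d) => d._id)); // [ 3, 1 ]
```

The order of the stages matters: placing limitTwo first would keep documents 1 and 2 before any filtering happens and give a different result.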
2. Aggregation Pipeline Stages
Each stage in the aggregation pipeline transforms the documents in some way.
MongoDB provides many powerful stages that can be used to manipulate data.
Common Aggregation Pipeline Stages:
1. $match:
o Filters documents that match specific conditions, similar to the
WHERE clause in SQL.
o This stage is typically used to reduce the number of documents
passed to subsequent stages, improving pipeline efficiency.
Example:
{ $match: { status: "active" } }
Filters documents where the status field is "active".
2. $group:
o Groups documents by a specified field and performs aggregate
operations like summing, averaging, or counting.
o This is similar to the GROUP BY clause in SQL.
Example:
{ $group: { _id: "$category", totalAmount: { $sum: "$amount" } } }
Groups documents by the category field and calculates the total amount for
each category.
3. $project:
o Reshapes the documents by including or excluding specific fields
or creating new fields. It is used to return only the fields needed
for a query.
o Similar to the SELECT clause in SQL, but with the ability to create
calculated fields.
Example:
{ $project: { name: 1, totalCost: { $multiply: ["$price", "$quantity"] } } }
Projects the name field and a new totalCost field, which multiplies the price by
the quantity.
4. $sort:
o Sorts the documents based on one or more fields. It can be used
to arrange documents in ascending or descending order.
Example:
{ $sort: { totalAmount: -1 } }
Sorts documents in descending order based on the totalAmount field.
5. $limit:
o Limits the number of documents returned by the aggregation
pipeline.
Example:
{ $limit: 5 }
Limits the output to the first 5 documents.
6. $unwind:
o Deconstructs an array field from the input documents to output a
document for each element in the array. This is useful when
working with documents containing arrays and you want to treat
each array element as a separate document.
Example:
{ $unwind: "$items" }
Breaks down the items array, outputting one document per item.
7. $lookup:
o Performs a left outer join between two collections. It allows you to
combine documents from one collection with matching
documents from another collection based on a common field.
Example:
{
$lookup: {
from: "orders",
localField: "_id",
foreignField: "customer_id",
as: "orders"
}
}
Adds an orders array to each input document, containing the documents from
the orders collection whose customer_id equals the input document's _id.
8. $skip:
o Skips a specified number of documents in the pipeline. Often used
in conjunction with $limit to paginate results.
Example:
{ $skip: 10 }
Skips the first 10 documents.
9. $out:
o Writes the results of the aggregation pipeline to a new collection.
This can be useful for saving transformed data for future use.
Example:
{ $out: "transformed_data" }
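Of the stages listed above, $unwind and $lookup are usually the least intuitive. The sketch below imitates their default behaviour on in-memory arrays; unwind and lookup are hypothetical helper functions written for this illustration, and the sample documents are made up.

```javascript
// $unwind equivalent: one output document per element of the array field.
// Assumes the field is an array or absent; documents without elements are dropped.
function unwind(docs, field) {
  return docs.flatMap((doc) =>
    (doc[field] ?? []).map((element) => ({ ...doc, [field]: element }))
  );
}

// $lookup equivalent (left outer join): every input document is kept, and
// matching foreign documents are collected into the array field named `as`.
function lookup(docs, { from, localField, foreignField, as }) {
  return docs.map((doc) => ({
    ...doc,
    [as]: from.filter((f) => f[foreignField] === doc[localField])
  }));
}

const customerDocs = [
  { _id: 1, name: "Jane Doe" },
  { _id: 2, name: "John Doe" }
];
const orderDocs = [
  { _id: 1001, customer_id: 1, items: [{ product: "Laptop" }, { product: "Mouse" }] }
];

const withOrders = lookup(customerDocs, {
  from: orderDocs, localField: "_id", foreignField: "customer_id", as: "orders"
});
const perItem = unwind(orderDocs, "items");

console.log(withOrders.map((c) => c.orders.length)); // [ 1, 0 ]
console.log(perItem.map((o) => o.items.product));    // [ 'Laptop', 'Mouse' ]
```

The customer without orders still appears in the joined result with an empty array, which is what makes the join "left outer".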
3. Aggregation Expressions
MongoDB’s aggregation framework provides many expressions for calculations,
data manipulation, and condition evaluation. Expressions can be used within
pipeline stages like $project and $group, and inside $match when wrapped in
$expr.
Common Aggregation Expressions:
• Arithmetic Expressions: Perform mathematical operations like addition,
multiplication, etc.
o Example: { $multiply: [ "$price", "$quantity" ] } computes the
product of the price and quantity fields.
• String Expressions: Perform string manipulations like concatenation,
substring extraction, etc.
o Example: { $concat: [ "$firstName", " ", "$lastName" ] }
concatenates the first name and last name fields.
• Array Expressions: Manipulate arrays within documents.
o Example: { $size: "$items" } returns the size of the items array.
• Conditional Expressions: Allow for conditional logic in the pipeline,
similar to SQL’s CASE statement.
o Example: { $cond: { if: { $gt: [ "$age", 18 ] }, then: "Adult", else:
"Minor" } } returns "Adult" if age is greater than 18, otherwise
returns "Minor".
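As a rough guide to what these expressions compute, the snippet below evaluates the plain JavaScript equivalent of each one against a single sample document. The document and its values are invented for illustration.

```javascript
// Sample document (made-up values).
const doc = {
  firstName: "Jane",
  lastName: "Doe",
  price: 20,
  quantity: 3,
  age: 25,
  items: ["pen", "notebook"]
};

const totalCost = doc.price * doc.quantity;           // { $multiply: ["$price", "$quantity"] }
const fullName = doc.firstName + " " + doc.lastName;  // { $concat: ["$firstName", " ", "$lastName"] }
const itemCount = doc.items.length;                   // { $size: "$items" }
const ageGroup = doc.age > 18 ? "Adult" : "Minor";    // { $cond: { if: { $gt: ["$age", 18] }, ... } }

console.log(totalCost, fullName, itemCount, ageGroup); // 60 Jane Doe 2 Adult
```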
4. Examples of Aggregation in MongoDB
Here are a few real-world examples that show how to use MongoDB's
aggregation framework:
Example 1: Grouping Data by Category and Calculating Total Sales
db.sales.aggregate([
{ $group: { _id: "$category", totalSales: { $sum: "$amount" } } },
{ $sort: { totalSales: -1 } }
])
This pipeline groups the sales documents by the category field and calculates
the total sales for each category using the $sum operator. It then sorts the
results in descending order by the total sales.
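The same group-then-sort logic can be traced by hand on a few documents. The sketch below is plain JavaScript over an in-memory array with invented sample data; it mirrors what the pipeline computes rather than how the server executes it.

```javascript
// Made-up sales documents.
const sales = [
  { category: "Electronics", amount: 1200 },
  { category: "Books", amount: 40 },
  { category: "Electronics", amount: 50 },
  { category: "Books", amount: 25 }
];

// $group: accumulate a running sum per category.
const totals = new Map();
for (const { category, amount } of sales) {
  totals.set(category, (totals.get(category) ?? 0) + amount);
}

// $sort: order the groups by totalSales, largest first.
const result = [...totals]
  .map(([category, totalSales]) => ({ _id: category, totalSales }))
  .sort((a, b) => b.totalSales - a.totalSales);

console.log(result);
// [ { _id: 'Electronics', totalSales: 1250 }, { _id: 'Books', totalSales: 65 } ]
```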
Example 2: Filtering and Aggregating Data
db.customers.aggregate([
{ $match: { "age": { $gte: 18 } } },
{ $group: { _id: "$city", totalCustomers: { $sum: 1 } } },
{ $sort: { totalCustomers: -1 } }
])
This pipeline first filters documents where the age is greater than or equal to
18. Then it groups the customers by their city and counts the total number of
customers in each city. Finally, it sorts the cities by the number of customers in
descending order.
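For comparison, here is the filter, count, and sort sequence written as plain JavaScript over invented customer records. Note how { $sum: 1 } corresponds to adding 1 for every document that reaches the group.

```javascript
// Made-up customer documents.
const customers = [
  { name: "Ana", age: 34, city: "Pune" },
  { name: "Ben", age: 17, city: "Pune" },
  { name: "Cai", age: 52, city: "Delhi" },
  { name: "Dev", age: 29, city: "Pune" },
  { name: "Eli", age: 18, city: "Delhi" },
  { name: "Fay", age: 41, city: "Pune" }
];

const countsByCity = customers
  .filter((c) => c.age >= 18)                    // $match: { age: { $gte: 18 } }
  .reduce((counts, c) => {                       // $group with { $sum: 1 }
    counts[c.city] = (counts[c.city] ?? 0) + 1;
    return counts;
  }, {});

const adultsPerCity = Object.entries(countsByCity)
  .map(([city, totalCustomers]) => ({ _id: city, totalCustomers }))
  .sort((a, b) => b.totalCustomers - a.totalCustomers); // $sort descending

console.log(adultsPerCity);
// [ { _id: 'Pune', totalCustomers: 3 }, { _id: 'Delhi', totalCustomers: 2 } ]
```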
5. Aggregation Framework Performance Optimization
MongoDB offers several strategies to optimize the performance of aggregation
operations, especially when dealing with large datasets.
Tips for Optimizing Aggregation Performance:
1. Use Indexes: Index the fields used by $match and $sort stages placed at
the start of the pipeline, where they can use an index. Later stages work
on intermediate results and generally cannot, so indexing a field that is
only used in a $group stage rarely helps.
2. Filter Early: Use the $match stage early in the pipeline to reduce the
number of documents passed to subsequent stages.
3. Limit Result Set Size: If possible, use $limit to restrict the number of
documents processed by the pipeline, reducing the overall computation
time.
4. Avoid Expensive Operations: Minimize the use of stages like $unwind or
$lookup, as they can be resource-intensive when working with large
datasets.
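The effect of filtering early can be illustrated by counting how many documents an expensive stage has to touch. This is an in-memory sketch with synthetic data; the expensive function is a stand-in for any costly stage and is not a MongoDB operator.

```javascript
// Counter for how many documents the costly stage processes.
let touched = 0;

// Stand-in for a costly stage: does some work for every document it receives.
const expensive = (docs) =>
  docs.map((d) => {
    touched += 1;
    return { ...d, score: d.amount * 2 };
  });

const matchActive = (docs) => docs.filter((d) => d.status === "active");

// 1000 synthetic documents, one in ten of them "active".
const docs = Array.from({ length: 1000 }, (_, i) => ({
  _id: i,
  status: i % 10 === 0 ? "active" : "inactive",
  amount: i
}));

touched = 0;
matchActive(expensive(docs));        // filter late
const lateFilterWork = touched;

touched = 0;
expensive(matchActive(docs));        // filter early
const earlyFilterWork = touched;

console.log(lateFilterWork, earlyFilterWork); // 1000 100
```

Both orderings return the same 100 documents; only the amount of work done by the costly stage differs.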
6. Use Cases of Aggregation
MongoDB's aggregation framework is versatile and can be used for many
real-world use cases:
• E-commerce: Analyzing product sales, calculating revenue, or
determining top-selling categories.
• Social Media: Counting the number of posts or interactions by user, or
filtering and ranking content based on user activity.
• Analytics: Summarizing log data, generating reports, or performing trend
analysis on large datasets.

Data Model Relationship Between Documents:


MongoDB, unlike relational databases, does not enforce relationships between
collections (the counterpart of tables in SQL). Real-world applications still
need to manage relationships between data, especially in complex systems.
Because MongoDB has a flexible schema, developers can decide how related
documents (the counterpart of rows in SQL) are stored, based on how the data
will be accessed, updated, and managed.
There are two main ways to model relationships between documents in
MongoDB:
• Embedding: Storing one document inside another document.
• Referencing: Storing a reference to one document (its _id value, often an
ObjectId) inside another document.
Both approaches have their strengths and weaknesses, and the decision on
which to use depends on the specific use case, the relationships between the
data, and the performance requirements of the system.
Example of Document Relationships:
Let’s consider a simple e-commerce application with two collections: Users and
Orders. The relationship between these collections can be modeled in two
ways: embedding the order details inside the user document or referencing the
order documents with the userId in the Orders collection.
We’ll dive into both embedding and referencing as strategies for managing
relationships in MongoDB.

1. Data Model Using an Embedded Document


Embedded documents are a fundamental feature of MongoDB. In this model,
one document contains another document, effectively storing related
information within the same document. This approach creates a denormalized
data structure where related data is stored together, minimizing the need for
multiple database lookups.
Key Characteristics of Embedded Documents:
• Embedded documents allow MongoDB to store and retrieve related data
in a single database call, improving read performance.
• This model is ideal for 1-to-1 or 1-to-few relationships where the related
data is frequently accessed together.
• It simplifies the data model and is atomic, meaning all data inside the
document can be modified in a single operation, ensuring consistency.
Example:
Consider an Orders document in an e-commerce application:
{
"_id": 1,
"userId": 123,
"customerName": "John Doe",
"items": [
{
"productId": 1001,
"productName": "Laptop",
"quantity": 1,
"price": 1000
},
{
"productId": 1002,
"productName": "Mouse",
"quantity": 2,
"price": 50
}
],
"shippingAddress": {
"street": "123 Maple St",
"city": "New York",
"state": "NY",
"postalCode": "10001"
}
}
In this example:
• The items array contains sub-documents that represent each product in
the order.
• The shippingAddress is another embedded document, containing fields
for the address.
• All related data (order details, items, and address) are embedded in one
document.
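Because everything lives in one document, derived values such as the order total can be computed without touching any other collection. A small sketch in plain JavaScript, using the order document above as an in-memory object:

```javascript
// The order document from the example, held in memory.
const order = {
  _id: 1,
  userId: 123,
  customerName: "John Doe",
  items: [
    { productId: 1001, productName: "Laptop", quantity: 1, price: 1000 },
    { productId: 1002, productName: "Mouse", quantity: 2, price: 50 }
  ],
  shippingAddress: { street: "123 Maple St", city: "New York", state: "NY", postalCode: "10001" }
};

// Sum price * quantity over the embedded items array.
const orderTotal = order.items.reduce((sum, item) => sum + item.price * item.quantity, 0);

// Nested fields of the embedded address are read with ordinary property access.
const shipTo = `${order.shippingAddress.city}, ${order.shippingAddress.state}`;

console.log(orderTotal, shipTo); // 1100 New York, NY
```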
Advantages of Using Embedded Documents:
1. Performance: Embedding data allows MongoDB to retrieve everything in
a single query, which is useful when related data is always required
together. For example, in a shopping cart or order page, you may want to
show both order details and product information.
2. Atomic Operations: Since all the related data is stored in a single
document, you can update, delete, or insert data atomically. For
instance, modifying an item in the items array or changing the shipping
address is part of a single document operation, which simplifies
transactional requirements.
3. Data Locality: Data that is used together is stored together, leading to
better performance on read-heavy workloads.
4. Simple Structure: Embedding simplifies queries by eliminating the need
for complex joins. This makes data retrieval faster and easier for
developers to work with.
Disadvantages of Embedded Documents:
1. Document Size Limit: MongoDB has a limit of 16 MB per document. If
embedded data grows too large, it can exceed the document size limit,
leading to potential issues with scalability.
2. Duplication: If the same embedded data is used across multiple
documents, you can end up with data duplication. This can make
updates more complex as you must update every instance of the
duplicated data.
3. Update Complexity: Updating specific parts of an embedded document
may require reading and rewriting the entire document, which can
become inefficient if only a small part needs to change.
When to Use Embedding:
• Use embedding when there is a 1-to-1 or 1-to-few relationship, and the
data is often accessed together.
• Embedding is ideal when the embedded data does not grow large or
frequently change independently of the main document.
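One way to get a feel for how embedded arrays approach the 16 MB limit is to estimate document size as the array grows. The sketch below uses the length of the JSON text as a rough proxy, which is an assumption made for illustration: the real limit applies to the BSON encoding, whose size differs. The makePost and approximateSize helpers are hypothetical.

```javascript
const MAX_DOCUMENT_BYTES = 16 * 1024 * 1024; // the 16 MB limit, in bytes

// Rough proxy only: JSON text length is not the exact BSON size.
function approximateSize(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// Build a post with the given number of embedded comments (made-up content).
function makePost(commentCount) {
  const comments = Array.from({ length: commentCount }, (_, i) => ({
    author: `user${i}`,
    comment: "Great article!",
    date: "2024-10-10"
  }));
  return { _id: 1, title: "Understanding MongoDB", comments };
}

const small = approximateSize(makePost(10));
const large = approximateSize(makePost(10000));

console.log(small, large, large < MAX_DOCUMENT_BYTES);
```

The size grows linearly with the number of comments, so an array with no upper bound will eventually cross the limit; that is the signal to move the array into its own collection.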

2. Data Model Using Document References


In contrast to embedding, referencing stores the related data in separate
documents, and the relationship is maintained by storing the _id (often an
ObjectId) of one document in another. This approach is similar to foreign key
relationships in SQL databases, except that MongoDB does not enforce the link.
Key Characteristics of Document References:
• Document references support normalized data, where related data is
stored separately, reducing data duplication.
• This model is ideal for 1-to-many or many-to-many relationships where
the related data might grow significantly over time or needs to be shared
across multiple collections.
Example:
Let’s extend our previous example of the Users and Orders collections. Instead
of embedding the order data in the user document, we can store the orders in
a separate collection and reference the user by userId:
• Users Collection:
{
"_id": 123,
"name": "John Doe",
"email": "[email protected]"
}
• Orders Collection:
{
"_id": 1,
"userId": 123, // Reference to the Users collection
"items": [
{
"productId": 1001,
"productName": "Laptop",
"quantity": 1,
"price": 1000
},
{
"productId": 1002,
"productName": "Mouse",
"quantity": 2,
"price": 50
}
],
"shippingAddress": {
"street": "123 Maple St",
"city": "New York",
"state": "NY",
"postalCode": "10001"
}
}
In this case:
• The userId field in the Orders collection refers to the _id field in the
Users collection, creating a reference between the two.
• To retrieve the user information for an order, the application either runs
two queries (one against Orders, one against Users) or uses a single
aggregation with a $lookup stage.
Advantages of Using References:
1. Data Normalization: By using references, you can avoid duplication.
Instead of embedding the same user information across multiple
documents, you store it once in the Users collection and reference it in
related collections like Orders.
2. Scalability: Since each document is stored separately, you avoid the
16MB document size limit. This makes referencing ideal for relationships
where one document is associated with many other documents.
3. Flexibility: With referencing, different pieces of related data can evolve
independently. For example, you can update a user’s email address
without having to update every order they have placed.
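The flexibility point can be seen in a few lines: with references, one write to the user is visible from every order. This is an in-memory sketch in plain JavaScript; the Map and array stand in for the two collections, and the email addresses are placeholder values invented for the example.

```javascript
// In-memory stand-ins for the Users and Orders collections (illustrative only).
const users = new Map([
  [123, { _id: 123, name: "John Doe", email: "old@example.com" }]
]);
const orders = [
  { _id: 1, userId: 123 },
  { _id: 2, userId: 123 }
];

// One write to the user document...
users.get(123).email = "new@example.com";

// ...is seen by every order that follows its userId reference.
const emails = orders.map((o) => users.get(o.userId).email);

console.log(emails); // [ 'new@example.com', 'new@example.com' ]
```

Had the email been embedded in each order instead, the same change would have required one update per order.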
Disadvantages of Using References:
1. Complex Queries: Unlike embedding, referencing requires extra work to
retrieve related data. In the above example, getting user details along
with an order takes two separate queries or a $lookup stage (a join in
SQL terms). This adds complexity to your application logic and can reduce
performance.
2. Consistency: Since the data is stored in separate documents, ensuring
data consistency across multiple related documents becomes more
challenging, especially in distributed systems.
3. Write Performance: When related data is updated frequently, updating
references may become more complex. For example, if user data
changes and is referenced across multiple collections, keeping those
references up-to-date can become a challenge.
When to Use References:
• Use referencing when there is a 1-to-many or many-to-many
relationship, and when related data is updated independently.
• Referencing is suitable when you need to maintain a normalized data
structure and avoid document size limits.

3. Choosing Between Embedding and Referencing


MongoDB’s flexible schema allows developers to choose between embedding
and referencing based on the needs of their application. The decision typically
comes down to:
• How frequently the data is accessed together: If the data is always
accessed together, embedding is a better option. If not, referencing can
keep your data normalized and reduce duplication.
• Size of the data: If embedding leads to large document sizes (close to
16MB), referencing is a better option.
• Update Patterns: If parts of the embedded document change frequently,
it might be better to separate them using references.
A Hybrid Approach:
In many cases, the best solution is to combine both embedding and
referencing. For example, frequently accessed data can be embedded, while
larger or less frequently used data can be referenced. This provides the
benefits of both approaches.
Example of Hybrid Model: An order might embed the product details (since
they are frequently accessed together) while referencing the user’s profile
(which is shared across many orders and doesn’t change often).
{
"_id": 1,
"userId": 123, // Reference to Users collection
"items": [
{
"productId": 1001,
"productName": "Laptop",
"quantity": 1,
"price": 1000
}
],
"shippingAddress": {
"street": "123 Maple St",
"city": "New York",
"state": "NY",
"postalCode": "10001"
}
}
In this hybrid approach:
• The items field is embedded because it is part of the order and accessed
with the order.
• The userId is a reference because user data may be accessed
independently, and the same user can have multiple orders.
