UNIT 3 - MongoDB Explained
Compiled by Ms. Archana Patil, Asst. Professor, ALSJ. Ref: Practical MongoDB - Shakuntala Gupta Edward
* MongoDB uses MMAP as its default storage engine. This engine works with memory-mapped files.
* Memory-mapped files are data files that are placed by the operating system in memory using the mmap() system call.
* mmap is an OS feature that maps a file on the disk into virtual memory.
* MongoDB uses memory-mapped files for any data interaction or data management activity. As and when documents are accessed, the data files are memory-mapped into memory. MongoDB allows the OS to control the memory mapping and allocate the maximum amount of RAM.
* With the release of version 3.0, MongoDB comes with a pluggable storage engine API that enables you to select between storage engines based on the workload, application needs, and available infrastructure.
* The vision behind the pluggable storage engine layer is to have one data model, one query language, and one set of operational concerns, but under the hood many storage engine options optimized for different use cases.
The pluggable storage engine feature also provides flexibility in terms of deployments
wherein multiple types of storage engines can coexist in the same deployment.
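As an illustration (the --dbpath value is an example, not from the source), the storage engine is selected when the mongod is started:

# Start mongod with the WiredTiger storage engine (MongoDB 3.0+)
mongod --storageEngine wiredTiger --dbpath /data/db
# Or explicitly select the default MMAPv1 engine
mongod --storageEngine mmapv1 --dbpath /data/db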
* MongoDB version 3.0 ships with two storage engines.
* The default, MMAPv1, is an improved version of the MMAP
engine used in the prior versions. The updated MongoDB
MMAPv1 storage engine implements collection-level
concurrency control. This storage engine excels at workloads
with high volume reads, inserts, and in-place updates.
* The new WiredTiger storage engine was developed by the
architects of Berkeley DB, the most widely deployed
embedded data management software in the world.
* WiredTiger scales on modern multi-CPU architectures. It is
designed to take advantage of modern hardware with multi-
core CPUs and more RAM.
* WiredTiger stores data in compressed format on the disk. Compression reduces the data size by up to 70% (disk only) and the index size by up to 50% (both disk and memory), depending on the compression algorithm used.
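As a hedged sketch (the collection name and compressor choice are examples), the compression of an individual collection can be overridden when it is created:

// Create a collection that uses zlib block compression instead of the
// default snappy compressor (WiredTiger engine only)
db.createCollection("logs", {
  storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
})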
Namespace (.ns File)
* Within the data files you have data space divided into
namespaces , where the namespace can correspond to either
a collection or an index.
* The metadata of these namespaces is stored in the .ns file. If you check your data directory, you will find a file named [dbname].ns.
* By default, the size of the .ns file used for storing the metadata is 16MB.
Indexes: BTree
* The BTree structure is used for storing the indexes.
[Figure: a BTree structure]
* When the storage option selected is WiredTiger, data, journals,
and indexes are compressed on disk.
* The compression is done based on the compression algorithm
specified when starting the mongod.
* Snappy is the default compression option.
* Under the data directory, there are separate compressed .wt files corresponding to each collection and each index.
* Journals have their own folder under the data directory.
* The compressed files are actually created when data is
inserted in the collection (the files are allocated at write time,
no preallocation).
* MongoDB disk writes are lazy, which means that if there are 1,000 increments in one second, the value will only be written once; the physical write occurs a few seconds after the operation.
* In the MongoDB system, mongod is the primary daemon process. The disk holds the data files and the journal files.
* When the mongod is started, the data files are mapped to a shared view.
In other words, the data file is mapped to a virtual address space.
* If the data file is 2,000 bytes on disk, MongoDB maps this to, say, the memory addresses 1,000,000 to 1,002,000.
* Note that the data will not be actually loaded until accessed; the OS just
maps it and keeps it.
* Until now you still have the files backing the memory. Thus any change in memory will be flushed to the underlying files by the OS.
* This is how the mongod works when journaling is not enabled. Every 60
seconds the in-memory changes are flushed by the OS.
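The flush interval is controlled by the syncdelay option of the mongod (an illustrative command; 60 seconds is the default):

# Flush in-memory changes to the data files every 60 seconds (the default)
mongod --dbpath /data/db --syncdelay 60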
In this scenario, let’s look at writes with journaling enabled. When journaling is
enabled, a second mapping is made to a private view by the mongod.
That’s why the amount of virtual memory used by mongod doubles when journaling is enabled.
The data file is not directly connected to the private view, so the
changes will not be flushed from the private view to the disk by the OS.
When a write operation is initiated, it first writes to the private view.
Next, the changes are written to the journal file, appending a brief description of what has changed in the files.
The journal keeps appending the change description as and when it gets the change.
If the mongod fails at this point, the journal can replay all the changes even if the data file is not yet modified, making the write safe at this point.
* The journal will now replay the logged changes on the shared view.
* Finally, the changes are written to the disk very quickly. By default, the OS is requested to do this every 60 seconds by the mongod.
* In the last step, the shared view is remapped to the private
view by the mongod. This is done to prevent the private
view from getting too dirty
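From the application side, a write can be made to wait for the journal commit by using the j option of the write concern (a minimal sketch; the collection and document are examples):

// Acknowledge the write only after it has been committed to the journal
db.orders.insert(
  { item: "book", qty: 1 },
  { writeConcern: { w: 1, j: true } }
)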
GridFS
By design, a MongoDB document (i.e. a BSON object) cannot be larger than 16MB.
This is to keep performance at an optimum level, and the size is well suited for
our needs. For example, 4MB of space might be sufficient for storing a sound clip
or a profile picture. However, if the requirement is to store high-quality audio or movie clips, or even files that are more than several hundred megabytes in size, MongoDB has you covered with GridFS.
* GridFS specifies a mechanism for dividing a large file among multiple documents.
* GridFS uses two collections for storing the file. One collection maintains the metadata
of the file and the other collection stores the file’s data by breaking it into small pieces
called chunks. This means the file is divided into smaller chunks and each chunk is
stored as a separate document. By default the chunk size is limited to 255KB.
* This approach not only makes the storing of data scalable and easy but also makes range queries easier to use when specific parts of a file are retrieved.
* Whenever a file is queried in GridFS, the chunks are reassembled as required by the client. This also provides the user with the capability to access arbitrary sections of the files. For example, the user can directly move to the middle of a video file.
* The GridFS specification is useful in cases where the file size exceeds the default
16MB limitation of MongoDB BSON document. It’s also used for storing files that
you need to access without loading the entire file in memory.
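As an illustrative sketch (the database and file names are examples), a large file can be stored with the mongofiles utility and then inspected from the shell:

# Store a large file in GridFS; metadata goes to fs.files, data to fs.chunks
mongofiles -d media put movie.mp4

// In the mongo shell, inspect the two GridFS collections
use media
db.fs.files.findOne()    // file metadata: filename, length, chunkSize, md5
db.fs.chunks.count()     // number of 255KB chunks the file was split into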
Indexing
An index is a data structure that speeds up the read
operations.
In layman's terms, it is comparable to a book index, where you can reach any chapter by looking up the chapter in the index and jumping directly to the page number, rather than scanning the entire book to reach the chapter, which would be the case if no index existed.
Similarly, an index is defined on fields, which helps in searching for information in a more efficient manner.
1. _id index
This is the default index that is created on the _id field. This index cannot be deleted.
2. Secondary Indexes
All indexes that are created by the user using ensureIndex() in MongoDB are termed secondary indexes.
a. These indexes can be created on any field in the document or a sub-document. Let's consider the following document:
* {"_id": ObjectId(...), "name": "Practical User", "address":
{"zipcode": 201301, "state": "UP"}}
In this document, an index can be created on the name field as
well as the state field.
b. These indexes can be created on a field that is holding a sub-document.
If you consider the above document where address is holding a sub-
document, in that case an index can be created on the address field as
well.
c. These indexes can be created either on a single field or on a set of fields. When created on a set of fields, it's also termed a compound index.
d. If the index is created on a field that holds an array as its value, then a
multikey index is used for indexing each value of the array separately.
Consider the following document:
{ "_id" : ObjectId("..."),"tags" : [ "food", "hot", "pizza", "may" ] }
An index on tags is a multikey index, and it will have the following entries:
{ tags: "food" }
{ tags: "hot" }
{ tags: "pizza" }
{ tags: "may" }
Behaviors and Limitations
* Finally, the following are a few behaviors and limitations that you need to be aware of:
* A collection cannot have more than 64 indexes.
* Index keys cannot be larger than 1024 bytes.
* A document cannot be indexed if its field's value is greater than this size.
* The following command can be used to query documents that are too large to index:
* db.practicalCollection.find({<key>: <too large to index>}).hint({$natural: 1})
* An index name (including the namespace) must be less than 128 characters.
* The insert/update speeds are impacted to some extent by an index.
* Do not maintain indexes that are not used or will not be used.
* Since each clause of an $or query executes in parallel, each can use a different index.
* Queries that use both the sort() method and the $or operator will not be able to use the indexes on the $or fields.
* Queries that use the $or operator are not supported by 2d geospatial queries.
MongoDB Space Is Too Large (Applicable for MMAPv1)
MongoDB (with storage engine MMAPv1 ) space is too large; in other
words, the data directory files are larger than the database’s actual
data. This is because of preallocated data files. This is by design in
order to prevent file system fragmentation.
The files in the data directory are named <dbname>.0, <dbname>.1, and so on. The size of the first file allocated by the mongod is 64MB; each subsequent file increases in size by a factor of 2, so the second file will be 128MB, the third file 256MB, and so on until it reaches 2GB, after which all files will be 2GB in size.
Though the space is allocated to the data files at creation, there might be files that are 90% empty.
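The gap between actual data and allocated space can be observed with db.stats(), as in this sketch (the fields mentioned are the standard ones reported for MMAPv1):

// dataSize reflects the actual data; storageSize and fileSize include
// the preallocated space, so they can be considerably larger
db.stats()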
Memory Issues (Applicable for Storage Engine MMAPv1)
* In MongoDB, memory is managed by memory mapping the entire data
set. It allows the OS to control the memory mapping and allocate the
maximum amount of RAM.
The result is that the performance is non-optimal and the memory
usage cannot be effectively reasoned about.
1. Indexes are memory-heavy; in other words, indexes take up a lot of RAM. Since these are B-tree indexes, defining many indexes can lead to faster consumption of system resources.
2. A consequence of this is that memory is allocated automatically
when required. In a shared environment, it’s trickier to run the
database. In general, as with all database servers, it’s best to run
MongoDB on a dedicated server.
32-bit vs. 64-bit
MongoDB comes in two versions, 32-bit and 64-bit.
Since MongoDB uses memory mapped files, the 32-bit versions are limited
to storing only about 2GB of data.
If you need more data to be stored, you should use the 64-bit build.
Starting from version 3.0, commercial support for 32-bit versions is no
longer provided by MongoDB.
Also, the 32-bit version of MongoDB does not support the WiredTiger
storage engine.
BSON Documents
This section covers the limitations of BSON documents .
Size limits: As with other databases, there's a limit to what can be stored in a document. The current versions support documents up to 16MB in size. This maximum size ensures that a document cannot use excessive RAM or excessive bandwidth while in transmission.
Nested depth limit : In MongoDB, no more than 100 levels of nesting are
supported for BSON documents.
Field names : If you store 1,000 documents with the key “col1”, the key is stored
that many times in the data set. Although arbitrary documents are supported in
MongoDB, in practice most of the field names are the same. Keeping short field
names is considered a good practice for optimizing the usage of space.
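The BSON size of a document can be checked from the shell, as in this sketch (the queried collection is an example):

// Returns the size in bytes of the document's BSON representation
Object.bsonsize(db.users.findOne())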
Namespaces Limits
Be aware of the following limitations from the namespace
perspective.
• Length of a namespace: The length of each namespace, including the collection and database name, must be smaller than 123 bytes.
• Namespace file size (applicable for the MMAPv1 storage engine): A namespace file size cannot be greater than 2047MB. The default size is 16MB; however, this can be configured using the nssize option.
• Number of namespaces (applicable for the MMAPv1 storage engine): Number of namespaces = (namespace file size / 628). A namespace file of 16MB will support approximately 24,000 namespaces.
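The namespace file size is raised via the nssize option when starting the mongod (an illustrative command; the value is in MB):

# Allocate a 32MB namespace file instead of the 16MB default (MMAPv1 only)
mongod --dbpath /data/db --nssize 32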
Indexes Limit
This section covers the limitations of indexing in MongoDB.
• Index size : Indexed items cannot be greater than 1024 bytes.
• Number of indexes per collection : At the most 64 indexes are allowed per
collection.
• Index name length: By default the index name is made up of the field names and the index directions. The index name, including the namespace (which is the database and the collection name), cannot be greater than 128 bytes. If the default index name becomes too long, you can explicitly specify an index name to the ensureIndex() helper.
• Unique indexes in sharded collections: A unique index is supported across shards only when the full shard key is contained as a prefix of the unique index; otherwise, the unique index is not supported across shards. In this case, the uniqueness is enforced on the full key and not on a single field.
• Number of indexed fields in a compound index : This can’t be more than 31
fields.
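A shorter explicit name can be supplied to stay within the limit, as in this sketch (the collection, fields, and index name are examples):

// Provide a short explicit name instead of the auto-generated one
db.users.ensureIndex(
  { name: 1, "address.state": 1, "address.zipcode": 1 },
  { name: "userLocationIdx" }
)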
Capped Collections Limit
Maximum Number of Documents in a Capped Collection
If the max parameter is used for specifying the maximum number of documents in a capped collection, it can't be more than 2^32 documents. However, if no such parameter is used, there's no limit on the number of documents.
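A capped collection with a document cap is created as in this sketch (the name and sizes are examples):

// Capped collection limited to 1MB of storage and at most 1,000 documents
db.createCollection("eventLog", { capped: true, size: 1048576, max: 1000 })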
Sharding Limitations
Sharding is the mechanism of splitting data across shards. The following sections talk about the limitations that you
need to be aware of when dealing with sharding.
Shard Early to Avoid Any Issues
Using the shard key, the data is split into chunks, which are then automatically distributed amongst the shards.
However, if sharding is implemented late, it can cause slowdowns of the servers because the splitting and migration of chunks takes time and resources.
A simple solution is to monitor your MongoDB instance capacity using tools such as MongoDB Cloud Manager
(flush time, lock percentages, queue lengths, and faults are good measures) and shard before reaching 80% of the
estimated capacity.
Shard Key Can’t Be Updated
The shard key can’t be updated once the document is inserted in the collection because MongoDB uses shard keys to
determine to which shard the document should be routed. If you want to change the shard key of a document, the
suggested solution is to remove the document and reinsert it once the change has been made.
Shard Collection Limit
The collection should be sharded before it reaches 256GB.
Select the Correct Shard Key
It’s very important to choose a correct shard key because once the key is chosen it’s not easy to correct it.
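Sharding a collection early might look like the following sketch (the database, collection, and shard key are examples):

// Enable sharding on the database, then shard the collection on a chosen key
sh.enableSharding("ecommerce")
sh.shardCollection("ecommerce.orders", { customerId: 1 })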
Security Limitations
Security is an important matter when it comes to databases.
No Authentication by Default
Although authentication is not enabled by default, it’s fully
supported and can be enabled easily.
Traffic to and from MongoDB Isn’t Encrypted
By default the connections to and from MongoDB are not
encrypted. When running on a public network, consider
encrypting the communication; otherwise it can pose a
threat to your data. Communications on a public network
can be encrypted using the SSL-supported build of MongoDB,
which is available in the 64-bit version only.
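As an illustrative sketch (the key file path is an assumption), authentication and encrypted transport are both enabled at startup:

# Require authentication and SSL-encrypted client connections
mongod --dbpath /data/db --auth --sslMode requireSSL --sslPEMKeyFile /etc/ssl/mongodb.pem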
Write and Read Limitations
Case-Sensitive Queries
By default MongoDB is case sensitive.
Type- Sensitive Fields
Since there’s no enforced schema for documents in MongoDB, it can’t know you are
making a mistake. You must make sure that the correct type is used for the data.
No JOIN
Joins are not supported in MongoDB. If you need to retrieve data from more than one
collection, you must do more than one query.
Transactions
MongoDB only supports single-document atomicity. Since a write operation can modify multiple documents, such an operation is not atomic. However, you can isolate write operations that affect multiple documents using the $isolated operator.
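A sketch of an isolated multi-document update (the collection and fields are examples):

// $isolated prevents other clients from interleaving with this
// multi-document update; note it is not an all-or-nothing transaction
db.accounts.update(
  { status: "pending", $isolated: 1 },
  { $set: { status: "processed" } },
  { multi: true }
)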
Replica Set Limitations - Number of Replica Set Members
A replica set is used to ensure data redundancy in MongoDB. One member acts as a
primary member and the rest act as secondary members. Due to the way voting works with MongoDB, you must use an odd number of members.
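An illustrative three-member replica set configuration (the host names are assumptions):

// Initiate a replica set with an odd number (three) of voting members
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1.example.net:27017" },
    { _id: 1, host: "mongo2.example.net:27017" },
    { _id: 2, host: "mongo3.example.net:27017" }
  ]
})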
MongoDB Not Applicable Range
MongoDB is not suitable for the following:
1. Highly transactional systems such as accounting or banking systems. Traditional RDBMSs are still more suitable for such applications, which require a large number of complex, atomic transactions.
2. Traditional business intelligence applications, where an
issue-specific BI database would generate highly optimized
queries. For such applications, the data warehouse may be a
more appropriate choice.
3. Applications requiring complex SQL queries.
4. Systems requiring transactional operations: MongoDB does not support multi-document transactions, so a banking system certainly cannot use it.