As AI applications increasingly rely on understanding meaning rather than simply matching keywords, vector search has become a key technique for modern, data-driven systems. Instead of comparing text directly, vector search represents data as numerical embeddings and finds items that are semantically similar based on their vector distance.
1. Understanding Vector Search
To better understand vector search, let’s first look at what a vector actually is. A vector is simply a list of numbers that represents specific features of an object, for example, the meaning of a sentence, the style of an image, or the characteristics of a product. These numbers are generated by machine learning models that convert complex data like text or images into embeddings, which are numerical representations that capture their deeper meaning.
When we perform a vector search, we are not looking for exact word matches. Instead, we compare these numerical embeddings to find items that are similar in meaning. The closer two embeddings are in this multi-dimensional space, the more semantically similar they are.
In traditional keyword search, results are matched only by the exact words used. For example, if you search for "Cat sits on mat", a keyword-based system might return sentences such as "A cat sits on a soft mat" or "The cat sits on the mat near the window" because they share the same words. However, it would miss similar sentences like "Kitten sleeps" even though the meaning is closely related.
Vector search works differently. It focuses on finding sentences with similar meanings rather than identical wording. If you search for "Cat sits on mat", the system could also find results such as "Kitten sleeps on the rug" or "A small pet rests on the carpet." Even though these examples use different words, the AI model recognizes that their meanings are related.
This happens because the AI model converts text into numerical vectors called embeddings that capture the underlying concepts. These vectors can then be placed in a multi-dimensional space, where their positions reflect how similar their meanings are. Texts with related meanings, such as "cat" and "kitten," appear close to each other, while unrelated concepts like "dog" are positioned farther away.
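To make "closeness" concrete, here is a minimal JavaScript sketch of cosine similarity, one of the most common vector distance measures. The 3-dimensional vectors and their values are invented purely for illustration; real embeddings have hundreds or thousands of dimensions.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1 means "same direction" (very similar meaning); values near 0 mean unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional "embeddings" (invented values for illustration only).
const cat    = [0.9, 0.8, 0.1];
const kitten = [0.85, 0.75, 0.2];
const dog    = [0.1, 0.3, 0.9];

console.log(cosineSimilarity(cat, kitten)); // close to 1: related meanings
console.log(cosineSimilarity(cat, dog));    // noticeably lower: less related
```

In a real system, the numbers come from an embedding model rather than being hand-written, but the comparison step works exactly like this.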
2. Exploring Vector Search in MongoDB
Now that we understand the idea behind vector search, let’s look at how MongoDB puts it into practice. Vector search is available in MongoDB Community Edition, Enterprise Advanced, and Atlas.
MongoDB allows you to store numerical embeddings directly inside your documents. Each embedding is saved as an array of numbers that represents the meaning of a piece of text, an image, or any other type of data you want to search through. However, MongoDB itself cannot transform words, sentences, or images into embeddings. To create these embeddings, you need to use an external AI model or service, such as Voyage AI, which generates the vector representations that MongoDB can then index and query.
To make this type of search efficient, MongoDB uses a vector index powered by Apache Lucene. Apache Lucene is a high-performance, open-source search library written in Java that provides advanced text and vector search capabilities. It allows MongoDB to efficiently store, organize, and compare high-dimensional embeddings for rapid similarity computation.
The vector index arranges stored vectors in a way that makes it fast to find those closest to your query vector. When a similarity query is executed, MongoDB calculates the distance between the query vector and the stored embeddings, returning the documents that are most semantically related.
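What the index accelerates can be sketched as a brute-force scan in plain JavaScript: score every stored embedding against the query vector, sort, and keep the top k. The documents and 2-dimensional embeddings below are hypothetical; a real vector index avoids comparing the query against every document.

```javascript
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical documents with tiny 2-dimensional embeddings.
const docs = [
  { title: "A cat sits on a soft mat",     embedding: [0.95, 0.10] },
  { title: "Kitten sleeps on the rug",     embedding: [0.90, 0.20] },
  { title: "Stock prices fell on Tuesday", embedding: [0.05, 0.98] },
];

// Brute-force k-nearest-neighbour search: score all, sort, take the top k.
function knnSearch(queryVector, k) {
  return docs
    .map(d => ({ title: d.title, score: cosineSimilarity(queryVector, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// A query vector near the "cat" region returns the two cat-related documents.
console.log(knnSearch([0.92, 0.15], 2).map(r => r.title));
```

This linear scan is O(n) per query; the Lucene index replaces it with a graph structure that inspects only a small fraction of the stored vectors.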
This combination of flexible data storage, Lucene-powered vector indexing, and efficient similarity search enables MongoDB to support intelligent features such as semantic search, recommendation engines, and generative AI applications, all within a unified database environment.
3. Creating Your Local MongoDB Atlas Cluster via the Atlas CLI
Now, let's move on to practice. We will need a MongoDB database with vector search enabled. The easiest way to set this up is by using the Atlas CLI. You'll also need mongosh, the MongoDB Shell, to interact with your database from the command line.
On macOS, you can install both Atlas CLI and mongosh using Homebrew:
brew tap mongodb/brew
brew install mongodb-atlas
brew install mongosh
On Linux, installation is just as simple:
curl -s https://mongodb.dev/cli | bash
Once installed, you can verify the installations:
atlas --version
mongosh --version
Note: Atlas CLI uses Docker internally to create local MongoDB environments that replicate the Atlas runtime. Before running atlas deployments setup, make sure that Docker is installed and the Docker daemon is running. Without it, local deployments won’t work.
Once Docker is running, launch the interactive setup:
atlas deployments setup
Choose the local option, accept the defaults, and specify a port (e.g., 27017). The CLI will spin up a containerized MongoDB 8.0 replica set with Atlas-compatible features. Now, you can list active deployments:
atlas deployments list
NAME TYPE MDB VER STATE
local813 LOCAL 8.0.11 IDLE
3.1 Connecting to the Local Atlas Deployment
To connect to your local deployment, simply run:
atlas deployments connect
You’ll be prompted to choose how you want to connect.
For example:
? How would you like to connect to local813?
> mongosh - MongoDB Shell
compass - MongoDB Compass
vscode - MongoDB for VSCode
connectionString - Connection String
Selecting mongosh will launch an interactive session connected to your local MongoDB replica set. You can now run queries, create indexes, test aggregations, or explore features like Atlas Search and Vector Search.
AtlasLocalDev local813 [direct: primary] test> show dbs
admin 256.00 KiB
config 232.00 KiB
local 588.00 KiB
AtlasLocalDev local813 [direct: primary] test>
3.2 Loading a Sample Dataset into the Local Atlas Cluster
As you can see above, only three databases were created: admin, config, and local. The admin database is used for administrative commands, config stores metadata for sharded clusters, and local holds data specific to the local instance, such as replication information.
To start performing vector searches in MongoDB, you need to generate vector embeddings from text using AI models such as Voyage AI and insert them into your collection, or use pre-generated vectors available in the MongoDB Sample Dataset. Let’s load that dataset into a local Atlas cluster using the mongorestore tool.
mongorestore is a MongoDB command-line tool used to import data into a database from a backup created with mongodump or from an archive file. It restores collections, databases, and entire datasets into a running MongoDB instance.
On macOS, you can simply run:
brew install mongodb-database-tools
On Linux (RHEL-based distributions):
sudo yum install -y mongodb-database-tools-*-100.13.0.rpm
Now, download the sample dataset using curl, as shown in the example below:
curl https://atlas-education.s3.amazonaws.com/sampledata.archive -o sampledata.archive
If the curl command above runs successfully, you should have the sampledata.archive file on your disk.
Next, find the connection string for your local Atlas cluster with:
atlas deployments connect --connectWith connectionString
You will get a connection string similar to "mongodb://localhost:55015/?directConnection=true". Then, load the sample dataset using the mongorestore tool and the connection string, as shown below:
mongorestore --archive=sampledata.archive --uri "mongodb://localhost:55015/?directConnection=true"
Note that you're providing sampledata.archive as the input file.
Restoring the sample dataset should complete within a few minutes.
After reconnecting to your local Atlas cluster, run show dbs to confirm that the new sample_mflix database has been added. It includes the embedded_movies collection with pre-generated vector embeddings from the MongoDB sample dataset.
3.3 Finding Embeddings
Now, after loading the sample dataset, when you run the command show dbs, you will see the new databases appear, as shown below:
AtlasLocalDev local813 [primary] test> show dbs
sample_airbnb 67.26 MiB
sample_analytics 18.90 MiB
sample_geospatial 2.00 MiB
sample_guides 72.00 KiB
sample_mflix 162.15 MiB
sample_restaurants 10.38 MiB
sample_supplies 1.85 MiB
sample_training 75.65 MiB
sample_weatherdata 4.32 MiB
admin 356.00 KiB
local 34.27 GiB
AtlasLocalDev local813 [primary] test>
Switch to the sample_mflix database, using use sample_mflix as shown below, and take note of the embedded_movies collection:
AtlasLocalDev local813 [primary] sample_mflix> use sample_mflix
already on db sample_mflix
AtlasLocalDev local813 [primary] sample_mflix> show collections
comments
embedded_movies
movies
sessions
theaters
users
AtlasLocalDev local813 [primary] sample_mflix>
This collection includes information about movies that belong to the Western, Action, or Fantasy genres. Each document represents a single movie and contains details such as the movie’s title, release year, and cast.
Additionally, the documents in this collection feature the following fields:
- plot_embedding_voyage_3_large: Stores 2048-dimensional embeddings generated from the plot field using Voyage AI’s voyage-3-large embedding model. The data is stored as binData for efficient storage and retrieval.
- plot_embedding: Stores 1536-dimensional embeddings generated from the plot field using OpenAI’s text-embedding-ada-002 model, also converted to binData for efficient storage and retrieval.
binData (binary data) is a special data type in MongoDB used to store binary-encoded information, such as images, files, or numerical vectors like embeddings. The embeddings are stored as binData to make them more compact and efficient for storage and retrieval. Instead of saving them as long lists of numbers, MongoDB saves the binary representation, which reduces storage size and improves query performance. It is especially useful when working with large vector data.
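The size difference is easy to demonstrate in plain JavaScript. The sketch below compares a 1536-dimensional vector serialized as JSON text against the same values packed as 32-bit floats, which is the idea behind the binData representation (the random values are stand-ins for a real embedding):

```javascript
// A 1536-dimensional vector of random values in [-1, 1], standing in for a real embedding.
const embedding = Array.from({ length: 1536 }, () => Math.random() * 2 - 1);

// Stored as JSON text, every value costs ~18-20 characters.
const asJson = JSON.stringify(embedding);

// Packed as 32-bit floats, every value costs exactly 4 bytes.
const asFloat32 = new Float32Array(embedding);

console.log("JSON text size (bytes):", asJson.length);
console.log("Float32 binary size (bytes):", asFloat32.byteLength); // 1536 * 4 = 6144
```

The packed form is several times smaller, and the savings compound across every document in a large collection.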
These embeddings were previously generated using the corresponding embedding models—Voyage AI’s voyage-3-large and OpenAI’s text-embedding-ada-002—and then stored in the collection within the MongoDB sample dataset.
This means that for testing and exploration purposes, you don’t need to create or compute these embeddings manually. They are already included as part of the dataset, allowing you to focus on experimenting with queries, similarity searches, and other vector operations without worrying about data preparation.
To retrieve a document from the embedded_movies collection within this database, run the following command:
db.getSiblingDB("sample_mflix").embedded_movies.findOne()
This command queries the sample_mflix.embedded_movies namespace and returns a single document containing standard movie metadata such as title, cast, genres, and release date. It also includes one or more vector embeddings of the plot field, which are stored as Float32Array binaries.
Here is a simplified example of the returned document:
{
"_id": ObjectId("573a1392f29313caabcd9ca6"),
"title": "Scarface",
"plot": "An ambitious and near insanely violent gangster climbs the ladder of success...",
"plot_embedding": Binary.fromFloat32Array(new Float32Array([
-0.0155, -0.0342, 0.0152, -0.0426, -0.0208, 0.0263,
// ... 1436 more values ...
])),
"plot_embedding_voyage_3_large": Binary.fromFloat32Array(new Float32Array([
-0.0300, 0.0311, -0.0156, -0.0366, 0.0248, 0.0085,
// ... 1948 more values ...
]))
}
The example includes two different embeddings of the same plot: plot_embedding contains 1536-dimensional vectors generated using OpenAI’s text-embedding-ada-002 model, while plot_embedding_voyage_3_large contains 2048-dimensional vectors from Voyage AI’s voyage-3-large model.
Now, you just need to create a vector index on the embedding field, and you’ll be ready to perform semantic search.
3.4 Building a Vector Index
The final step is creating a vector search index, which enables efficient similarity searches between stored embeddings in the database. Use the createSearchIndex helper to define a vector index on the plot_embedding field. This enables fast similarity search over its 1536-dimensional vectors; running createSearchIndex triggers the index build.
db.getSiblingDB("sample_mflix").embedded_movies.createSearchIndex({
name: "plot_embedding_index",
definition: {
mappings: {
dynamic: false,
fields: {
plot_embedding: {
type: "knnVector",
dimensions: 1536,
similarity: "cosine"
}
}
}
}
})
Let’s take a look at the command above. The parameter dynamic: false means that only explicitly defined fields are indexed. Setting it to true automatically indexes all fields, but false is recommended for performance when you know your schema.
The parameter dimensions: 1536 must match the output size of the embedding model, because different models produce embeddings with different dimensionalities. For example, OpenAI's text-embedding-ada-002 outputs 1536 dimensions, while Voyage AI's voyage-3-large outputs 2048.
During the index build, MongoDB reads all values from the plot_embedding field and converts them into Lucene’s internal vector format. Once the initial index is built, MongoDB keeps the Apache Lucene index synchronized with the collection using an internal change stream mechanism. This means that any document inserts, updates, or deletions are automatically captured and applied to the Lucene index in near real time, ensuring that search results always reflect the latest state of the data.
The plot_embedding field is indexed as a knnVector, a specialized vector field designed for storing high-dimensional numeric data. Cosine means the similarity between vectors is based on the angle between them, where a smaller angle indicates higher similarity regardless of magnitude. Internally, the knnVector index uses an Approximate Nearest Neighbor (ANN) algorithm based on Lucene’s Hierarchical Navigable Small World (HNSW) graph, which accelerates similarity searches by approximating the nearest vectors instead of computing exact distances for every document.
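The magnitude-independence of cosine similarity can be verified in a few lines of JavaScript. Scaling a vector changes its length but not its direction, so its cosine similarity with the original stays at 1 (toy values, for illustration only):

```javascript
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const v = [0.2, 0.5, 0.8];
const scaled = v.map(x => x * 10); // same direction, 10x the magnitude

// Cosine similarity depends only on the angle between vectors,
// so scaling one of them changes nothing.
console.log(cosineSimilarity(v, scaled).toFixed(6)); // 1.000000
```

This is why cosine is a common choice for text embeddings: what matters is the direction the model assigns to a meaning, not how long the resulting vector happens to be.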
To confirm the index exists, run:
db.getSiblingDB("sample_mflix").embedded_movies.getSearchIndexes()
You're now ready to run similarity queries against this field using the $vectorSearch operator.
4. Understanding the $vectorSearch Operator
The $vectorSearch stage in MongoDB is an aggregation pipeline stage that allows you to perform Approximate Nearest Neighbor (ANN) or Exact Nearest Neighbor (ENN) searches within your collections. This means that $vectorSearch enables you to find items in your database that are similar in meaning to a given example, not just those that match keywords exactly.
ANN and ENN are two algorithms used to identify the most similar items in your data.
ANN finds similar items faster, with a small trade-off in accuracy, while ENN produces exact matches but is typically slower.
You can also combine similarity search with regular MongoDB filters such as $eq for equality, $gte for greater than or equal, or $lt for less than. This makes it possible to search for semantically related results while still applying additional filters like category, date, or price.
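Outside the database, the same filter-plus-similarity idea can be sketched in plain JavaScript: restrict the candidates with ordinary predicates (the analogue of $eq and $gte), then rank the survivors by vector similarity. The movie data and 2-dimensional embeddings below are hypothetical.

```javascript
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical movie documents with tiny stand-in embeddings.
const movies = [
  { title: "Movie A", genre: "Crime",  year: 1995, embedding: [0.9, 0.1] },
  { title: "Movie B", genre: "Crime",  year: 1983, embedding: [0.8, 0.3] },
  { title: "Movie C", genre: "Comedy", year: 1996, embedding: [0.95, 0.05] },
];

// Filter first (like $eq on genre and $gte on year), then rank by similarity.
function filteredSearch(queryVector, genre, minYear) {
  return movies
    .filter(m => m.genre === genre && m.year >= minYear)
    .map(m => ({ title: m.title, score: cosineSimilarity(queryVector, m.embedding) }))
    .sort((a, b) => b.score - a.score);
}

console.log(filteredSearch([0.9, 0.1], "Crime", 1990)); // only "Movie A" survives the filter
```

Inside MongoDB the filtering and ranking are handled by the index rather than a linear scan, but the result is the same: semantically ranked documents that also satisfy the metadata constraints.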
The $vectorSearch stage takes a vector as an input parameter.
You cannot pass a text string such as "cat sitting on a mat" directly into $vectorSearch, because MongoDB does not automatically convert text into vectors.
The embeddings must be generated externally using a machine learning model such as OpenAI’s text-embedding-ada-002 or Voyage AI’s voyage-3-large, which transform text into numerical representations that can be compared mathematically.
However, for this tutorial, there is no need to generate your own embeddings.
We can use the MongoDB sample dataset, which already includes precomputed vector fields created with these models. One of these embeddings can be used directly as the query input parameter for $vectorSearch.
4.1 Running a Vector Similarity Search
Let’s say we want to use the embedding of the plot field as the $vectorSearch input parameter to find movies with plot descriptions similar to Scarface in the embedded_movies collection.
In the embedded_movies collection, one of the documents represents the movie Scarface. You can view it using the following command:
db.getSiblingDB("sample_mflix").embedded_movies.find({ title: "Scarface" })
The document includes standard fields such as title, plot, runtime, fullplot, and genres, along with two vector embeddings: plot_embedding and plot_embedding_voyage_3_large.
{
_id: ObjectId('573a1392f29313caabcd9ca6'),
plot: 'An ambitious and near insanely violent gangster climbs the ladder of success in the mob, but his weaknesses prove to be his downfall.',
genres: [ 'Action', 'Crime', 'Drama' ],
runtime: 93,
rated: 'PASSED',
cast: [ 'Paul Muni', 'Ann Dvorak', 'Karen Morley', 'Osgood Perkins' ],
num_mflix_comments: 1,
poster: 'https://m.media-amazon.com/images/M/MV5BYmMxZTU2ZDUtM2Y1MS00ZWFmLWJlN2UtNzI0OTJiOTYzMTk3XkEyXkFqcGdeQXVyMjUxODE0MDY@._V1_SY1000_SX677_AL_.jpg',
title: 'Scarface',
fullplot: "Johnny Lovo rises to the head of the bootlegging crime syndicate on the south side of Chicago following the murder of former head, Big Louis Costillo. Johnny contracted Big Louis' bodyguard, Tony Camonte, to make the hit on his boss. ...",
...
"plot_embedding": [-0.0065, -0.0334, -0.0149, -0.0390, -0.0114, 0.0089, -0.0314, -0.01881, -0.0534,-0.0734, -0.016608...],
"plot_embedding_voyage_3_large": [-0.0376, 0.0339, -0.0164, -0.0154,-0.0134,-0.5164, -0.0371, -0.01881, -0.016608, 0.0920, 0.0474, ...]
}
The plot_embedding and plot_embedding_voyage_3_large fields are vector representations generated from the plot field. The plot_embedding field contains a 1536-dimensional vector created using OpenAI’s text-embedding-ada-002 model, while plot_embedding_voyage_3_large stores a 2048-dimensional vector produced by Voyage AI’s voyage-3-large model.
Let’s use the plot_embedding field, generated by OpenAI’s text-embedding-ada-002 model. To begin, extract the embedding vector from the Scarface document, store it in a variable, and use it as our query vector:
const scarfaceEmbedding = db.getSiblingDB("sample_mflix").embedded_movies.findOne(
{ title: "Scarface" },
{ plot_embedding: 1, _id: 0 }
).plot_embedding
Once retrieved, the embedding is stored in the scarfaceEmbedding variable. This variable holds the 1536-dimensional vector that numerically represents the semantic meaning of Scarface's plot. To view its contents, simply type the variable name in the MongoDB Shell (mongosh):
scarfaceEmbedding
You will see an output similar to the following, which shows the vector stored as a binary Float32 array:
Binary.fromFloat32Array(new Float32Array([
-0.01557971816509962, -0.034283190965652466, 0.015228296630084515,
-0.0426131971180439, -0.020851051434874535, 0.026343651115894318,
-0.0049882433377206326, -0.004425317049026489, -0.023844648152589798,
-0.0199139267206192, 0.02795759029686451, 0.00792001560330391,
0.012638184241950512, 0.00531038036569953, 0.01340610720217228,
-0.02022630162537098, 0.039020881056785583, -0.010965675115585327,
0.002731656888499856, -0.014863857999444008, -0.005915607325732708,
-0.004200797062367201, -0.019367268308997154, 0.0013357298448681831,
0.004958957899361849, 0.008720477111637592, 0.010080611333251,
-0.018820611760020256, 0.02522430568933487, -0.01318484079092741,
0.028946777805685997, -0.00011225987691432238, 0.003566284663975239,
-0.003471921430900693, -0.013627372682094574, -0.010366955772042274,
-0.00012344519200269133, -0.010308384895324707, -0.012221683748066425,
-0.010998213663697243, 0.015020046383142471, -0.0034849371295422316,
-0.0035565230064094067, -0.029259154573082924, -0.019965987652540207,
-0.01062076073139906, 0.0012934290571138263, -0.019497426226735115,
-0.03035246767103672, 0.0035988239105790854, 0.010985198430716991,
0.006631467491388321, -0.012872465886175632, -0.014056889340281487,
0.004435078706592321, 0.003715964499861002, 0.005362442694604397,
0.003130260854959488, 0.001144562615081668, -0.002655190182849765,
0.0086619071662426, -0.00901332963258028, -0.004132464993745089,
-0.011551378294825554, 0.005554423667490482, 0.005447044502943754,
-0.02154088020324707, 0.005684579722583294, -0.002694237045943737,
-0.014759733341634274, 0.04068688303232193, 0.034621596336364746,
0.0036606481298804283, -0.0037908044178038836, 0.015683842822909355,
-0.0061075882986187935, -0.02630460448563099, 0.00940379872918129,
0.0009387528989464045, 0.013848639093339443, 0.010894089005887508,
-0.022972600534558296, -0.01813078299164772, 0.02532843127846718,
0.00687876483425498, 0.007510023191571236, -0.011317097581923008,
0.017336830496788025, -0.00783541426062584, -0.005121653433889151,
-0.015098139643669128, 0.002721895230934024, 0.021007239818572998,
0.011941847391426563, -0.018885690718889236, -0.0031286340672522783,
-0.026551900431513786, -0.0012560090981423855, 0.0047507076524198055,
-0.028035683557391167,
... 1436 more items
]))
Now that the scarfaceEmbedding variable contains our query vector, we can use it as input for the $vectorSearch operator.
This operator will compare the Scarface plot embedding with the embeddings of all other movies in the embedded_movies collection to find those with the most similar semantic meaning.
Run the $vectorSearch stage on the plot_embedding field to compare the Scarface plot embedding with all other movie embeddings in the collection. Use a $match stage to exclude Scarface itself from the search results. This is necessary because the query vector is taken directly from Scarface's own embedding, meaning it would otherwise match itself with a perfect similarity score of 1.0.
db.getSiblingDB("sample_mflix").embedded_movies.aggregate([
{
$vectorSearch: {
index: "plot_embedding_index",
path: "plot_embedding",
queryVector: scarfaceEmbedding,
numCandidates: 200,
limit: 9
}
},
{
$match: { title: { $ne: "Scarface" } }
},
{
$project: {
_id: 0,
title: 1,
genres: 1,
score: { $meta: "vectorSearchScore" }
}
}
])
The $vectorSearch stage returns the nine closest candidates, and after the $match stage filters out Scarface itself, eight movies remain whose plot embeddings are most semantically similar to Scarface. A higher score means a stronger semantic relationship between the plots.
[
{
genres: [ 'Action', 'Drama' ],
title: 'Rough Cut',
score: 0.9522117376327515
},
{
genres: [ 'Action', 'Crime', 'Drama' ],
title: 'A Bittersweet Life',
score: 0.9522027969360352
},
{
genres: [ 'Action', 'Drama', 'Thriller' ],
title: 'Billa 2',
score: 0.9512894153594971
},
{
genres: [ 'Action', 'Crime', 'Drama' ],
title: 'Running Scared',
score: 0.948520302772522
},
{
genres: [ 'Action', 'Crime', 'Drama' ],
title: 'New Jack City',
score: 0.9483931660652161
},
{
genres: [ 'Action', 'Crime', 'Drama' ],
title: 'Ghost Dog: The Way of the Samurai',
score: 0.9477109909057617
},
{
genres: [ 'Action', 'Crime', 'Drama' ],
title: 'Twelve',
score: 0.946926474571228
},
{
genres: [ 'Action', 'Crime', 'Drama' ],
title: 'Singham',
score: 0.9466689825057983
}
]
The results above show that several films, including Rough Cut, A Bittersweet Life, Billa 2, New Jack City, Ghost Dog: The Way of the Samurai, and Singham, achieve similarity scores above 0.94.
These high scores indicate that their plot descriptions share strong semantic similarities with Scarface, capturing comparable narrative elements such as crime, ambition, violence, and moral tension.
Most of these movies also fall under the Action, Crime, and Drama genres, which confirms that the vector search accurately identifies films with related themes and storylines.
5. Summary
In MongoDB, vector search works by comparing high-dimensional numerical embeddings that represent the meaning of text, images, or other types of data. These embeddings are not generated by MongoDB itself but are created externally using machine learning models such as OpenAI’s text-embedding-ada-002 or Voyage AI’s voyage-3-large. After generation, the embeddings are stored in MongoDB documents as binary data for efficient storage and retrieval.
A vector index powered by Apache Lucene organizes the stored embeddings to enable fast similarity calculations. When a query vector is provided, MongoDB measures the cosine distance between the query and stored vectors to find the most semantically related documents. This design allows MongoDB to integrate with external AI models and deliver semantic search and recommendation features directly within the database.