
Indexing vectors in PostgreSQL

04.2025

In the previous post, we imported a fairly large data set containing Wikipedia data, which we downloaded using pgai. However, importing all this data is not enough; we also need to keep an eye on efficiency. Indexing is the key to success here.

In general, pgvector provides us with two types of indexes that we can use:

- hnsw
- ivfflat

The core questions are: when do we need which type of index, and how can we create those indexes in the first place?

In general, HNSW (a multilayer graph) is faster at query time ("SELECT") but a LOT slower during index creation, as we will see a little later. If index creation time is not an issue, HNSW is definitely a good choice; if creation time does matter, it is far less attractive.

Indexing a table containing vectors

To show how indexing can be done, we are using the following table, which has been created by the vectorization process described in the previous post:
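The exact layout depends on the vectorizer configuration used in the previous post; as a rough sketch, the chunk table produced by pgai looks something like this (table and column names here are illustrative, not the literal ones):

```sql
-- Illustrative layout only: the real names and the vector dimension
-- come from the pgai vectorizer configuration in the previous post.
CREATE TABLE wiki_embedding_store (
    id          bigint,        -- reference to the source Wikipedia row
    chunk_seq   integer,       -- position of the chunk within the document
    chunk       text,          -- the chunked text itself
    embedding   vector(768)    -- the embedding produced by the model
);
```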

Let us recall the content of those two tables (raw data and vector data). The relation containing the Wikipedia data holds 6.4 million rows, which are broken down into 41 million chunks and turned into vectors:
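A simple pair of counts confirms those numbers (table names are assumed here and should be adjusted to your own schema):

```sql
-- Hypothetical table names; adjust to the names your vectorizer created.
SELECT count(*) FROM wiki;                  -- roughly 6.4 million source rows
SELECT count(*) FROM wiki_embedding_store;  -- roughly 41 million chunks
```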

As a next step, we want to index those tables. Various parameters are essential to achieving good indexing performance. If you want to know more, we have prepared some technical information on tuning index creation in PostgreSQL, which hopefully sheds some light on the topic.

In my case, the machine is large enough to justify a fairly high value for maintenance_work_mem. One thing is important to note here: people often ask what happens in the case of parallel index creation. Is this parameter applied "per CREATE INDEX" or "per process involved in CREATE INDEX"? The answer is: it depends on the type of index you are creating. For commonly used index types, such as standard B-trees, it is the overall amount of memory allowed for the entire operation. To be precise, though, each index type can decide this on its own, so the behavior might differ for some of the rarely used index types out there.
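Raising the parameter for the current session is a one-liner (the value shown is merely an example for a machine with plenty of RAM, not a recommendation):

```sql
-- Illustrative value; size this according to the RAM you can spare.
SET maintenance_work_mem = '16GB';
```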

Anyway, let me ensure that PostgreSQL is using multiple processes to speed up index creation even more:
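The relevant knob is max_parallel_maintenance_workers, which caps the number of extra worker processes a single CREATE INDEX may use (the value below is illustrative):

```sql
-- Allow additional parallel workers for CREATE INDEX (illustrative value).
SET max_parallel_maintenance_workers = 8;
```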

Finally, we can deploy the index. For a start, we will create an "HNSW" index using cosine distance to organize data inside the index:
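Using the assumed table and column names from above, the statement looks like this; the vector_cosine_ops operator class is what tells pgvector to organize the HNSW graph by cosine distance:

```sql
-- Table and column names are assumptions carried over from above.
CREATE INDEX ON wiki_embedding_store
    USING hnsw (embedding vector_cosine_ops);
```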

It is an understatement to say that creating the index takes a while. In fact, it takes half a day to build this index.


Monitoring HNSW index creation

When index creation starts, we see that PostgreSQL automatically fires up a couple of parallel processes to work on the index. The CPU is fully loaded, as we can see various processes operating at full speed:
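Besides watching the operating system's process list, one can also follow the build from within PostgreSQL via the standard progress-reporting view, which covers the leader as well as its parallel workers:

```sql
-- Watch the running CREATE INDEX, including its parallel workers.
SELECT pid, phase, blocks_done, blocks_total, tuples_done, tuples_total
FROM   pg_stat_progress_create_index;
```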

The question that naturally arises is: what eats up so much CPU? Well, the profiler reveals the secret:

It is all about a single function that is called over and over again. But what does it actually do? Here is the code:

Welcome to one of the most important functions in the realm of machine learning and AI. In our case, this code runs on the CPU. Of course, it could run on a GPU instead, but that is not so easy given the standard interfaces this implementation is built on. However, while there is room for improvement, it at least means that the problem is well understood.

What does the final index look like? Before we answer that, we will inspect the table size first:
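In psql, \dt+ shows the sizes directly; alternatively, a query along these lines does the same (the table name is again an assumption):

```sql
-- Assumed table name; \dt+ in psql reports the same information.
SELECT pg_size_pretty(pg_total_relation_size('wiki_embedding_store'));
```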


The first observation is that the vector data adds a considerable amount of space. The same is true for the index we have just created:
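The index size can be checked the same way (the index name shown is the kind of name PostgreSQL generates automatically; check \di to find the real one):

```sql
-- The index name is an assumption; \di lists the auto-generated name.
SELECT pg_size_pretty(pg_relation_size('wiki_embedding_store_embedding_idx'));
```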

Wow, the index is 77 GB, which is a lot more than the underlying data.

Creating an IVFFLAT index with pgvector

What we have just seen is that building an HNSW index takes a long time. In exchange, it provides better query performance.

To compare, we can take a look at IVFFLAT indexes. Here is how we can generate one:
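With the assumed names from above, the statement looks like this; the lists storage parameter controls how many clusters IVFFLAT partitions the data into (the value below is illustrative, a common rule of thumb being rows / 1000 for large tables):

```sql
-- "lists" controls the number of clusters; the value is illustrative.
CREATE INDEX ON wiki_embedding_store
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 1000);
```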

The first thing to note is that index creation shows a totally different CPU usage pattern: consumption is far lower. We can clearly see this in the end result:

Wow, we did all of this in 38 minutes. A significant fraction of that time was spent on I/O, while the CPU did not spend hour after hour on vector products. However, as stated before: while indexing is faster, there are performance implications at query time, which we will discuss in one of the next blogs.

The size of this index is around 63 GB, so the difference in index size is not really relevant.

Summary

For those of you out there interested in PostgreSQL and pgvector, the key takeaway is: not all indexes are created equal, and the choice of index type can impact various aspects of performance, both at creation time and on the query side (which will be covered in more detail in one of the next postings).

© 2025 CYBERTEC PostgreSQL International GmbH