In the previous post, we imported a fairly large data set containing Wikipedia data, which we downloaded using pgai. However, importing all this data is not enough; we also have to keep an eye on efficiency. Therefore, it is important to understand that indexing is the key to success.
In general, pgvector provides us with two types of indexes that we can use:
- hnsw
- ivfflat
The core questions are: When do we need which type of index, and how can we create those indexes in the first place?
In general, HNSW (a multi-layer graph) is faster at query time ("SELECT") but a LOT slower during index creation, as we will see a little later. If index creation time is not an issue, HNSW is definitely a good choice; if creation time does matter, the decision is not so clear.
To show how indexing can be done, we are using the following table, which has been created by the vectorization process described in the previous post:
cybertec=# \d demo_wiki_emb_store
                     Table "public.demo_wiki_emb_store"
     Column     |    Type     | Collation | Nullable |      Default
----------------+-------------+-----------+----------+-------------------
 embedding_uuid | uuid        |           | not null | gen_random_uuid()
 id             | text        |           | not null |
 chunk_seq      | integer     |           | not null |
 chunk          | text        |           | not null |
 embedding      | vector(384) |           | not null |
Indexes:
    "demo_wiki_emb_store_pkey" PRIMARY KEY, btree (embedding_uuid)
    "demo_wiki_emb_store_id_chunk_seq_key" UNIQUE CONSTRAINT, btree (id, chunk_seq)
Foreign-key constraints:
    "demo_wiki_emb_store_id_fkey" FOREIGN KEY (id) REFERENCES wiki(id) ON DELETE CASCADE
Let us recall the content of those two tables (raw data and vector data). The relation containing the Wikipedia data holds 6.4 million rows, which are broken down into roughly 41 million chunks and turned into vectors:
cybertec=# SELECT count(*) FROM demo_wiki_emb;
  count
----------
 40961670
(1 row)
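For a quick plausibility check of the ratio between articles and chunks, a query along the following lines can be used (an illustrative sketch, not taken from the original import; the column names match the table definition shown above):

cybertec=# SELECT count(DISTINCT id) AS articles,
                  count(*) AS chunks,
                  round(count(*)::numeric / count(DISTINCT id), 2) AS chunks_per_article
           FROM demo_wiki_emb_store;

Given the numbers above, we would expect roughly 6.4 chunks per article (41 million chunks divided by 6.4 million rows).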
As a next step, we want to index those tables. Various parameters are vitally important for achieving good indexing performance. If you want to know more, we have prepared some technical information about tuning index creation in PostgreSQL, which hopefully sheds some light on the topic.
In my case, the machine is large enough to justify a fairly high value for maintenance_work_mem. One question people often ask in this context: What happens in the case of parallel index creation? Is this parameter applied "per CREATE INDEX" or "per process involved in CREATE INDEX"? The answer: It depends on the type of index you are creating. For commonly used index types, such as standard B-trees, it is the overall amount of memory allowed. To be precise, though, each index type can decide on its own, so this might not hold for every rarely used index type out there.
cybertec=# \timing
Timing is on.
cybertec=# SET maintenance_work_mem TO '32 GB';
SET
Time: 6.509 ms
Anyway, let me make sure that PostgreSQL uses multiple processes to speed things up even more:
cybertec=# SET max_parallel_maintenance_workers TO 8;
SET
Time: 0.214 ms
Finally, we can deploy the index. For a start, we will create an "HNSW" index using cosine distance to organize data inside the index:
cybertec=# CREATE INDEX ON demo_wiki_emb_store
           USING hnsw (embedding vector_cosine_ops);
NOTICE:  hnsw graph no longer fits into maintenance_work_mem after 15488425 tuples
DETAIL:  Building will take significantly more time.
HINT:  Increase maintenance_work_mem to speed up builds.
CREATE INDEX
Time: 44216893.145 ms (12:16:56.893)
It is an understatement to say that creating the index takes a while. In fact, it takes more than half a day to deploy this index.
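Build time also depends heavily on the HNSW build parameters. We used pgvector's defaults above; spelled out explicitly, the statement would look like this (m = 16 and ef_construction = 64 are pgvector's documented defaults):

cybertec=# CREATE INDEX ON demo_wiki_emb_store
           USING hnsw (embedding vector_cosine_ops)
           WITH (m = 16, ef_construction = 64);

Larger values improve graph quality (and therefore recall) at the price of an even longer build and a bigger index.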
When the index creation starts, we can see that PostgreSQL automatically fires up a couple of parallel worker processes to build the index. The CPU is kept busy, as various processes operate at full speed:
top - 18:51:56 up  2:09,  1 user,  load average: 2.89, 3.06, 3.00
Tasks: 661 total,   9 running, 652 sleeping,   0 stopped,   0 zombie
%Cpu(s): 24.5 us,  2.5 sy,  0.0 ni, 72.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 128687.4 total,  62662.7 free,  43548.9 used,  57218.3 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.  85138.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  14311 hs        20   0   32.3g 321752 317456 R  99.7   0.2   3:16.71 postgres
  14539 hs        20   0   32.2g 314060 310532 R  99.7   0.2   0:13.76 postgres
  14541 hs        20   0   32.2g 313812 310028 R  99.7   0.2   0:13.75 postgres
  14542 hs        20   0   32.2g 314152 310368 R  99.7   0.2   0:13.75 postgres
  14545 hs        20   0   32.2g 314588 311060 R  99.7   0.2   0:13.74 postgres
  14540 hs        20   0   32.2g 314276 310748 R  99.3   0.2   0:13.76 postgres
  14543 hs        20   0   32.2g 313996 310724 R  99.3   0.2   0:13.75 postgres
  14544 hs        20   0   32.2g 313728 309944 R  99.3   0.2   0:13.73 postgres
...
The question that naturally arises is: What eats up so much CPU time? Well, the profiler reveals the secret:
40.31%  vector.so   [.] vector_negative_inner_product
17.10%  vector.so   [.] HnswSearchLayer
15.87%  postgres    [.] pg_detoast_datum
 2.31%  vector.so   [.] offsethash_insert_hash_internal
It is all about a single function that is called over and over again. But what does it actually do? Here is the code:
VECTOR_TARGET_CLONES static float
VectorInnerProduct(int dim, float *ax, float *bx)
{
    float distance = 0.0;

    /* Auto-vectorized */
    for (int i = 0; i < dim; i++)
        distance += ax[i] * bx[i];

    return distance;
}
Welcome to one of the most important functions in the realm of machine learning and AI. In our implementation, this runs on the CPU. Of course, it could also be done on a GPU, but that is not so easy given the standard interfaces this is built on. However, while there is room for improvement, it at least means that the problem is well understood.
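For completeness: On the SQL level, this distance machinery hides behind pgvector's operators. A minimal illustration (the vector literals are arbitrary):

cybertec=# SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector AS l2_distance,
                  '[1,2,3]'::vector <#> '[4,5,6]'::vector AS neg_inner_product,
                  '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine_distance;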
What does the final index look like? Before we answer that, we will inspect the table size first:
                List of relations
 Schema |        Name         | Type  | ... |  Size   |
--------+---------------------+-------+-----+---------+
 public | demo_wiki_emb       | view  |     | 0 bytes |
 public | demo_wiki_emb_store | table |     | 92 GB   |
 public | wiki                | table |     | 13 GB   |
(3 rows)
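The listing above was trimmed for readability. If you prefer plain SQL over psql's backslash commands, roughly the same numbers can be obtained like this (a sketch; note that pg_table_size excludes indexes):

cybertec=# SELECT relname,
                  pg_size_pretty(pg_table_size(oid)) AS size
           FROM pg_class
           WHERE relname IN ('wiki', 'demo_wiki_emb_store');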
The first observation is that the vector data adds a substantial amount of space consumption. The same is true for the index we have just created:
cybertec=# SELECT pg_size_pretty(
                  pg_total_relation_size('demo_wiki_emb_store_embedding_idx'));
 pg_size_pretty
----------------
 77 GB
(1 row)
Wow, the index is 77 GB, which is almost as large as the entire table (92 GB) containing the underlying data.
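To put the 77 GB into perspective relative to the other indexes on the table, all index sizes can be listed at once (illustrative):

cybertec=# SELECT indexrelid::regclass AS index_name,
                  pg_size_pretty(pg_relation_size(indexrelid)) AS size
           FROM pg_index
           WHERE indrelid = 'demo_wiki_emb_store'::regclass;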
What we have just seen is that building an HNSW index takes a long time. It does provide better query performance, but that speed comes at the cost of a lengthy build.
To compare, we can take a look at IVFFLAT indexes. Here is how we can generate one:
cybertec=# CREATE INDEX ON demo_wiki_emb_store
           USING ivfflat (embedding vector_cosine_ops);
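IVFFLAT partitions the vectors into a fixed number of lists and, at query time, scans only the most promising ones. We relied on the default number of lists above; the pgvector documentation recommends sqrt(rows) for tables beyond a million rows, which would be roughly 6400 in our case. The statement below is a sketch applying that rule of thumb, not what was actually run:

cybertec=# CREATE INDEX ON demo_wiki_emb_store
           USING ivfflat (embedding vector_cosine_ops)
           WITH (lists = 6400);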
The first thing to note is that the index creation shows a totally different CPU usage pattern: overall CPU consumption is much lower. We can clearly see the effect in the end result:
Time: 2314162.345 ms (38:34.162)
Wow, we did all of this in just 38 minutes. A significant fraction of this time was spent on I/O, while the CPU did not spend hour after hour on vector products. However, as stated before: While indexing is faster, this choice has performance implications on the query side, which we will discuss in one of the next posts.
The size of this index is around 63 GB, so the difference in index size (63 GB vs. 77 GB) is not particularly significant.
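As a small preview of the query side: Both index types expose a runtime parameter that trades speed for recall, namely hnsw.ef_search and ivfflat.probes (the values below are arbitrary examples, not tuned recommendations):

cybertec=# SET hnsw.ef_search = 100;    -- size of the HNSW candidate list at query time
cybertec=# SET ivfflat.probes = 10;     -- number of IVFFLAT lists scanned per query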
For those of you out there interested in PostgreSQL and pgvector, the key takeaway is: Not all indexes are created equal, and deciding which type of index to use can impact various aspects of performance, on the creation side as well as on the query side (which will be covered in more detail in one of the next postings).