In the previous post, we imported a fairly large data set containing Wikipedia data, which we downloaded using pgai. However, importing all this data is not enough; we also have to keep an eye on efficiency. Therefore, it is important to understand that indexing is the key to success.
In general, pgvector provides us with two types of indexes that we can use:
- hnsw
- ivfflat
The core questions are: When do we need which type of index, and how can we create those indexes in the first place?
In general, HNSW (a multi-layer graph) is faster at query time ("SELECT") but a LOT slower during index creation, as we will see a little later. If index creation time is not an issue, HNSW is definitely a good choice; if creation time does matter, the decision is not so clear.
To show how indexing can be done, we are using the following table, which has been created by the vectorization process described in the previous post:
cybertec=# \d demo_wiki_emb_store
                     Table "public.demo_wiki_emb_store"
     Column     |    Type     | Collation | Nullable |      Default
----------------+-------------+-----------+----------+-------------------
 embedding_uuid | uuid        |           | not null | gen_random_uuid()
 id             | text        |           | not null |
 chunk_seq      | integer     |           | not null |
 chunk          | text        |           | not null |
 embedding      | vector(384) |           | not null |
Indexes:
    "demo_wiki_emb_store_pkey" PRIMARY KEY, btree (embedding_uuid)
    "demo_wiki_emb_store_id_chunk_seq_key" UNIQUE CONSTRAINT, btree (id, chunk_seq)
Foreign-key constraints:
    "demo_wiki_emb_store_id_fkey" FOREIGN KEY (id) REFERENCES wiki(id) ON DELETE CASCADE
Let us recall the content of those two tables (raw data and vector data). The relation containing the Wikipedia data holds 6.4 million rows, which are broken down into roughly 41 million chunks and turned into vectors:
cybertec=# SELECT count(*) FROM demo_wiki_emb;
  count
----------
 40961670
(1 row)
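For a quick plausibility check of the ratio between articles and chunks, a query along the following lines can be used (an illustrative sketch, not taken from the original import; the column names match the table definition shown above):

cybertec=# SELECT count(DISTINCT id) AS articles,
                  count(*) AS chunks,
                  round(count(*)::numeric / count(DISTINCT id), 2) AS chunks_per_article
           FROM demo_wiki_emb_store;

Given the numbers above, we would expect roughly 6.4 chunks per article (41 million chunks divided by 6.4 million rows).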
As a next step, we want to index those tables. Various parameters are vitally important for achieving good indexing performance. If you want to know more, we have prepared some technical information about tuning index creation in PostgreSQL, which hopefully sheds some light on the topic.
In my case, the machine is large enough to justify a fairly high value for maintenance_work_mem. One question people often ask in this context: What happens in the case of parallel index creation? Is this parameter applied "per CREATE INDEX" or "per process involved in CREATE INDEX"? The answer: It depends on the type of index you are creating. For commonly used index types, such as standard B-trees, it is the overall amount of memory allowed. To be precise, though, each index type can decide on its own, so this might not hold for every rarely used index type out there.
cybertec=# \timing
Timing is on.
cybertec=# SET maintenance_work_mem TO '32 GB';
SET
Time: 6.509 ms
Anyway, let me make sure that PostgreSQL uses multiple processes to speed things up even more:
cybertec=# SET max_parallel_maintenance_workers TO 8;
SET
Time: 0.214 ms
Finally, we can deploy the index. For a start, we will create an "HNSW" index using cosine distance to organize data inside the index:
cybertec=# CREATE INDEX ON demo_wiki_emb_store
           USING hnsw (embedding vector_cosine_ops);
NOTICE:  hnsw graph no longer fits into maintenance_work_mem after 15488425 tuples
DETAIL:  Building will take significantly more time.
HINT:  Increase maintenance_work_mem to speed up builds.
CREATE INDEX
Time: 44216893.145 ms (12:16:56.893)
It is an understatement to say that creating the index takes a while. In fact, it takes more than half a day to deploy this index.
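Build time also depends heavily on the HNSW build parameters. We used pgvector's defaults above; spelled out explicitly, the statement would look like this (m = 16 and ef_construction = 64 are pgvector's documented defaults):

cybertec=# CREATE INDEX ON demo_wiki_emb_store
           USING hnsw (embedding vector_cosine_ops)
           WITH (m = 16, ef_construction = 64);

Larger values improve graph quality (and therefore recall) at the price of an even longer build and a bigger index.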
When the index creation starts, we can see that PostgreSQL automatically fires up a couple of parallel worker processes to build the index. The CPU is kept busy, as various processes operate at full speed:
top - 18:51:56 up  2:09,  1 user,  load average: 2.89, 3.06, 3.00
Tasks: 661 total,   9 running, 652 sleeping,   0 stopped,   0 zombie
%Cpu(s): 24.5 us,  2.5 sy,  0.0 ni, 72.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 128687.4 total,  62662.7 free,  43548.9 used,  57218.3 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.  85138.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  14311 hs        20   0   32.3g 321752 317456 R  99.7   0.2   3:16.71 postgres
  14539 hs        20   0   32.2g 314060 310532 R  99.7   0.2   0:13.76 postgres
  14541 hs        20   0   32.2g 313812 310028 R  99.7   0.2   0:13.75 postgres
  14542 hs        20   0   32.2g 314152 310368 R  99.7   0.2   0:13.75 postgres
  14545 hs        20   0   32.2g 314588 311060 R  99.7   0.2   0:13.74 postgres
  14540 hs        20   0   32.2g 314276 310748 R  99.3   0.2   0:13.76 postgres
  14543 hs        20   0   32.2g 313996 310724 R  99.3   0.2   0:13.75 postgres
  14544 hs        20   0   32.2g 313728 309944 R  99.3   0.2   0:13.73 postgres
...
The question that naturally arises is: What eats up so much CPU time? Well, the profiler reveals the secret:
40.31%  vector.so   [.] vector_negative_inner_product
17.10%  vector.so   [.] HnswSearchLayer
15.87%  postgres    [.] pg_detoast_datum
 2.31%  vector.so   [.] offsethash_insert_hash_internal
It is all about a single function that is called over and over again. But what does it actually do? Here is the code:
VECTOR_TARGET_CLONES static float
VectorInnerProduct(int dim, float *ax, float *bx)
{
    float distance = 0.0;

    /* Auto-vectorized */
    for (int i = 0; i < dim; i++)
        distance += ax[i] * bx[i];

    return distance;
}
Welcome to one of the most important functions in the realm of machine learning and AI. In our implementation, this runs on the CPU. Of course, it could also be done on a GPU, but that is not so easy given the standard interfaces this is built on. However, while there is room for improvement, it at least means that the problem is well understood.
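For completeness: On the SQL level, this distance machinery hides behind pgvector's operators. A minimal illustration (the vector literals are arbitrary):

cybertec=# SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector AS l2_distance,
                  '[1,2,3]'::vector <#> '[4,5,6]'::vector AS neg_inner_product,
                  '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine_distance;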
What does the final index look like? Before we answer that, we will inspect the table size first:
                List of relations
 Schema |        Name         | Type  | ... |  Size   |
--------+---------------------+-------+-----+---------+
 public | demo_wiki_emb       | view  |     | 0 bytes |
 public | demo_wiki_emb_store | table |     | 92 GB   |
 public | wiki                | table |     | 13 GB   |
(3 rows)
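The listing above was trimmed for readability. If you prefer plain SQL over psql's backslash commands, roughly the same numbers can be obtained like this (a sketch; note that pg_table_size excludes indexes):

cybertec=# SELECT relname,
                  pg_size_pretty(pg_table_size(oid)) AS size
           FROM pg_class
           WHERE relname IN ('wiki', 'demo_wiki_emb_store');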
The first observation is that the vector data adds a substantial amount of space consumption. The same is true for the index we have just created:
cybertec=# SELECT pg_size_pretty(
                  pg_total_relation_size('demo_wiki_emb_store_embedding_idx'));
 pg_size_pretty
----------------
 77 GB
(1 row)
Wow, the index is 77 GB, which is almost as large as the entire table (92 GB) containing the underlying data.
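To put the 77 GB into perspective relative to the other indexes on the table, all index sizes can be listed at once (illustrative):

cybertec=# SELECT indexrelid::regclass AS index_name,
                  pg_size_pretty(pg_relation_size(indexrelid)) AS size
           FROM pg_index
           WHERE indrelid = 'demo_wiki_emb_store'::regclass;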
What we have just seen is that building an HNSW index takes a long time. It does provide better query performance, but that speed comes at the cost of a lengthy build.
To compare, we can take a look at IVFFLAT indexes. Here is how we can generate one:
cybertec=# CREATE INDEX ON demo_wiki_emb_store
           USING ivfflat (embedding vector_cosine_ops);
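IVFFLAT partitions the vectors into a fixed number of lists and, at query time, scans only the most promising ones. We relied on the default number of lists above; the pgvector documentation recommends sqrt(rows) for tables beyond a million rows, which would be roughly 6400 in our case. The statement below is a sketch applying that rule of thumb, not what was actually run:

cybertec=# CREATE INDEX ON demo_wiki_emb_store
           USING ivfflat (embedding vector_cosine_ops)
           WITH (lists = 6400);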
The first thing to note is that the index creation shows a totally different CPU usage pattern: overall CPU consumption is much lower. We can clearly see the effect in the end result:
Time: 2314162.345 ms (38:34.162)
Wow, we did all of this in just 38 minutes. A significant fraction of this time was spent on I/O, while the CPU did not spend hour after hour on vector products. However, as stated before: While indexing is faster, this choice has performance implications on the query side, which we will discuss in one of the next posts.
The size of this index is around 63 GB, so the difference in index size (63 GB vs. 77 GB) is not particularly significant.
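As a small preview of the query side: Both index types expose a runtime parameter that trades speed for recall, namely hnsw.ef_search and ivfflat.probes (the values below are arbitrary examples, not tuned recommendations):

cybertec=# SET hnsw.ef_search = 100;    -- size of the HNSW candidate list at query time
cybertec=# SET ivfflat.probes = 10;     -- number of IVFFLAT lists scanned per query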
For those of you out there interested in PostgreSQL and pgvector, the key takeaway is: Not all indexes are created equal, and deciding which type of index to use can impact various aspects of performance, on the creation side as well as on the query side (which will be covered in more detail in one of the next postings).