[SIGMOD'25] PQCache: Product Quantization-based KVCache for Long Context LLM Inference (Paper)
Code for the SIGMOD 2025 paper "PQCache: Product Quantization-based KVCache for Long Context LLM Inference".
As the field of Large Language Models (LLMs) continues to evolve, the context length in inference is steadily growing. Key-Value Cache (KVCache), a crucial component in LLM inference, has now become the primary memory bottleneck due to limited GPU memory. Current methods selectively determine suitable keys and values for self-attention computation in LLMs to address the issue. However, they either fall short in maintaining model quality or result in high serving latency. Drawing inspiration from advanced embedding retrieval techniques used in the database community, we consider the storage and searching of KVCache as a typical embedding retrieval problem. We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency. During the prefilling phase, we apply PQ to tokens' keys for each LLM layer and head. During the autoregressive decoding phase, for each newly generated token, we first identify important tokens through Maximum Inner-Product Search (MIPS) using PQ codes and centroids, then fetch the corresponding key-value pairs for self-attention computation. Through meticulous design of overlapping and caching, we minimize any additional computation and communication overhead during both phases. Extensive experiments show that PQCache achieves both effectiveness and efficiency. It maintains model quality even when only 1/5 of the tokens are involved in attention, while attaining acceptable system latency.
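For intuition, the retrieval step can be illustrated with a short, self-contained sketch. The snippet below is not the repository's implementation; it only shows how PQ codes and centroids support approximate inner-product scoring so that the most relevant cached tokens can be identified before their exact key-value pairs are fetched. The shapes, parameter values, helper names, and the use of scikit-learn's KMeans are assumptions made purely for illustration.
# Illustrative sketch of PQ-based approximate MIPS over cached keys (not the repository code).
import numpy as np
from sklearn.cluster import KMeans

def pq_train(keys, n_subvec=2, n_centroid=256):
    # Split each key into sub-vectors and learn one codebook (set of centroids) per sub-space.
    n_tokens, dim = keys.shape
    sub_dim = dim // n_subvec
    codebooks, codes = [], []
    for s in range(n_subvec):
        sub = keys[:, s * sub_dim:(s + 1) * sub_dim]
        km = KMeans(n_clusters=n_centroid, n_init=1).fit(sub)
        codebooks.append(km.cluster_centers_)       # (n_centroid, sub_dim)
        codes.append(km.labels_.astype(np.int32))   # (n_tokens,)
    return codebooks, np.stack(codes, axis=1)       # codes: (n_tokens, n_subvec)

def pq_topk(query, codebooks, codes, k):
    # Approximate MIPS: build a per-subspace inner-product table, then score every token via lookups.
    n_subvec = codes.shape[1]
    sub_dim = query.shape[0] // n_subvec
    scores = np.zeros(codes.shape[0])
    for s in range(n_subvec):
        table = codebooks[s] @ query[s * sub_dim:(s + 1) * sub_dim]  # (n_centroid,)
        scores += table[codes[:, s]]
    return np.argsort(-scores)[:k]  # indices of the (approximately) highest-scoring tokens

# Toy usage: 1024 cached keys of dimension 128; keep roughly 1/5 of the tokens.
keys = np.random.randn(1024, 128).astype(np.float32)
codebooks, codes = pq_train(keys)
top_tokens = pq_topk(np.random.randn(128).astype(np.float32), codebooks, codes, k=205)
In PQCache, this scoring happens per layer and per head during decoding, and only the selected tokens' key-value pairs are then fetched for exact self-attention.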
Our Python environment management is based on miniconda3. After installing miniconda3, execute the commands below to set up the environment.
conda create -n pqcache python=3.10
conda activate pqcache
pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121
# We install the "flash-attn" package from a downloaded wheel
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.8+cu122torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -r requirements.txt
The current implementation of PQCache requires a GPU with Compute Capability of at least 8.0 (Ampere or newer), which is also what the flash-attn 2 wheel above targets.
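As an optional sanity check (not part of the repository's scripts), you can confirm that the installed PyTorch build sees a compatible GPU and that the flash-attn wheel imports cleanly:
# Optional environment sanity check.
import torch
import flash_attn

assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
major, minor = torch.cuda.get_device_capability()
print(f"torch {torch.__version__}, flash-attn {flash_attn.__version__}, compute capability {major}.{minor}")
assert (major, minor) >= (8, 0), "flash-attn 2 needs Compute Capability >= 8.0 (Ampere or newer)"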
Currently supported models include meta-llama/Llama-3.1-8B-Instruct and mistralai/Mistral-7B-Instruct-v0.2.
- First, compile the lfu module used for the GPU cache:
cd vq_method/retrieval_based/lfu/
mkdir build; cd build; cmake ..; make
cd ../../../../
- Then download the LongBench datasets to ./data/.
# You could simply use this link to download the data.
wget https://huggingface.co/datasets/zai-org/LongBench/resolve/main/data.zip
(Our experiments are based on LongBench v1.)
- [Optional] If you want to use local model checkpoints, please modify the paths listed in
config/model2path.json. Note: downloading the Llama-3.1 model requires Hugging Face authentication; please refer to the linked instructions.
{
"mistral-7b-Instruct-32k": "[MISTRAL_MODEL_PATH]",
"llama-3.1": "[LLAMA_MODEL_PATH]"
}
- Run the script:
# Please modify certain environment variables in the script according to your environment.
# Such as CUDA_VISIBLE_DEVICES
bash run_llama.sh
In the default configuration, we use 48 CPU cores for the clustering computations. You can modify MAX_CPU_IN_USE in run_llama.sh to match your runtime environment.
We recommend setting MAX_CPU_IN_USE to a multiple of the product of the model's n_kv_head (e.g., 8 for Llama-3.1) and SUBVEC (e.g., 2); for these defaults the product is 8 × 2 = 16, so values such as 16, 32, or 48 work well.
- Run the evaluation script after completing the generation of all samples:
python eval.py --model llama-3.1 --dataset narrativeqa --exp_name default
# python eval.py --model llama-3.1 --dataset [DATASET_NAMES,] --exp_name [EXP_NAME]
The evaluation results are located in pred/llama-3.1/narrativeqa/default (corresponding to the results in Table 2 of the paper).
You could also evaluate on multiple tasks by executing:
# For example, evaluate on [narrativeqa, qasper, trec].
python eval.py --model llama-3.1 --dataset narrativeqa qasper trec --exp_name default
# Gather the results of multiple tasks
python parse_result.py \
--model llama-3.1 \
--result_path pred \
--exp_name default \
    --output_path default.json
Our code is mainly located in the vq_method directory.
- retrieval_based
  - lfu: code for the GPU cache (see the LFU sketch after this list).
  - cache_manager.py: code for cache management.
  - multi_core_compressor_v2.py: code for multi-CPU-core compression.
  - pq_search.py: code for the PQ compressor.
- mistral_patch.py: code for replacing the original attention in Mistral.
- llama31_patch.py: code for replacing the original attention in Llama-3.1.
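Since the lfu module ships as a compiled extension, here is a minimal Python sketch of the LFU (least-frequently-used) eviction policy for blocks kept in the GPU cache. The class and method names are illustrative assumptions, not the extension's actual interface.
# Toy LFU cache: evicts the least-frequently-used entry, breaking ties by insertion order.
from collections import defaultdict

class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.values = {}                  # key -> cached block (e.g., a KV tensor)
        self.freq = {}                    # key -> access count
        self.buckets = defaultdict(dict)  # access count -> keys, kept in insertion order
        self.min_freq = 0

    def _touch(self, key):
        # Move the key from its current frequency bucket to the next one.
        f = self.freq[key]
        del self.buckets[f][key]
        if not self.buckets[f] and self.min_freq == f:
            self.min_freq = f + 1
        self.freq[key] = f + 1
        self.buckets[f + 1][key] = None

    def get(self, key):
        if key not in self.values:
            return None
        self._touch(key)
        return self.values[key]

    def put(self, key, value):
        if key in self.values:
            self.values[key] = value
            self._touch(key)
            return
        if len(self.values) >= self.capacity:
            # Evict the oldest key within the lowest-frequency bucket.
            evict = next(iter(self.buckets[self.min_freq]))
            del self.buckets[self.min_freq][evict]
            del self.values[evict], self.freq[evict]
        self.values[key] = value
        self.freq[key] = 1
        self.buckets[1][key] = None
        self.min_freq = 1

# Usage sketch: cache = LFUCache(capacity=1024); cache.put(token_id, kv_block); cache.get(token_id)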
During the development and implementation of PQCache, we learned a lot from, and borrowed some code from, the following projects.
LongBench
H2O
InfLLM
SPARQ
Hetu
If you find this work useful, please cite our paper.
