vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

vLLM is an open-source library that uses PagedAttention to dramatically improve the inference speed and memory efficiency of large language models (LLMs). It delivers up to 24x the throughput of HuggingFace Transformers without requiring any changes to the model architecture. PagedAttention optimizes memory usage through virtual-memory-style management and block-wise attention, making LLM serving affordable even for small teams.


Resources:

paper: https://arxiv.org/pdf/2309.06180.pdf

repo: https://github.com/vllm-project/vllm (a high-throughput and memory-efficient inference and serving engine for LLMs)

highlights blog by authors: "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention" on the vLLM Blog

Blog Notes (with Details from the Paper)

LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.

vLLM has been developed at UC Berkeley and deployed at Chatbot Arena and Vicuna Demo for the past two months. It is the core technology that makes LLM serving affordable even for a small research team like LMSYS with limited compute resources. Try out vLLM now with a single command at our GitHub repository.

Beyond State-of-the-art Performance

We compare the throughput of vLLM with HuggingFace Transformers (HF), the most popular LLM library, and HuggingFace Text Generation Inference (TGI), the previous state of the art. We evaluate in two settings: LLaMA-7B on an NVIDIA A10G GPU and LLaMA-13B on an NVIDIA A100 GPU (40GB). We sample the requests’ input/output lengths from the ShareGPT dataset. In our experiments, vLLM achieves up to 24x higher throughput compared to HF and up to 3.5x higher throughput than TGI.

Serving throughput when each request asks for one output completion. vLLM achieves 14x - 24x higher throughput than HF and 2.2x - 2.5x higher throughput than TGI.

Serving throughput when each request asks for three parallel output completions. vLLM achieves 8.5x - 15x higher throughput than HF and 3.3x - 3.5x higher throughput than TGI.

The Secret Sauce: PagedAttention

In vLLM, we identify that the performance of LLM serving is bottlenecked by memory. In the autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate next tokens. These cached key and value tensors are often referred to as KV cache. The KV cache is

  • Large: Takes up to 1.7GB for a single sequence in LLaMA-13B (see the worked estimate below).
  • Dynamic: Its size depends on the sequence length, which is highly variable and unpredictable.

As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste 60% – 80% of memory due to fragmentation and over-reservation.
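
As a quick sanity check on the "up to 1.7GB" figure, here is a back-of-the-envelope estimate in Python. The configuration numbers (40 layers, 40 attention heads, head dimension 128, fp16 storage, 2048-token context) are LLaMA-13B's published hyperparameters used here as assumptions, not values taken from this post.

# Back-of-the-envelope KV cache size for LLaMA-13B (assumed config).
num_layers = 40        # transformer layers
num_heads = 40         # attention heads
head_dim = 128         # dimension per head
bytes_per_elem = 2     # fp16
max_seq_len = 2048     # maximum context length

# Each token stores one key and one value vector per head, per layer.
bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_per_elem
print(bytes_per_token / 1024)                   # ~800 KiB per token
print(bytes_per_token * max_seq_len / 1e9)      # ~1.68 GB for a full-length sequence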

To address this problem, we introduce PagedAttention, an attention algorithm inspired by the classic idea of virtual memory and paging in operating systems. Unlike the traditional attention algorithms, PagedAttention allows storing continuous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens. During the attention computation, the PagedAttention kernel identifies and fetches these blocks efficiently.
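
To make the block-wise computation concrete, here is a rough NumPy sketch (not the actual CUDA kernel) of attention for a single head whose keys and values live in non-contiguous fixed-size blocks. The block size of 16 and the shapes are illustrative assumptions.

import numpy as np

BLOCK_SIZE = 16        # tokens per KV block (illustrative)
head_dim = 128

def paged_attention(query, block_table, key_blocks, value_blocks, seq_len):
    """Attention for one head whose K/V are stored in scattered blocks.

    query:        (head_dim,) query vector for the current token
    block_table:  physical block ids for this sequence, in logical order
    key_blocks:   dict block_id -> (BLOCK_SIZE, head_dim) array
    value_blocks: dict block_id -> (BLOCK_SIZE, head_dim) array
    seq_len:      number of valid tokens so far
    """
    # Gather the logically contiguous K/V from physically scattered blocks.
    keys = np.concatenate([key_blocks[b] for b in block_table])[:seq_len]
    values = np.concatenate([value_blocks[b] for b in block_table])[:seq_len]

    scores = keys @ query / np.sqrt(head_dim)   # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over past tokens
    return weights @ values                     # (head_dim,) attention output

The real kernel never materializes the concatenated keys and values; it walks the block table and accumulates the softmax block by block, but the input/output relationship is the same.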

PagedAttention: KV Cache are partitioned into blocks. Blocks do not need to be contiguous in memory space.

Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous logical blocks of a sequence are mapped to non-contiguous physical blocks via a block table. The physical blocks are allocated on demand as new tokens are generated.
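
Carrying the analogy into code, a minimal block-table manager might look like the sketch below. The class and method names are invented for illustration; they are not vLLM's internal API.

BLOCK_SIZE = 16  # tokens per block (illustrative)

class BlockManager:
    """Maps each sequence's logical blocks to physical blocks, allocated on demand."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one new token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:               # last block is full (or none yet)
            table.append(self.free_blocks.pop())   # allocate a new physical block
        self.seq_lens[seq_id] = length + 1
        # The new token's KV goes into physical block table[-1],
        # at slot (length % BLOCK_SIZE) within that block.
        return table[-1], length % BLOCK_SIZE

Appending a token therefore costs at most one new physical block, and only when the sequence's last block is already full.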

Example generation process for a request with PagedAttention.

In PagedAttention, memory waste only happens in the last block of a sequence. In practice, this results in near-optimal memory usage, with a mere waste of under 4%. This boost in memory efficiency proves highly beneficial: It allows the system to batch more sequences together, increase GPU utilization, and thereby significantly increase the throughput as shown in the performance result above.
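
To put numbers on this: with a block size of 16 tokens (vLLM's default block size, assumed here), a sequence wastes at most 15 KV slots regardless of its length, whereas reserving space for the maximum sequence length up front can waste thousands of slots per request.

BLOCK_SIZE = 16          # assumed block size
seq_len = 1000           # example sequence length
max_seq_len = 2048       # what a pre-reserving system might allocate

allocated = -(-seq_len // BLOCK_SIZE) * BLOCK_SIZE   # ceil(seq_len / BLOCK_SIZE) blocks
paged_waste = allocated - seq_len                    # at most BLOCK_SIZE - 1 slots
reserved_waste = max_seq_len - seq_len               # waste with static over-reservation
print(paged_waste, reserved_waste)                   # 8 vs. 1048 wasted token slots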

PagedAttention has another key advantage: efficient memory sharing. For example, in parallel sampling, multiple output sequences are generated from the same prompt. In this case, the computation and memory for the prompt can be shared between the output sequences.

Example of parallel sampling.

PagedAttention naturally enables memory sharing through its block table. Similar to how processes share physical pages, different sequences in PagedAttention can share blocks by mapping their logical blocks to the same physical block. To ensure safe sharing, PagedAttention keeps track of the reference counts of the physical blocks and implements the Copy-on-Write mechanism. ==> easier and more efficient sharing, courtesy of block-wise storage

Copy-On-Write (COW) is a resource-management technique used in computer programming to efficiently implement a “duplicate” or “copy” operation on modifiable resources (most commonly memory pages, storage sectors, files, and data structures). It is sometimes referred to as implicit sharing or shadowing.

In virtual memory management, Copy-On-Write finds its main use in operating systems, sharing the physical memory of computers running multiple processes, in the implementation of the fork() system call. Typically, the new process does not modify any memory and immediately executes a new program, replacing the address space entirely; it would waste processor time and memory to copy all of the old process’s memory during the fork only to immediately discard the copy. Copy-On-Write can be implemented efficiently using the page table by marking certain pages of memory as read-only and keeping a count of the number of references to the page. When data is written to these pages, the operating-system kernel intercepts the write attempt and allocates a new physical page, initialized with the copy-on-write data, although the allocation can be skipped if there is only one reference. The kernel then updates the page table with the new (writable) page, decrements the number of references, and performs the write. The new allocation ensures that a change in the memory of one process is not visible in another’s.
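
Below is a minimal sketch of how block-level reference counting and copy-on-write could be layered on a block table like the one sketched earlier; the names are hypothetical and not vLLM's internals. Forking a sequence (for example, to generate several samples from one prompt) copies only the block table and bumps reference counts; a physical copy happens only when a shared block is about to be written.

class CowBlockManager:
    """Block table with reference counting and copy-on-write, for illustration."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.ref_counts = {}     # physical block id -> number of sequences using it
        self.block_tables = {}   # seq_id -> list of physical block ids

    def allocate(self, seq_id, num_blocks):
        """Give seq_id its own fresh blocks (e.g., for its prompt)."""
        table = [self.free_blocks.pop() for _ in range(num_blocks)]
        self.block_tables[seq_id] = table
        for block in table:
            self.ref_counts[block] = 1

    def fork(self, parent_id, child_id):
        """Share the parent's blocks with the child (e.g., a shared prompt)."""
        table = list(self.block_tables[parent_id])   # copy the table, not the blocks
        self.block_tables[child_id] = table
        for block in table:
            self.ref_counts[block] += 1

    def write(self, seq_id, logical_idx):
        """Return a block seq_id may write to, copying it first if it is shared."""
        block = self.block_tables[seq_id][logical_idx]
        if self.ref_counts[block] > 1:               # shared: copy on write
            new_block = self.free_blocks.pop()
            self.ref_counts[block] -= 1
            self.ref_counts[new_block] = 1
            # ...copy the KV contents of `block` into `new_block` here...
            self.block_tables[seq_id][logical_idx] = new_block
            block = new_block
        return block

For example:

mgr = CowBlockManager(num_physical_blocks=8)
mgr.allocate("parent", 2)     # the prompt occupies 2 blocks
mgr.fork("parent", "child")   # a parallel sample shares them (ref counts go to 2)
mgr.write("child", 1)         # child diverges: block 1 is copied, ref counts drop to 1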

Example generation process for a request that samples multiple outputs.

PagedAttention’s memory sharing greatly reduces the memory overhead of complex sampling algorithms, such as parallel sampling and beam search, cutting their memory usage by up to 55%. This can translate into up to 2.2x improvement in throughput, which makes such sampling methods practical in LLM services.

PagedAttention is the core technology behind vLLM, our LLM inference and serving engine that supports a variety of models with high performance and an easy-to-use interface. For more technical details about vLLM and PagedAttention, check out our GitHub repo and stay tuned for our paper.

The Silent Hero Behind LMSYS Vicuna and Chatbot Arena

This April, LMSYS developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in Chatbot Arena for millions of users. Initially, LMSYS FastChat adopted an HF Transformers-based serving backend to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM teams worked together and soon developed the FastChat-vLLM integration to use vLLM as the new backend in order to support the growing demands (up to 5x more traffic). In an early internal micro-benchmark by LMSYS, the vLLM serving backend achieved up to 30x higher throughput than the initial HF backend.

Since mid-April, the most popular models, such as Vicuna, Koala, and LLaMA, have all been successfully served using the FastChat-vLLM integration. With FastChat as the multi-model chat serving frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPUs to serve Vicuna to millions of users with high throughput and low latency. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION's OpenAssistant, and Stability AI's StableLM. Support for more models is being developed and forthcoming.

Requests served by the FastChat-vLLM integration in Chatbot Arena from April to May. Indeed, more than half of the requests to Chatbot Arena use vLLM as the inference backend.

This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K requests daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.

Get started with vLLM

Install vLLM with the following command (check out our installation guide for more):

$ pip install vllm

vLLM can be used for both offline inference and online serving. To use vLLM for offline inference, you can import vLLM and use the LLM class in your Python scripts:

from vllm import LLM

prompts = ["Hello, my name is", "The capital of France is"]  # Sample prompts.
llm = LLM(model="lmsys/vicuna-7b-v1.3")  # Create an LLM.
outputs = llm.generate(prompts)  # Generate texts from the prompts.
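
llm.generate returns one result object per prompt. A short loop over the results, assuming vLLM's RequestOutput objects (which expose prompt and outputs[0].text), prints the completions:

for output in outputs:  # Print the generated text for each prompt.
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")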

To use vLLM for online serving, you can start an OpenAI API-compatible server via:

$ python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-7b-v1.3

You can query the server with the same format as OpenAI API:

$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "lmsys/vicuna-7b-v1.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
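
Since the server speaks the OpenAI API, you can also query it from Python instead of curl. The snippet below is a sketch assuming the openai Python package (v1 or later); the api_key value is a placeholder because the local server does not verify it.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="lmsys/vicuna-7b-v1.3",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)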

For more ways to use vLLM, please check out the quickstart guide.
