Welcome to my CSDN: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/150495514
Disclaimer: This article is based on personal knowledge and public materials, intended for academic exchange only; discussion is welcome, but reposting is not permitted.
vLLM is a high-performance inference and serving framework for LLMs in production. Its PagedAttention mechanism reduces GPU memory usage to roughly one third, allowing a single GPU to process long-sequence requests in parallel; combined with continuous batching and CUDA graph optimization, it raises LLM throughput by 10-24x while keeping latency at the millisecond level. It also supports tensor parallelism, KV-cache quantization, and pluggable LoRA loading, making it a first-choice tool for building scalable, low-cost LLM applications.
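As a quick primer, a minimal offline-inference sketch with vLLM's Python API might look like this (the sampling values are illustrative, not tied to the deployment below):

# Minimal vLLM offline-inference sketch; sampling values are illustrative.
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across GPUs, matching `vllm serve --tensor-parallel-size`.
llm = LLM(model="modelscope_models/Qwen/Qwen3-32B", tensor_parallel_size=2, trust_remote_code=True)

params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=256)
outputs = llm.generate(["Give me a short introduction to large language models."], params)
for output in outputs:
    print(output.outputs[0].text)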
References:
- Docs: https://docs.vllm.ai/en/stable/
- Qwen - vLLM: https://qwen.readthedocs.io/en/latest/deployment/vllm.html
1. GPU Environment
nvidia-smi (NVIDIA System Management Interface) reports driver version 550.90.07 and CUDA version 12.4:
Sun Aug 17 19:07:21 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000001:08:00.0 Off | Off |
| 44% 28C P8 32W / 450W | 2MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000001:0C:00.0 Off | Off |
| 44% 28C P8 31W / 450W | 2MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 4090 On | 00000001:8C:00.0 Off | Off |
| 45% 28C P8 30W / 450W | 2MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 4090 On | 00000001:8D:00.0 Off | Off |
| 30% 27C P8 23W / 450W | 2MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
nvcc (NVIDIA CUDA Compiler) reports CUDA toolkit release 12.2:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
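Note that the driver-side CUDA version (12.4, from nvidia-smi) only needs to be greater than or equal to the toolkit version used for compilation (12.2, from nvcc), so this combination works as-is.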
The vLLM Docker image version is vllm:0.8.1-1.0.
2. Deploying Models
For the language model Qwen3-32B, start the service on 2 x RTX 4090 (49 GB):
vllm serve modelscope_models/Qwen/Qwen3-32B \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 \
  --port 9002 \
  --served-model-name Qwen/Qwen3-32B
Test:
curl "http://localhost:9002/v1/chat/completions" -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen3-32B",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 32768
}'
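Since vLLM exposes an OpenAI-compatible API, the same request can also be sent from Python with the openai client; a minimal sketch (the vLLM-specific top_k parameter goes through extra_body):

# Query the vLLM OpenAI-compatible endpoint (minimal sketch).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9002/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    extra_body={"top_k": 20},  # vLLM-specific sampling parameter
)
print(response.choices[0].message.content)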
For the vision-language model, note that 2 x RTX 4090 (49 GB) cannot start the service; at least 4 x RTX 4090 (49 GB) are required, with --tensor-parallel-size set to match:
# Configure environment variables for better performance
export TORCH_CUDA_ARCH_LIST="8.9"  # RTX 4090 (Ada Lovelace) is compute capability 8.9, not 9.0
export USE_FAST=True
export VLLM_IMAGE_FETCH_TIMEOUT=20  # raise the image fetch timeout to 20 s
vllm serve modelscope_models/Qwen/Qwen2.5-VL-32B-Instruct/ \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 \
  --port 9002 \
  --served-model-name Qwen/Qwen2.5-VL-32B-Instruct \
  --limit-mm-per-prompt "image=4"
The --limit-mm-per-prompt "image=4" option allows up to 4 images per request; by default only a single image is accepted, and multi-image requests fail with: {"object":"error","message":"At most 1 image(s) may be provided in one request.","type":"BadRequestError","param":null,"code":400}. See the issue "Bug: ValueError: At most 1 image(s) may be provided in one request." for reference.
Relevant logs:
INFO 08-17 19:14:01 [kv_cache_utils.py:716] GPU KV cache size: 410,320 tokens
INFO 08-17 19:14:01 [kv_cache_utils.py:720] Maximum concurrency for 128,000 tokens per request: 3.21x
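The 3.21x concurrency figure is simply the KV cache capacity divided by the per-request context length: 410,320 / 128,000 ≈ 3.21.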
Test:
curl "http://localhost:9002/v1/chat/completions" -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2.5-VL-32B-Instruct",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/dog.png"
}
},
{
"type": "text",
"text": "请简要描述图片是什么内容?"
}
]
}
],
"temperature": 0.7,
"stream": false
}'
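With --limit-mm-per-prompt "image=4" in place, up to 4 images can be sent in one request; a minimal Python sketch (the second image URL is a hypothetical placeholder):

# Multi-image request against the Qwen2.5-VL endpoint (minimal sketch).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9002/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/dog.png"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},  # hypothetical placeholder URL
            {"type": "text", "text": "Briefly compare the two images."},
        ],
    }],
    temperature=0.7,
)
print(response.choices[0].message.content)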
VLM leaderboard: https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
3. Port Mapping
After the vLLM model service starts, its port should be mapped to the root route /. Note that after an SGLang model service starts, the port should instead be mapped to the endpoint /v1/chat/completions; the two differ slightly in this respect.
4. Load Test Report
The test image is 901 x 668 ≈ 602k pixels; load-testing the image-understanding capability consumes tokens as follows:
{
"prompt_tokens": 2034,
"total_tokens": 2417,
"completion_tokens": 383
}
For 2 instances, each on 4 x RTX 4090 (49 GB), the load-test performance is as follows:
🎯 LLMPressTest automated load test complete! Maximum capacity analysis:
====================================================================================================
📊 All test results:
Concurrency   Completed   Success rate   QPS    Avg. response time
------------------------------------------------------------
4             20          100.0 %        0.1    20.63 s
8             32          100.0 %        0.5    15.86 s
12            48          100.0 %        0.5    17.27 s
16            64          100.0 %        0.6    18.65 s
🏆 Best performance configuration:
Concurrency: 16
QPS: 0.55
Success rate: 100.00%
Avg. response time: 18.65 s
💡 Recommended production configuration:
Recommended concurrency: 11 (70% of maximum concurrency)
Expected QPS: 0.39
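The report above comes from the author's LLMPressTest tool; a minimal sketch of a similar concurrency sweep, assuming the endpoint above and using asyncio with the openai client (concurrency levels, batch count, and prompt are illustrative), might look like this:

# Minimal concurrency-sweep load test sketch (not the actual LLMPressTest tool).
import asyncio
import time

from openai import AsyncOpenAI

async def one_request(client: AsyncOpenAI) -> float:
    # Send one chat request and return its latency in seconds.
    start = time.perf_counter()
    await client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-32B-Instruct",
        messages=[{"role": "user", "content": "Describe a dog in one sentence."}],
        max_tokens=256,
    )
    return time.perf_counter() - start

async def sweep(concurrency: int, batches: int = 5) -> None:
    # Run `batches` rounds of `concurrency` parallel requests and report QPS.
    client = AsyncOpenAI(base_url="http://localhost:9002/v1", api_key="EMPTY")
    start = time.perf_counter()
    latencies: list[float] = []
    for _ in range(batches):
        latencies += await asyncio.gather(*(one_request(client) for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"concurrency={concurrency} qps={len(latencies) / elapsed:.2f} "
          f"avg_latency={sum(latencies) / len(latencies):.2f}s")

for c in (4, 8, 12, 16):
    asyncio.run(sweep(c))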