1. High-Performance Batch Inference
from vllm import LLM, SamplingParams

# Shard the 70B model across 4 GPUs via tensor parallelism
llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=4)

prompts = [
    "Explain the principle of qubits in quantum computing.",
    "Implement the quicksort algorithm in Python.",
    "Translate the following sentence into English: 人工智能正在改变世界。",
]

# Low temperature for mostly deterministic answers; cap generation at 256 tokens
sampling_params = SamplingParams(temperature=0.3, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)

for i, output in enumerate(outputs):
    print(f"Prompt {i+1}:\n{output.outputs[0].text}\n{'-' * 50}")
2. Production-Grade API Server Deployment
# Launch vLLM's OpenAI-compatible server (matches the /v1/completions client below)
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-72B-Chat \
    --tensor-parallel-size 8 \
    --max-num-batched-tokens 16000 \
    --port 8000 \
    --host 0.0.0.0 \
    --enforce-eager
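Once the server is up, a quick smoke test confirms the endpoint before pointing production clients at it. A minimal sketch, assuming the server above is reachable on localhost:8000; `/v1/models` is part of vLLM's OpenAI-compatible API:

```python
import requests

# List the served models; confirms the endpoint is up and the model loaded
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
```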
3. Enterprise-Grade API Calls (Behind a Load Balancer)
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient 502/503 responses (typical of an overloaded load balancer)
# with exponential backoff; POST must be listed explicitly because urllib3
# does not retry it by default
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[502, 503],
    allowed_methods=frozenset({"POST"}),
)
session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry_strategy))

payload = {
    "model": "Qwen/Qwen1.5-72B-Chat",  # required by the OpenAI-compatible endpoint
    "prompt": "Generate an architecture plan for an enterprise data platform.",
    "temperature": 0.5,
    "top_p": 0.9,
    "max_tokens": 1024,
    "stream": True,
}

response = session.post(
    "http://api.your-company.com/v1/completions",
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,
    stream=True,  # required so requests yields chunks as they arrive
)

if response.status_code == 200:
    for chunk in response.iter_lines():
        if chunk:
            print(chunk.decode("utf-8"))
else:
    print(f"API Error: {response.text}")
Key Configuration Notes
| Parameter | Recommended Production Value | Purpose |
| --- | --- | --- |
| tensor_parallel_size | number of GPUs | Shards the model across GPUs for parallel inference |
| max-num-batched-tokens | ≥ 10000 | The core knob for raising throughput |
| enforce-eager | enabled | Disables CUDA graph capture, saving GPU memory and reducing OOM risk (at some speed cost) |
| stream | True | Streaming output cuts perceived end-to-end latency (time to first token) |
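For offline use, the same knobs map onto the `LLM` constructor, which forwards extra keyword arguments to vLLM's engine arguments; a minimal sketch, assuming the section 1 deployment:

```python
from vllm import LLM

# Same production settings as the server flags above, applied offline:
# 4-way tensor parallelism, a large batching budget, and eager execution
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
    max_num_batched_tokens=16000,  # forwarded to the engine arguments
    enforce_eager=True,
)
```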