Quick local deployment of Qwen2.5-VL: fixing flash-attn installation failures and interrupted model weight downloads

陈嘿萌

Last modified 2025-06-28 10:20:45

License: CC 4.0 BY-SA

Categories: Qwen, Errors and Bugs. Tags: Qwen2.5-VL, flash-attn installation, interrupted downloads, LLM installation

First published 2025-06-27 12:14:40

Article link: https://blog.csdn.net/weixin_43312117/article/details/148948104

Readers who are only having trouble with flash-attn can skip straight to Step 4: Download flash-attention offline.

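Preface

Because LLMs have very large parameter counts and network access from mainland China is often slow, deploying Qwen-series models on a server runs into problems such as difficulty caching the model weights. Following the official Qwen README alone can fail in several ways: the model download can stall partway through, and flash-attn and its related libraries are hard to install. This tutorial collects the steps that work around these problems so that Qwen-series models can be deployed quickly.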

Official Qwen repository

https://github.com/QwenLM/Qwen2.5-VL

Step 1: Create a virtual environment

conda create -n Qwen python=3.10.16 
conda activate Qwen
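
Before installing anything, it can help to confirm that the interpreter you are about to use really belongs to the new environment. The following is a minimal sketch and not part of the original steps: the helper name `active_env` is hypothetical, and it assumes conda exports the `CONDA_DEFAULT_ENV` variable when an environment is activated.

```python
import os
import sys


def active_env(environ=os.environ):
    """Summarize which environment and interpreter are currently active."""
    return {
        # conda sets CONDA_DEFAULT_ENV when an environment is activated
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),
        # major.minor version of the running interpreter
        "python": "{}.{}".format(*sys.version_info[:2]),
        # installation prefix of the running interpreter
        "prefix": sys.prefix,
    }


print(active_env())
```

Inside the environment created above, `conda_env` should read `Qwen` and `python` should read `3.10`. If either differs, the later wheel selection in Step 4 may not match.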

Step 2: Configure a faster package mirror

pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/

Step 3: Install the base environment

Create a folder named Qwen, create a requirements.txt file inside it, and paste in the following contents:

torch==2.6.0
torchvision==0.21.0
transformers==4.51.3
accelerate
qwen-vl-utils[decord]

Install the required packages:

pip install -r requirements.txt
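
After the install finishes, it is worth checking that the pinned versions are the ones that actually ended up in the environment, since the flash-attn wheel in the next step is built against a specific PyTorch release. This is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical, and it assumes that a local build suffix such as `+cu124` can be ignored when comparing versions.

```python
from importlib import metadata

# The three packages that requirements.txt pins to an exact version.
PINNED = {"torch": "2.6.0", "torchvision": "0.21.0", "transformers": "4.51.3"}


def check_pins(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for each missing or mismatched package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        # Builds may report e.g. '2.6.0+cu124'; compare only the part before '+'.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


print(check_pins(PINNED) or "all pinned packages match")
```

An empty result means the environment matches requirements.txt; any entry in the returned dictionary names a package to reinstall before continuing.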

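Step 4: Download flash-attention offline

Download page: https://github.com/Dao-AILab/flash-attention/releases/

Find the flash_attn-2.7.4.post1 release that matches your environment. I have CUDA 12.0 installed; if you followed the steps above unchanged, you can pick the wheel named below. Remember to choose the cxx11abiFALSE variant.

Otherwise, choose the offline wheel that matches your own setup. The file name encodes the CUDA major version (cu12), the PyTorch version (torch2.6), the C++11 ABI setting (cxx11abiFALSE), and the CPython version (cp310).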

Version I downloaded: flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

CSDN download: https://download.csdn.net/download/weixin_43312117/91177518?spm=1001.2014.3001.5501

Upload the downloaded wheel to the current user's Download folder on the server, then install it from that folder:

pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
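
Picking the wrong wheel is the most common way this step fails, so the sketch below splits a wheel file name into its tags and shows how the same values could be read from the running environment. It is an illustration under stated assumptions: the regular expression only targets the naming pattern of the wheel used in this article, the names `parse_wheel` and `environment_tags` are hypothetical, and `environment_tags` needs PyTorch installed because it calls `torch.version.cuda` and `torch.compiled_with_cxx11_abi()`.

```python
import re
import sys

WHEEL = "flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"

# Pattern for names shaped like the wheel above (an assumption, not an official spec).
PATTERN = re.compile(
    r"flash_attn-(?P<version>[^+]+)\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)"
    r"cxx11abi(?P<abi>TRUE|FALSE)-(?P<python>cp\d+)-cp\d+-(?P<platform>.+)\.whl"
)


def parse_wheel(filename):
    """Split a flash-attn wheel file name into its version and build tags."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    return match.groupdict()


def environment_tags():
    """Read the comparable tags from the running environment (requires torch)."""
    import torch

    return {
        "cuda": (torch.version.cuda or "").split(".")[0],
        "torch": ".".join(torch.__version__.split(".")[:2]),
        "abi": str(torch.compiled_with_cxx11_abi()).upper(),
        "python": f"cp{sys.version_info[0]}{sys.version_info[1]}",
    }


print(parse_wheel(WHEEL))
```

On the server, comparing `environment_tags()` with the matching keys of `parse_wheel(...)` shows at a glance which tag, if any, disagrees with the wheel you downloaded.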

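Step 5: Download the model weights

Downloads through the Hugging Face mirror have been unstable recently, so use ModelScope to download the weights instead.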

# Download via the ModelScope community
pip install modelscope
# Specify a target path when downloading; the model must later be loaded from this same path
modelscope download --model Qwen/Qwen2.5-VL-3B-Instruct --local_dir ./Qwen2.5-VL-3B-Instruct
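
Since interrupted downloads are the problem this step addresses, a quick completeness check before loading the model can save a confusing error later. The sketch below is an assumption-laden illustration, not part of the ModelScope tooling: it assumes the checkpoint uses the usual safetensors layout (a `config.json`, plus either a single `model.safetensors` or a `model.safetensors.index.json` that lists the shard files), and it only checks that files exist, not that their contents are intact. The helper name `missing_weight_files` is hypothetical.

```python
import json
from pathlib import Path


def missing_weight_files(model_dir):
    """Return the names of expected files that are absent from model_dir."""
    model_dir = Path(model_dir)
    expected = {"config.json"}
    index = model_dir / "model.safetensors.index.json"
    if index.is_file():
        # The index maps every tensor name to the shard file that stores it.
        weight_map = json.loads(index.read_text(encoding="utf-8"))["weight_map"]
        expected.update(weight_map.values())
    elif not (model_dir / "model.safetensors").is_file():
        # Neither an index nor a single-file checkpoint is present.
        expected.add("model.safetensors.index.json")
    return sorted(name for name in expected if not (model_dir / name).is_file())


print(missing_weight_files("./Qwen2.5-VL-3B-Instruct"))
```

An empty list means every expected file is present. If anything is listed, rerunning the same `modelscope download` command is the simplest next step.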

The download is now complete.

Step 6: Run the code, updating the model weights path to match

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "./Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("./Qwen2.5-VL-3B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
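
The script above loads the model with default attention and leaves the flash_attention_2 variant commented out. With the wheel from Step 4 installed, that variant can be enabled against the local weights. The following is a hedged sketch rather than the official loading code: the helper names `pick_attn_implementation` and `load_model` are hypothetical, and falling back to `"sdpa"` when flash-attn is not importable is my own choice for illustration.

```python
import importlib.util


def pick_attn_implementation(module_name="flash_attn"):
    """Use flash_attention_2 only when the given module can be imported."""
    if importlib.util.find_spec(module_name) is not None:
        return "flash_attention_2"
    return "sdpa"  # PyTorch's built-in scaled dot-product attention


def load_model(model_dir="./Qwen2.5-VL-3B-Instruct"):
    """Load the local weights in bfloat16 with the selected attention backend."""
    import torch
    from transformers import Qwen2_5_VLForConditionalGeneration

    return Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_dir,
        torch_dtype=torch.bfloat16,  # flash-attn supports only fp16 and bf16
        attn_implementation=pick_attn_implementation(),
        device_map="auto",
    )


print(pick_attn_implementation())
```

Replacing the `from_pretrained` call in the script with `model = load_model()` keeps the rest of the inference code unchanged.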