Fixing CUDA out of memory when training large models

天堂树4711

Last modified 2024-06-14 23:47:59

License: CC 4.0 BY-SA

Category: installation and troubleshooting notes. Tags: artificial intelligence, llama

First published 2024-06-14 23:28:51

Article link: https://blog.csdn.net/weixin_52147110/article/details/139691431

While training the Llama-3-8B model I ran into the following error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 6.00 GiB total capacity; 55.79 GiB already allocated; 0 bytes free; 55.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

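In the article "win10+pytorch1.9+yolox模型训练_win10 pytorch 训练表情训练" (CSDN blog) I saw that the fix there was to change the numeric precision used during training, so I tried the same kind of change and it worked.

My reasoning went as follows: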

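I first looked at how SFTTrainer handles precision (sft_trainer.py in the official repository, line 253); the check is quoted below. It shows that peft_module_casting_to_bf16 is called, casting the PEFT modules to bf16 (bfloat16), only when args.bf16 is set, the model is loaded in 4-bit, and it is not sharded QLoRA. My model met those conditions, so I first set bf16=False on the PEFT side, but the error persisted. Checking step by step, I found that bf16 had also been set at the very start, when the model was loaded; changing that to float16 fixed it.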

For the differences between bf16, fp16 and fp32, see "bf16 和fp16 ，fp32的区别以及相互转换逻辑_fp16 bf16" (CSDN blog).

if (
        args is not None
        and args.bf16
        and getattr(model, "is_loaded_in_4bit", False)
        and not is_sharded_qlora
):
    peft_module_casting_to_bf16(model)
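
To make the bf16 / fp16 / fp32 comparison mentioned above a little more concrete, here is a small pure-Python sketch that derives the largest finite value and the machine epsilon of each format from its bit layout. The helper name float_format_stats is made up for illustration; the bit widths are the standard ones (fp16: 5 exponent + 10 mantissa bits, bf16: 8 + 7, fp32: 8 + 23).

```python
# Illustrative sketch: range and precision of fp16, bf16 and fp32,
# computed from the exponent/mantissa bit counts of each format.

def float_format_stats(exponent_bits: int, mantissa_bits: int) -> dict:
    """Return the largest finite value and the machine epsilon of an IEEE-style format."""
    bias = 2 ** (exponent_bits - 1) - 1      # the largest usable exponent equals the bias
    max_value = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    epsilon = 2.0 ** -mantissa_bits          # gap between 1.0 and the next representable value
    return {"max": max_value, "eps": epsilon}


# (exponent bits, mantissa bits) for each format
FORMATS = {
    "fp16": (5, 10),
    "bf16": (8, 7),
    "fp32": (8, 23),
}

stats = {name: float_format_stats(e, m) for name, (e, m) in FORMATS.items()}

for name, s in stats.items():
    print(f"{name}: max = {s['max']:.4e}, eps = {s['eps']:.4e}")
```

The numbers show the trade-off: bf16 keeps the exponent range of fp32 but has fewer mantissa bits, while fp16 has a much smaller range and finer precision.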

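The concrete change:

The model is configured as shown below. The quantization and compression parameters are set up front in bnb_config, which originally had bnb_4bit_compute_dtype = torch.bfloat16; changing that to torch.float16 is all that is needed.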

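import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Define the quantization config first so it exists when the model is loaded.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # changed from torch.bfloat16
    bnb_4bit_use_double_quant=True,
)

# model_config is assumed to be defined earlier in the training script.
model = AutoModelForCausalLM.from_pretrained(
    'LLM-Research/Meta-Llama-3-8B-Instruct',
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
)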

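I believe this setting alone already makes training use float16 instead of bf16, but I also set bf16=False in the training arguments (training_arguments). With both changes in place, the CUDA out of memory error was resolved.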
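
The two precision settings above (bnb_4bit_compute_dtype and the bf16 flag in the training arguments) could also be decided in one place instead of being edited by hand. The following is only a sketch and not what was done in this post: the helper names detect_bf16_support and precision_flags are hypothetical, and enabling fp16 when bf16 is off is an assumption (the post only sets bf16=False). torch.cuda.is_bf16_supported() is an existing PyTorch call.

```python
def detect_bf16_support() -> bool:
    """Ask PyTorch whether the current GPU supports bf16; False if it cannot be determined."""
    try:
        import torch  # imported lazily so precision_flags works without PyTorch
    except ImportError:
        return False
    return torch.cuda.is_available() and torch.cuda.is_bf16_supported()


def precision_flags(bf16_supported: bool) -> dict:
    """Hypothetical helper: keep the compute dtype name and the trainer flags consistent."""
    if bf16_supported:
        return {"compute_dtype": "bfloat16", "bf16": True, "fp16": False}
    # Assumption: fall back to fp16 mixed precision when bf16 is not available.
    return {"compute_dtype": "float16", "bf16": False, "fp16": True}


flags = precision_flags(detect_bf16_support())
print(flags)
```

With PyTorch installed, getattr(torch, flags["compute_dtype"]) would give the dtype to pass as bnb_4bit_compute_dtype, and flags["bf16"] and flags["fp16"] the values for the training arguments.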

As for the max_split_size_mb adjustment suggested in the error message, I did not try it; it may also solve the problem.
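
For reference, the max_split_size_mb hint from the error message is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch, untested here just as in the post; the value 128 is an arbitrary assumption, and the variable has to be set before PyTorch makes its first CUDA allocation:

```python
import os

# Assumption: 128 (MiB) is only an example value; tune it for your workload.
# Set this before the first CUDA allocation (simplest: before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

As the error message itself says, this setting targets fragmentation, i.e. the case where reserved memory is much larger than allocated memory.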

One more link, a Hugging Face discussion with the same problem and the same fix: meta-llama/Meta-Llama-3-8B · torch.cuda.OutOfMemoryError: CUDA out of memory. (huggingface.co)