[Solved] RuntimeError: CUDA error: device-side assert triggered

virobotics

Published 2024-11-25 16:19:04

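License: CC 4.0 BY-SA

Category: Odd Problems and Bug Fixes. Tags: labview, artificial intelligence, YOLO, deep learning model training

This is an original article by the blogger and may not be reproduced without the blogger's permission.

Article link: https://blog.csdn.net/virobotics/article/details/144026529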

🏡 Blog home: virobotics (仪酷智能), a blogger covering LabVIEW deep learning and artificial intelligence
🎄 Column: "Odd Problems and Bug Fixes"
📑 Featured article: LabVIEW AI and Deep Learning Guide
🍻 This article is an original work by virobotics (仪酷智能)

🫧 The Problem

When training a model with YOLOv8_seg, you may run into an error like the following:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

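If other datasets train without trouble and only one particular dataset triggers this error, the problem most likely lies in that dataset. This article walks through several ways to track down and fix it.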
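Because CUDA kernels run asynchronously, the Python traceback that accompanies this error often points at an unrelated line. As the error message itself suggests, setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the traceback is far more likely to land on the operation that actually failed. A minimal sketch, assuming the variable is set at the very top of the training script before anything initializes CUDA (the helper name `enable_blocking_launches` is made up for illustration):

```python
import os


def enable_blocking_launches():
    """Ask CUDA to run kernels synchronously so errors surface at the failing call."""
    # Must happen before the first CUDA call; the safest place is before `import torch`.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_launches()
# ... import torch / your training library and start training after this point ...
```

Synchronous launches slow training down, so remove the setting once the failing operation has been found. Exporting the variable in the shell before launching the script should have the same effect.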

⚙️ Solutions

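1. Check the dataset labels

Problem: a class index in the labels falls outside the range defined by the model's num_classes.

Solution:

- Check every label file in the dataset and make sure each class index lies in [0, num_classes - 1].
- Check the configuration file and make sure num_classes matches the actual number of classes in the dataset.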
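To see why an out-of-range class index matters, it can help to reproduce the mistake on the CPU, where PyTorch raises a readable exception instead of a device-side assert. The sketch below is only an illustration: `F.one_hot` stands in for whatever indexing operation the training code performs and is not necessarily the operation that fails inside YOLOv8, and `cpu_error_for_label` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F


def cpu_error_for_label(label, num_classes):
    """Return the CPU error message for a class index, or None if it is valid."""
    try:
        F.one_hot(torch.tensor([label]), num_classes=num_classes)
    except (RuntimeError, IndexError) as exc:
        return str(exc)
    return None


print(cpu_error_for_label(7, 8))  # valid index: prints None
print(cpu_error_for_label(8, 8))  # out of range: prints an error message
```

On the GPU the same kind of mistake is caught inside the kernel, which is why it shows up as the much less informative "device-side assert triggered".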

Use a Python script to quickly find invalid labels:

import os
import sys

# Number of classes
num_classes = 8

# Directory containing the label files
label_dir = "labels"

# Make sure the directory exists
if not os.path.isdir(label_dir):
    print(f"Directory '{label_dir}' does not exist.")
    sys.exit(1)

# Walk through every .txt file in the directory
for file_name in os.listdir(label_dir):
    if file_name.endswith(".txt"):
        file_path = os.path.join(label_dir, file_name)
        print(f"Processing file: {file_path}")

        # Open the file and read it line by line
        with open(file_path, "r") as file:
            for line_number, line in enumerate(file, start=1):
                line = line.strip()  # Strip leading/trailing whitespace
                if line:  # Skip empty lines
                    fields = line.split()  # Split on whitespace
                    try:
                        # Read the first column as a float, then convert to int
                        first_field = int(float(fields[0]))
                        # Report class indices outside [0, num_classes - 1]
                        if first_field < 0 or first_field >= num_classes:
                            print(f"{file_name}, Line {line_number}: {line}")
                    except (ValueError, OverflowError):
                        # Non-numeric, NaN, or Inf value in the class column
                        print(f"Invalid line in {file_name}, Line {line_number}: {line}")

Also visualize a random subset of samples to confirm that the annotations are correct and line up with the images.
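For the visual check, each label line first has to be converted back to pixel coordinates. A minimal sketch, assuming the usual YOLO segmentation label layout (`class x1 y1 x2 y2 ...` with coordinates normalized to [0, 1]); `parse_seg_line` is a hypothetical helper, not part of any library:

```python
def parse_seg_line(line, img_w, img_h):
    """Convert one YOLO segmentation label line to (class_id, pixel polygon)."""
    fields = line.split()
    class_id = int(float(fields[0]))
    coords = [float(v) for v in fields[1:]]
    if len(coords) < 6 or len(coords) % 2:
        raise ValueError(f"expected at least 3 x/y pairs, got {len(coords)} values")
    # Scale normalized x by the image width and normalized y by the image height.
    polygon = [(round(x * img_w), round(y * img_h))
               for x, y in zip(coords[0::2], coords[1::2])]
    return class_id, polygon


print(parse_seg_line("0 0.5 0.5 1.0 0.0 0.0 1.0", img_w=100, img_h=200))
# (0, [(50, 100), (100, 0), (0, 200)])
```

The resulting polygon can then be drawn over the image, for example with OpenCV's `cv2.polylines`, for a handful of files picked with `random.sample`.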

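2. Turn off data augmentation

Data augmentation can introduce errors or incompatible data formats during processing.

Problem: an augmentation step produces invalid labels or an incompatible format.
Solution:
- Temporarily turn off all data augmentation and train on the raw data.
- If the error disappears, re-enable the augmentations one at a time to find the specific operation that causes it.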
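The "one at a time" search can be organized as a small generator of training overrides. This is a sketch under assumptions: the setting names (`mosaic`, `mixup`, `fliplr`, `degrees`) follow the hyperparameter names used by Ultralytics as far as I know, the values are made up, and `one_at_a_time` is a hypothetical helper. Check the names against the configuration of your installed version.

```python
def one_at_a_time(augment_settings):
    """Yield (name, overrides) pairs with exactly one augmentation left enabled."""
    for enabled in augment_settings:
        overrides = {name: 0.0 for name in augment_settings}
        overrides[enabled] = augment_settings[enabled]
        yield enabled, overrides


# Assumed names and values: verify them against your own training configuration.
my_settings = {"mosaic": 1.0, "mixup": 0.1, "fliplr": 0.5, "degrees": 10.0}

all_off = {name: 0.0 for name in my_settings}
print("baseline run:", all_off)
for name, overrides in one_at_a_time(my_settings):
    print(f"run with only {name}:", overrides)
    # e.g. pass `overrides` as keyword arguments to your training call
```

If the baseline run with everything set to 0.0 trains cleanly, the first single-augmentation run that reproduces the assert points at the operation to investigate.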

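3. Check the data for NaN or Inf

The dataset may contain invalid values (such as NaN or Inf), which can cause training to fail.

Problem: the training data or labels contain NaN or Inf values.

Solution:

Check whether the data loader yields invalid data:

import torch

# `dataloader` is your own DataLoader; a batch may be a tensor or a dict of tensors
for i, batch in enumerate(dataloader):
    tensors = batch.values() if isinstance(batch, dict) else [batch]
    for t in tensors:
        if torch.is_tensor(t) and not torch.isfinite(t).all():
            print(f"Invalid data detected in batch {i}!")

If invalid data is detected, trace it back to its source and fix it.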
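One way to trace invalid values back to their source is to scan the label files directly, before they ever reach the data loader. A minimal sketch, assuming YOLO-style label lines whose coordinates are normalized to [0, 1]; `find_bad_coords` is a hypothetical helper:

```python
import math


def find_bad_coords(lines):
    """Return (line_number, reason) for label lines with unusable coordinates."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            coords = [float(v) for v in fields[1:]]
        except ValueError:
            problems.append((line_number, "non-numeric value"))
            continue
        if not all(math.isfinite(c) for c in coords):
            problems.append((line_number, "NaN or Inf"))
        elif any(c < 0.0 or c > 1.0 for c in coords):
            problems.append((line_number, "outside [0, 1]"))
    return problems


sample = [
    "0 0.1 0.2 0.3 0.4 0.5 0.6",
    "1 nan 0.2 0.3 0.4 0.5 0.6",
    "2 0.1 1.7 0.3 0.4 0.5 0.6",
]
print(find_bad_coords(sample))  # [(2, 'NaN or Inf'), (3, 'outside [0, 1]')]
```

In practice the lines would come from each .txt file in the label directory, so that the report names the exact file and line to fix.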

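5. Check that the data file paths are correct

The training data or label paths may contain invalid entries.

Problem: some image or label file paths are invalid, so data loading fails.

Solution:

Check that every file path in the dataset is correct and that the file exists:

import os

def check_files(file_list):
    """Print every path in file_list that does not exist on disk."""
    for file in file_list:
        if not os.path.exists(file):
            print(f"File missing: {file}")

check_files(image_list)  # Replace with your list of image paths
check_files(label_list)  # Replace with your list of label paths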
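Beyond missing files, it is also worth checking that images and labels pair up. The sketch below assumes the common layout in which an image and its label share the same file stem (for example `a.jpg` and `a.txt`); `find_unpaired` is a hypothetical helper:

```python
from pathlib import Path


def find_unpaired(image_names, label_names):
    """Compare file stems and return (images without labels, labels without images)."""
    image_stems = {Path(name).stem for name in image_names}
    label_stems = {Path(name).stem for name in label_names if name.endswith(".txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


# In practice the names would come from os.listdir() on your own directories.
images = ["a.jpg", "b.jpg", "c.png"]
labels = ["a.txt", "b.txt", "d.txt"]
print(find_unpaired(images, labels))  # (['c'], ['d'])
```

A label without an image usually indicates a renamed or deleted file; whether an image without a label is a problem depends on whether unlabeled background images are intended in your dataset.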

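6. Check whether the class distribution is imbalanced

A severe imbalance in the number of samples per class can cause training problems.

Problem: some classes have too few samples, so targets of those classes are frequently missing during training.

Solution:

Count the samples of each class (here, the pixels per class in the mask images) and check whether the distribution is reasonable:

import os
import cv2
import numpy as np

num_classes = 8  # Number of classes, same as in the label check
masks_path = "masks"  # Placeholder: replace with your mask image directory
label_counts = [0] * num_classes
for mask_file in os.listdir(masks_path):
    mask = cv2.imread(os.path.join(masks_path, mask_file), cv2.IMREAD_UNCHANGED)
    if mask is None:  # Not an image, or the file could not be read
        continue
    unique, counts = np.unique(mask, return_counts=True)
    for u, c in zip(unique, counts):
        if u < num_classes:
            label_counts[u] += int(c)
        else:  # A pixel value outside the class range is itself a label error
            print(f"{mask_file}: unexpected class value {u}")
print("Class distribution:", label_counts)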
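If your dataset stores annotations as YOLO-style .txt files rather than mask images, the same statistic can be gathered per instance instead of per pixel. A minimal sketch, assuming one object per line with the class ID in the first column; `count_instances` is a hypothetical helper, and the label check from section 1 should be run first so that every class column is a valid number:

```python
from collections import Counter
from pathlib import Path


def count_instances(label_dir):
    """Count annotated instances per class ID across all .txt label files."""
    counts = Counter()
    for label_file in sorted(Path(label_dir).glob("*.txt")):
        for line in label_file.read_text().splitlines():
            fields = line.split()
            if fields:  # skip blank lines
                counts[int(float(fields[0]))] += 1
    return counts


if Path("labels").is_dir():  # "labels" is the same placeholder directory used in section 1
    for class_id, n in sorted(count_instances("labels").items()):
        print(f"class {class_id}: {n} instances")
```

A class ID that never appears, or one that appears only a handful of times, is a good candidate for closer inspection.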