```
Traceback (most recent call last):
  File "/data16/jiugan/code/DEIM-514/train.py", line 93, in <module>
    main(args)
  File "/data16/jiugan/code/DEIM-514/train.py", line 64, in main
    solver.fit(cfg_str)
  File "/data16/jiugan/code/DEIM-514/engine/solver/det_solver.py", line 86, in fit
    train_stats = train_one_epoch(
  File "/data16/jiugan/code/DEIM-514/engine/solver/det_engine.py", line 112, in train_one_epoch
    outputs = model(samples, targets=targets)  # forward pass
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data16/jiugan/code/DEIM-514/engine/deim/deim.py", line 29, in forward
    x = self.decoder(x, targets)
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data16/jiugan/code/DEIM-514/engine/deim/dfine_decoder.py", line 827, in forward
    self._get_decoder_input(memory, spatial_shapes, denoising_logits, denoising_bbox_unact)
  File "/data16/jiugan/code/DEIM-514/engine/deim/dfine_decoder.py", line 745, in _get_decoder_input
    anchors, valid_mask = self._generate_anchors(spatial_shapes, device=memory.device)
  File "/data16/jiugan/code/DEIM-514/engine/deim/dfine_decoder.py", line 731, in _generate_anchors
    anchors = torch.concat(anchors, dim=1).to(device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1711403408687/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f46a4f80d87 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f46a4f3175f in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f46a60628a8 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f46416909ec in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4641694b08 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f464169823a in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f4641698e79 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd8198 (0x7f469e6eb198 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x94b43 (0x7f46a7094b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126a00 (0x7f46a7126a00 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1711403408687/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f46a4f80d87 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f46a4f3175f in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f46a60628a8 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f46416909ec in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4641694b08 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f464169823a in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f4641698e79 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd8198 (0x7f469e6eb198 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x94b43 (0x7f46a7094b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126a00 (0x7f46a7126a00 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1711403408687/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f46a4f80d87 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdef733 (0x7f46413ef733 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd8198 (0x7f469e6eb198 in /data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #3: <unknown function> + 0x94b43 (0x7f46a7094b43 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126a00 (0x7f46a7126a00 in /lib/x86_64-linux-gnu/libc.so.6)

[2025-05-28 13:49:53,599] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 639748) of binary: /data16/home/zjl/miniconda3/envs/deim/bin/python
Traceback (most recent call last):
  File "/data16/home/zjl/miniconda3/envs/deim/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data16/home/zjl/miniconda3/envs/deim/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2025-05-28_13:49:53
  host       : cv147cv
  rank       : 0 (local_rank: 0)
  exitcode   : -6 (pid: 639748)
  error_file : <N/A>
  traceback  : Signal 6 (SIGABRT) received by PID 639748
=======================================================
```
### Causes and solutions for `CUDA error: device-side assert triggered`
#### Causes
This error is usually triggered by one of the following common issues:
1. **Mismatched data types**
If the input tensor and the target tensor have different data types, for example one is `float32` and the other `int64`, this error can be raised. Some PyTorch operations have strict dtype requirements, so the input and target tensors must use compatible types[^1].
2. **Index out of range**
Accessing an index that does not exist in a tensor can also trigger the assert. For example, during slicing or indexing, an index that exceeds the tensor's actual size makes the CUDA kernel raise an assertion failure[^3]; a minimal reproduction is sketched right after this list.
3. **Uninitialized tensors**
If a tensor is used before being properly initialized, or contains invalid values such as `NaN` or infinity, kernel execution can fail and trigger the assertion[^2].
4. **Input values outside the valid range**
Some loss functions, such as `binary_cross_entropy`, require their inputs to lie in a specific range. For `binary_cross_entropy`, the inputs must be in `[0, 1]`; values outside that interval trigger this error[^4].
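For example, a single out-of-range class index is enough to trigger the assert. The snippet below is a minimal, hypothetical reproduction (it is not taken from the project code) run on a CUDA device:
```python
import torch
import torch.nn.functional as F

# With 3 classes, valid target ids are 0, 1, 2; the value 3 is out of range
# and fires a device-side assert inside the cross-entropy CUDA kernel.
logits = torch.randn(4, 3, device="cuda")
targets = torch.tensor([0, 1, 2, 3], device="cuda")

loss = F.cross_entropy(logits, targets)
# Because CUDA kernels run asynchronously, the error may only surface at the next
# synchronizing call, e.g. when the loss is read back on the CPU:
print(loss.item())  # RuntimeError: CUDA error: device-side assert triggered
```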
---
#### Solutions
Based on the causes above, the following steps help to isolate and fix the problem:
1. **Check data type consistency**
Make sure all input and target tensors use consistent data types. For example, convert an integer tensor to floating point:
```python
tensor = tensor.to(torch.float32)
```
2. **Validate index ranges**
Before indexing or slicing a tensor, check that every index is within bounds. Printing the tensor's shape is a quick way to confirm:
```python
print(tensor.shape)
```
3. **Initialize tensors properly**
Make sure every tensor is initialized before use and contains no invalid values. The following check detects `NaN` or infinite values:
```python
if torch.isnan(tensor).any() or torch.isinf(tensor).any():
    raise ValueError("Tensor contains NaN or Inf values.")
```
4. **Constrain the input value range**
When using the `binary_cross_entropy` loss, make sure the inputs lie in `[0, 1]`, for example by applying a `sigmoid` activation first:
```python
output = torch.sigmoid(output)
loss = criterion(output, target)
```
5. **Enable debug mode**
To pinpoint the error more precisely, set the environment variable `CUDA_LAUNCH_BLOCKING=1` before running the script. This makes CUDA calls execute synchronously, so the reported stack trace points at the operation that actually failed[^3]:
```bash
export CUDA_LAUNCH_BLOCKING=1
python train.py
```
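Since the failing job in the traceback above was launched with `torchrun` under DDP, the same variable can be set for the distributed launch as well. This is a minimal sketch assuming a single-node run; the `--nproc_per_node` value and any other launcher arguments must match your actual setup, which is not shown in the log:
```bash
CUDA_LAUNCH_BLOCKING=1 torchrun --nproc_per_node=1 train.py
```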
---
#### Code example
The following complete example shows how to avoid triggering `CUDA error: device-side assert triggered`:
```python
import torch
import torch.nn as nn
import torch.optim as optim
# Initialize the model, loss function, and optimizer
model = nn.Linear(10, 1).cuda()
criterion = nn.BCELoss().cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Build the input and target tensors
input_tensor = torch.randn(5, 10).cuda()
target_tensor = torch.randint(0, 2, (5, 1)).to(torch.float32).cuda()

# Check that the input tensor contains no invalid values
if torch.isnan(input_tensor).any() or torch.isinf(input_tensor).any():
    raise ValueError("Input tensor contains NaN or Inf values.")

# Forward pass
output = model(input_tensor)
output = torch.sigmoid(output)  # keep the outputs inside [0, 1] for BCELoss

# Compute the loss
loss = criterion(output, target_tensor)

# Backward pass and parameter update
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"Loss: {loss.item()}")
```
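The traceback at the top of this page comes from a detection model (DEIM / D-FINE) that receives `targets` during training, and in such pipelines a frequent trigger for this assert is an invalid annotation, for example a class id outside `[0, num_classes)` or a malformed box. The helper below is only a sketch under assumptions: the name `validate_targets` and the target format (a list of dicts with integer `labels` and normalized `boxes`, as in DETR-style data loaders) are not taken from the project and may need to be adapted:
```python
import torch

def validate_targets(targets, num_classes):
    """Sanity-check DETR-style targets before the forward pass (hypothetical helper)."""
    for i, t in enumerate(targets):
        labels, boxes = t["labels"], t["boxes"]
        # Class ids must lie in [0, num_classes); anything else can fire a device-side
        # assert deep inside the loss or matching kernels.
        if labels.numel() > 0 and (labels.min() < 0 or labels.max() >= num_classes):
            raise ValueError(f"sample {i}: class id out of range [0, {num_classes}): {labels.tolist()}")
        # Boxes must be finite; normalized cxcywh coordinates should stay within [0, 1].
        if boxes.numel() > 0 and (not torch.isfinite(boxes).all() or boxes.min() < 0 or boxes.max() > 1):
            raise ValueError(f"sample {i}: invalid box values: {boxes.tolist()}")
```
Running such a check on every batch, just before `outputs = model(samples, targets=targets)`, turns a cryptic GPU-side assert into a readable Python exception that names the offending sample.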