1. Problem description
While training a project that uses Minkowski Engine, the following error was raised:
File "train_singleGPU.py", line 88, in main
loss.backward()
File "/home/haoxin/anaconda3/envs/gapr/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/haoxin/anaconda3/envs/gapr/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
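Training another project that uses Minkowski Engine, egonn, produced the same error, which made me suspect that Minkowski Engine itself was the problem.
Current configuration:
RTX 3060
CUDA 11.1
torch 1.9.1+cu111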
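CUDA kernels are launched asynchronously, so the Python line shown in a traceback like the one above is not necessarily where the illegal memory access happened. A minimal sketch for narrowing it down, assuming the training script can be edited: set the `CUDA_LAUNCH_BLOCKING` environment variable before any CUDA work starts. The helper name `enable_blocking_launches` is hypothetical and not part of the original project.

```python
import os


def enable_blocking_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 so every kernel launch runs synchronously.

    `env` defaults to the real process environment; a plain dict can be
    passed instead (useful for testing).
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this at the very top of the training script, before the first CUDA
# call. Placing it before `import torch` is the safest choice.
enable_blocking_launches()
```

With blocking launches the run is slower, but the error is reported at the call that actually failed, which helps confirm whether the fault is inside a Minkowski Engine kernel.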
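2. Solution
Both codebases were written against CUDA 10.2, and reports online say Minkowski Engine has a bug under CUDA 11.1, so I decided to change versions. Since CUDA 10.2 predates the RTX 3060's Ampere architecture, moving up to a newer 11.x release is the practical option.
1) Switch CUDA to 11.3. When the installer asks which components to install, deselect the driver so the existing NVIDIA driver is left untouched.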
wget https://developer.download.nvidia.com/compute/cuda/11.3.0/local_installers/cuda_11.3.0_465.19.01_linux.run
sudo sh cuda_11.3.0_465.19.01_linux.run
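After the installer finishes, it is worth confirming that the new toolkit is the one that will be used. A minimal sketch, assuming the toolkit landed in `/usr/local/cuda-11.3` (the same path used for `CUDA_HOME` later in this post); the function names are hypothetical helpers:

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Return the 'X.Y' release found in `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


def installed_toolkit_release(nvcc="/usr/local/cuda-11.3/bin/nvcc"):
    """Run nvcc and parse its release; return None if nvcc cannot be run."""
    try:
        result = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    # Prints None when nvcc is missing at the assumed path.
    print("CUDA toolkit release:", installed_toolkit_release())
```

If this prints anything other than 11.3, the installation path or the `nvcc` being picked up probably differs from what the build step expects.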
2) Reinstall PyTorch with a build that matches CUDA 11.3
pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
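The `+cu113` suffix in the wheel version encodes the CUDA release the wheel was built for. A small sketch that decodes this tag and, if PyTorch is importable, prints it next to `torch.version.cuda` so a mismatch is easy to spot; `cuda_from_wheel_tag` is a hypothetical helper name:

```python
import re


def cuda_from_wheel_tag(torch_version):
    """Map a version such as '1.10.0+cu113' to '11.3'.

    Returns None when the version string carries no '+cuXYZ' tag.
    """
    match = re.search(r"\+cu(\d+)$", torch_version)
    if not match:
        return None
    digits = match.group(1)
    # The last digit is the minor version: cu113 -> 11.3, cu102 -> 10.2.
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print("torch version:     ", torch.__version__)
        print("CUDA from wheel tag:", cuda_from_wheel_tag(torch.__version__))
        print("torch.version.cuda: ", torch.version.cuda)
```

For the install command above, both values should be 11.3. The old environment from the problem description would report 11.1.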
3) Rebuild Minkowski Engine
cd MinkowskiEngine/
export TORCH_CUDA_ARCH_LIST="8.0"
export CXX=c++
export CUDA_HOME=/usr/local/cuda-11.3
python setup.py install --blas=openblas --force_cuda
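The value of `TORCH_CUDA_ARCH_LIST` selects which GPU architectures the extension is compiled for. The RTX 3060 has compute capability 8.6; code built for 8.0 also runs on it because binaries are compatible within the same major version, which is consistent with the build above working. A minimal sketch for looking up the value that matches the local GPU exactly, assuming PyTorch is already installed; `arch_list_for` is a hypothetical helper:

```python
def arch_list_for(capability):
    """Format a (major, minor) compute capability as a TORCH_CUDA_ARCH_LIST entry."""
    major, minor = capability
    return f"{major}.{minor}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            # get_device_capability returns a (major, minor) tuple for the device.
            capability = torch.cuda.get_device_capability(0)
            print(
                torch.cuda.get_device_name(0),
                "-> TORCH_CUDA_ARCH_LIST =",
                arch_list_for(capability),
            )
        else:
            print("No CUDA device is visible to PyTorch")
```

Whether to export the exact value or keep "8.0" is a judgment call; the exact value lets the compiler target the card directly, while "8.0" also covers other Ampere GPUs.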
After that, the training code runs without the error.