A model trained and saved with PyTorch 1.7 cannot be loaded in the older 1.4 release: frame #63: <unknown function> + 0x1db3e0 (0x55ba98ddd3e0 in /data/user

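While using PyTorch, I found that a model trained with version 1.7 fails to load in the older 1.4 release and raises an error. Since PyTorch 1.6, `torch.save` writes a zipfile-based format by default, which older releases cannot read. The fix is to save the model in 1.7 with `torch.save(..., _use_new_zipfile_serialization=False)`, which writes the older format, and then load that file in 1.4 with `torch.load`. (`torch.hub.load_state_dict_from_url` downloads and loads weights; it does not save them.)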
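As a rough illustration of that workaround, the sketch below first checks whether a checkpoint is in the zip-based format, then re-saves it in the legacy format. It assumes the failure really comes from the format change and that the checkpoint holds only tensors and plain Python containers (for example a `state_dict`). The helper names `uses_zip_format` and `resave_legacy` and the output path are made up for this example. `resave_legacy` has to run in the newer (1.7) environment.

```python
import zipfile


def uses_zip_format(path):
    """Return True if the file is a zip archive, the default torch.save format since 1.6."""
    return zipfile.is_zipfile(path)


def resave_legacy(src, dst):
    """Run under the newer PyTorch: reload a checkpoint and write it in the pre-1.6 format."""
    import torch  # imported lazily so the format check above works without PyTorch

    checkpoint = torch.load(src, map_location="cpu")
    # _use_new_zipfile_serialization=False asks torch.save for the old, non-zip format.
    torch.save(checkpoint, dst, _use_new_zipfile_serialization=False)


# Hypothetical usage in the 1.7 environment (the output path is a placeholder):
# if uses_zip_format("/home/user1/model_best_b.pth.tar"):
#     resave_legacy("/home/user1/model_best_b.pth.tar", "/home/user1/model_best_b_legacy.pth.tar")
```

The re-saved file can then be opened in the 1.4 environment with a plain `torch.load`. If the checkpoint pickles whole model objects rather than a `state_dict`, the class definitions must also be importable in the older environment.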


A model trained and saved with the newer PyTorch 1.7 cannot be loaded in the older 1.4 release. The error is:

torch.load('/home/user1/model_best_b.pth.tar')
Traceback (most recent call last):
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-13d633918c2f>", l