```python
import torch
```
Python is an object-oriented, interpreted programming language[^1]. torch is an open-source machine learning library for deep-learning tasks, built on Python[^2]. To use the torch library, import it as follows:
```python
import torch
```
Once the import succeeds, all of the deep-learning functions and tools provided by the torch library are available.
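A quick sanity check after the import (a minimal sketch using only standard torch calls) might look like this:
```python
import torch

# Report the installed version and whether a CUDA-capable GPU is visible.
print(torch.__version__)
print(torch.cuda.is_available())

# Create a small tensor and run a basic operation to confirm the install works.
x = torch.rand(2, 3)
print(x @ x.T)
```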
Related question
```
2025-06-26 16:12:03,164 - mmdet - INFO - Saving checkpoint at 12 epochs
[ ] 0/81, elapsed: 0s, ETA:
/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
[ ] 1/81, 0.3 task/s, elapsed: 3s, ETA: 273s
/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/core/bbox/coders/fut_nms_free_coder.py:78: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.post_center_range = torch.tensor(
/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/core/bbox/coders/map_nms_free_coder.py:82: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.post_center_range = torch.tensor(
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 81/81, 4.5 task/s, elapsed: 18s, ETA: 0s
Traceback (most recent call last):
  File "tools/train.py", line 266, in <module>
    main()
  File "tools/train.py", line 255, in main
    custom_train_model(
  File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/VAD/apis/train.py", line 21, in custom_train_model
    custom_train_detector(
  File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/VAD/apis/mmdet_train.py", line 194, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
  File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/core/evaluation/eval_hooks.py", line 88, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 361, in evaluate
    eval_res = self.dataloader.dataset.evaluate(
  File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/datasets/nuscenes_vad_dataset.py", line 1781, in evaluate
    all_metric_dict[key] += results[i]['metric_results'][key]
KeyError: 0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2522198) of binary: /home/wangbaihui/anaconda3/envs/vad/bin/python
/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE
**********************************************************************
Child process 2522198 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
          tools/train.py FAILED
=======================================
Root Cause:
[0]:
  time: 2025-06-26_16:12:28
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 2522198)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************
```
`KeyError: 0` and `ChildFailedError` during PyTorch distributed training are usually caused by an incorrectly configured distributed environment or by exceptions raised while loading or evaluating data. In the traceback above, `KeyError: 0` is raised inside `nuscenes_vad_dataset.py`'s `evaluate()` while accumulating `metric_results`, which suggests that the results handed to the evaluation hook do not contain the entries the dataset's `evaluate()` expects. The following checks help diagnose and fix these problems:
### KeyError: 'RANK' and Missing Environment Variables
In some cases the launch script does not set the environment variables that distributed training requires (such as `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`), and the program fails at runtime. PyTorch distributed training relies on these variables to manage the process group and its communication. When launching with `torch.distributed.launch` or `torchrun`, make sure the launch command includes the necessary arguments and that the script reads these variables correctly[^3].
For example, a correct launch command looks like this:
```bash
python -m torch.distributed.launch --nproc_per_node=2 train.py
```
Here `--nproc_per_node` specifies the number of GPUs on a single machine, and the launcher automatically assigns a `LOCAL_RANK` environment variable to each process. Inside the script, parse the `--local_rank` argument with `argparse` and use it when calling `torch.distributed.init_process_group` to complete distributed initialization[^4], as sketched below.
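A minimal sketch of that initialization, assuming the NCCL backend and the `--local_rank` convention used by `torch.distributed.launch` (the function name `main` is only illustrative):
```python
import argparse

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every process it spawns.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # Bind this process to its GPU before creating the process group.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()} initialized")


if __name__ == "__main__":
    main()
```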
### KeyError: 0 and NCCL Initialization Failures
In multi-node, multi-GPU training, a `KeyError: 0` can also appear when the `c10d` key-value store fails while retrieving the `ncclUniqueId`. This usually indicates a network-communication problem between nodes or a failed NCCL initialization. Check the following (a quick version check is sketched after the list):
- **Network connectivity**: make sure all nodes can reach each other on the chosen port and that no firewall rule blocks the traffic.
- **Consistent IP address and port**: confirm that the master node's IP address and communication port are identical across all processes and are set correctly when calling `init_process_group`.
- **NCCL version compatibility**: verify that the NCCL version matches the installed CUDA and PyTorch versions; upgrade NCCL or switch PyTorch versions if necessary.
- **Resource limits**: make sure the system has not hit limits on file descriptors, memory, or other resources, which can also break NCCL initialization[^2].
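As referenced above, a small diagnostic sketch that prints the versions involved in the compatibility check (all calls are standard torch introspection utilities):
```python
import torch

# Versions that must be mutually compatible for NCCL initialization to succeed.
print("PyTorch :", torch.__version__)
print("CUDA    :", torch.version.cuda)
if torch.cuda.is_available():
    print("NCCL    :", torch.cuda.nccl.version())
    print("GPUs    :", torch.cuda.device_count())

# Running the job with NCCL_DEBUG=INFO set in the environment makes NCCL log
# its initialization steps, which helps locate where the ncclUniqueId exchange
# breaks down.
```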
### ChildFailedError and Child-Process Crashes
`ChildFailedError` means one of the spawned child processes crashed shortly after launch. The warning in the log also suggests decorating the top-level entry point with `@record` so the child's traceback is written to an error file (see the sketch after this list). Common causes include:
- **CUDA initialization failure**: check that the device is available, the GPU driver is installed correctly, and the CUDA version is compatible with PyTorch.
- **Data loader problems**: an error during data loading can terminate a child process early. Setting `num_workers` to 0 temporarily helps rule out DataLoader issues.
- **Logic errors in the code**: process-specific operations (such as model initialization or data sharding) may raise an exception in only one process. Add logging or assertions inside the child processes to locate the exact failure point.
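Following the suggestion printed in the warning above, a minimal sketch that wraps the training entry point with `record` so a failing child process writes its traceback to an error file:
```python
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    # Training entry point; any uncaught exception raised here is recorded to
    # an error file that torchrun/elastic can report inside ChildFailedError.
    ...


if __name__ == "__main__":
    main()
```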
### KeyError: '000164' During Data Processing
In the Opencood project, a `KeyError: '000164'` raised during training for even-numbered sample IDs is usually caused by a faulty data index or by errors while parsing the annotation files. This typically happens in custom dataset loaders when some entries are not handled correctly while building sample paths or reading annotation information. Suggested measures:
- **Check index consistency**: make sure the dataset index list is complete, with no missing or duplicated IDs.
- **Validate the annotation format**: confirm that the annotation files match the structure the code expects, especially field names and types.
- **Add exception handling**: wrap the data-loading function in a `try-except` block that skips problematic items and prints a warning so training can continue[^1].
### Example: Modifying the Data Loader to Skip Corrupted Items
```python
def __getitem__(self, index):
    try:
        # Normal data-loading logic.
        return self._load_data(index)
    except KeyError as e:
        print(f"Skipping corrupted data at index {index}: {e}")
        # Fall back to the next item, wrapping around at the end of the dataset.
        return self.__getitem__((index + 1) % len(self))
```
This approach temporarily works around corrupted or malformed data items without interrupting the overall training run.
torch's `scatter` function
### PyTorch Scatter Function Usage
The `torch.scatter` function allows elements to be copied from the source tensor to specified positions within an output tensor based on provided indices. This operation is particularly useful when performing operations such as updating specific locations in tensors or implementing certain types of neural network layers.
#### Syntax and Parameters
```python
torch.scatter(input, dim, index, src)
```
- **input**: The tensor whose values form the base of the result; `torch.scatter` returns a new tensor, while the in-place variant `Tensor.scatter_` updates the tensor directly.
- **dim**: Dimension along which to index.
- **index**: Tensor containing the indices of elements to update.
- **src**: Source element(s) to copy from; either a tensor, or a scalar (which PyTorch takes as the `value` argument rather than `src`).
If `src` is a tensor, then `input`, `index`, and `src` must all have the same number of dimensions, `index.size(d)` may not exceed `src.size(d)` in any dimension `d`, and `index.size(d)` may not exceed `input.size(d)` in any dimension other than `dim`. If a scalar is given instead, that value is written to every position selected by `index`. Concretely, for a 2-D tensor the result satisfies `out[index[i][j]][j] = src[i][j]` when `dim=0` (and `out[i][index[i][j]] = src[i][j]` when `dim=1`); positions not referenced by `index` keep the values from `input`.
#### Example Code Demonstrating Use Cases
Below are some practical examples demonstrating how to use the `scatter` method:
##### Case 1: Updating Elements Based on Indices
This case shows setting values at given positions according to another tensor's content.
```python
import torch
# Create initial tensor filled with zeros
tensor = torch.zeros(3, 4)
# Define indices indicating which row each element of `updates` is written to.
# The index tensor must list a target row for every column being written;
# an index of shape (2, 1) would only write the first column of rows 0 and 2.
indices = torch.tensor([[0, 0, 0, 0],
                        [2, 2, 2, 2]])
# Values that should replace the existing ones at the indicated spots
updates = torch.full((2, 4), fill_value=99.)
result = tensor.scatter(dim=0, index=indices, src=updates)
print(result)
```
Output:
```
tensor([[99., 99., 99., 99.],
        [ 0.,  0.,  0.,  0.],
        [99., 99., 99., 99.]])
```
##### Case 2: Broadcasting Scalar Value Across Selected Positions
This example broadcasts a single number over multiple positions selected by the index tensor.
```python
# Write the scalar -8. into every column of row 0, modifying `result` in place.
new_result = result.scatter_(dim=0, index=torch.tensor([[0, 0, 0, 0]]), value=-8.)
print(new_result)
```
Note that `-8.` replaces only the first row because `index` explicitly lists row 0 for every column, while the other rows stay untouched. Because `scatter_` ends with an underscore, it modifies `result` in place and also returns it, so `new_result` and `result` refer to the same tensor.
Output:
```
tensor([[-8., -8., -8., -8.],
        [ 0.,  0.,  0.,  0.],
        [99., 99., 99., 99.]])
```
For more advanced applications involving multi-dimensional data structures, refer to the official documentation and related resources such as the NumPy introduction[^1].