```python
import torch
```
Python is an object-oriented, interpreted programming language[^1]. torch is an open-source machine learning library for deep-learning tasks, built on Python[^2]. To use the torch library, import it as follows:
```python
import torch
```
Once the import succeeds, all of the deep-learning functions and tools provided by the torch library are available.
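A quick sanity check after the import (a minimal sketch using only standard torch calls) might look like this:
```python
import torch

# Report the installed version and whether a CUDA-capable GPU is visible.
print(torch.__version__)
print(torch.cuda.is_available())

# Create a small tensor and run a basic operation to confirm the install works.
x = torch.rand(2, 3)
print(x @ x.T)
```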
Related question
```
2025-06-26 16:12:03,164 - mmdet - INFO - Saving checkpoint at 12 epochs
[ ] 0/81, elapsed: 0s, ETA:
/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
[ ] 1/81, 0.3 task/s, elapsed: 3s, ETA: 273s
/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/core/bbox/coders/fut_nms_free_coder.py:78: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.post_center_range = torch.tensor(
/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/core/bbox/coders/map_nms_free_coder.py:82: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.post_center_range = torch.tensor(
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 81/81, 4.5 task/s, elapsed: 18s, ETA: 0s
Traceback (most recent call last):
  File "tools/train.py", line 266, in <module>
    main()
  File "tools/train.py", line 255, in main
    custom_train_model(
  File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/VAD/apis/train.py", line 21, in custom_train_model
    custom_train_detector(
  File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/VAD/apis/mmdet_train.py", line 194, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
    self._do_evaluate(runner)
  File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/core/evaluation/eval_hooks.py", line 88, in _do_evaluate
    key_score = self.evaluate(runner, results)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 361, in evaluate
    eval_res = self.dataloader.dataset.evaluate(
  File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/VAD/projects/mmdet3d_plugin/datasets/nuscenes_vad_dataset.py", line 1781, in evaluate
    all_metric_dict[key] += results[i]['metric_results'][key]
KeyError: 0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2522198) of binary: /home/wangbaihui/anaconda3/envs/vad/bin/python
/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE
**********************************************************************
Child process 2522198 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
          tools/train.py FAILED
=======================================
Root Cause:
[0]:
  time: 2025-06-26_16:12:28
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 2522198)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************
```
`KeyError: 0` and `ChildFailedError` during PyTorch distributed training are usually caused by an incorrectly configured distributed environment or by exceptions raised while loading or evaluating data. In the traceback above, `KeyError: 0` is raised inside `nuscenes_vad_dataset.py`'s `evaluate()` while accumulating `metric_results`, which suggests that the results handed to the evaluation hook do not contain the entries the dataset's `evaluate()` expects. The following checks help diagnose and fix these problems:
### KeyError: 'RANK' and Missing Environment Variables
In some cases the launch script does not set the environment variables that distributed training requires (such as `RANK`, `WORLD_SIZE`, and `LOCAL_RANK`), and the program fails at runtime. PyTorch distributed training relies on these variables to manage the process group and its communication. When launching with `torch.distributed.launch` or `torchrun`, make sure the launch command includes the necessary arguments and that the script reads these variables correctly[^3].
For example, a correct launch command looks like this:
```bash
python -m torch.distributed.launch --nproc_per_node=2 train.py
```
Here `--nproc_per_node` specifies the number of GPUs on a single machine, and the launcher automatically assigns a `LOCAL_RANK` environment variable to each process. Inside the script, parse the `--local_rank` argument with `argparse` and use it when calling `torch.distributed.init_process_group` to complete distributed initialization[^4], as sketched below.
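A minimal sketch of that initialization, assuming the NCCL backend and the `--local_rank` convention used by `torch.distributed.launch` (the function name `main` is only illustrative):
```python
import argparse

import torch
import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every process it spawns.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # Bind this process to its GPU before creating the process group.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()} initialized")


if __name__ == "__main__":
    main()
```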
### KeyError: 0 and NCCL Initialization Failures
In multi-node, multi-GPU training, a `KeyError: 0` can also appear when the `c10d` key-value store fails while retrieving the `ncclUniqueId`. This usually indicates a network-communication problem between nodes or a failed NCCL initialization. Check the following (a quick version check is sketched after the list):
- **Network connectivity**: make sure all nodes can reach each other on the chosen port and that no firewall rule blocks the traffic.
- **Consistent IP address and port**: confirm that the master node's IP address and communication port are identical across all processes and are set correctly when calling `init_process_group`.
- **NCCL version compatibility**: verify that the NCCL version matches the installed CUDA and PyTorch versions; upgrade NCCL or switch PyTorch versions if necessary.
- **Resource limits**: make sure the system has not hit limits on file descriptors, memory, or other resources, which can also break NCCL initialization[^2].
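As referenced above, a small diagnostic sketch that prints the versions involved in the compatibility check (all calls are standard torch introspection utilities):
```python
import torch

# Versions that must be mutually compatible for NCCL initialization to succeed.
print("PyTorch :", torch.__version__)
print("CUDA    :", torch.version.cuda)
if torch.cuda.is_available():
    print("NCCL    :", torch.cuda.nccl.version())
    print("GPUs    :", torch.cuda.device_count())

# Running the job with NCCL_DEBUG=INFO set in the environment makes NCCL log
# its initialization steps, which helps locate where the ncclUniqueId exchange
# breaks down.
```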
### ChildFailedError and Child-Process Crashes
`ChildFailedError` means one of the spawned child processes crashed shortly after launch. The warning in the log also suggests decorating the top-level entry point with `@record` so the child's traceback is written to an error file (see the sketch after this list). Common causes include:
- **CUDA initialization failure**: check that the device is available, the GPU driver is installed correctly, and the CUDA version is compatible with PyTorch.
- **Data loader problems**: an error during data loading can terminate a child process early. Setting `num_workers` to 0 temporarily helps rule out DataLoader issues.
- **Logic errors in the code**: process-specific operations (such as model initialization or data sharding) may raise an exception in only one process. Add logging or assertions inside the child processes to locate the exact failure point.
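Following the suggestion printed in the warning above, a minimal sketch that wraps the training entry point with `record` so a failing child process writes its traceback to an error file:
```python
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main():
    # Training entry point; any uncaught exception raised here is recorded to
    # an error file that torchrun/elastic can report inside ChildFailedError.
    ...


if __name__ == "__main__":
    main()
```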
### KeyError: '000164' During Data Processing
In the Opencood project, a `KeyError: '000164'` raised during training for even-numbered sample IDs is usually caused by a faulty data index or by errors while parsing the annotation files. This typically happens in custom dataset loaders when some entries are not handled correctly while building sample paths or reading annotation information. Suggested measures:
- **Check index consistency**: make sure the dataset index list is complete, with no missing or duplicated IDs.
- **Validate the annotation format**: confirm that the annotation files match the structure the code expects, especially field names and types.
- **Add exception handling**: wrap the data-loading function in a `try-except` block that skips problematic items and prints a warning so training can continue[^1].
### Example: Modifying the Data Loader to Skip Corrupted Items
```python
def __getitem__(self, index):
    try:
        # Normal data-loading logic.
        return self._load_data(index)
    except KeyError as e:
        print(f"Skipping corrupted data at index {index}: {e}")
        # Fall back to the next item, wrapping around at the end of the dataset.
        return self.__getitem__((index + 1) % len(self))
```
This approach temporarily works around corrupted or malformed data items without interrupting the overall training run.
torch's `scatter` function
### PyTorch Scatter Function Usage
The `torch.scatter` function allows elements to be copied from the source tensor to specified positions within an output tensor based on provided indices. This operation is particularly useful when performing operations such as updating specific locations in tensors or implementing certain types of neural network layers.
#### Syntax and Parameters
```python
torch.scatter(input, dim, index, src)
```
- **input**: The tensor whose values form the base of the result; `torch.scatter` returns a new tensor, while the in-place variant `Tensor.scatter_` updates the tensor directly.
- **dim**: Dimension along which to index.
- **index**: Tensor containing the indices of elements to update.
- **src**: Source element(s) to copy from; either a tensor, or a scalar (which PyTorch takes as the `value` argument rather than `src`).
If `src` is a tensor, then `input`, `index`, and `src` must all have the same number of dimensions, `index.size(d)` may not exceed `src.size(d)` in any dimension `d`, and `index.size(d)` may not exceed `input.size(d)` in any dimension other than `dim`. If a scalar is given instead, that value is written to every position selected by `index`. Concretely, for a 2-D tensor the result satisfies `out[index[i][j]][j] = src[i][j]` when `dim=0` (and `out[i][index[i][j]] = src[i][j]` when `dim=1`); positions not referenced by `index` keep the values from `input`.
#### Example Code Demonstrating Use Cases
Below are some practical examples demonstrating how to use the `scatter` method:
##### Case 1: Updating Elements Based on Indices
This case shows setting values at given positions according to another tensor's content.
```python
import torch
# Create initial tensor filled with zeros
tensor = torch.zeros(3, 4)
# Define indices indicating which row each element of `updates` is written to.
# The index tensor must list a target row for every column being written;
# an index of shape (2, 1) would only write the first column of rows 0 and 2.
indices = torch.tensor([[0, 0, 0, 0],
                        [2, 2, 2, 2]])
# Values that should replace the existing ones at the indicated spots
updates = torch.full((2, 4), fill_value=99.)
result = tensor.scatter(dim=0, index=indices, src=updates)
print(result)
```
Output:
```
tensor([[99., 99., 99., 99.],
        [ 0.,  0.,  0.,  0.],
        [99., 99., 99., 99.]])
```
##### Case 2: Broadcasting Scalar Value Across Selected Positions
This example broadcasts a single number over multiple positions selected by the index tensor.
```python
# Write the scalar -8. into every column of row 0, modifying `result` in place.
new_result = result.scatter_(dim=0, index=torch.tensor([[0, 0, 0, 0]]), value=-8.)
print(new_result)
```
Note that `-8.` replaces only the first row because `index` explicitly lists row 0 for every column, while the other rows stay untouched. Because `scatter_` ends with an underscore, it modifies `result` in place and also returns it, so `new_result` and `result` refer to the same tensor.
Output:
```
tensor([[-8., -8., -8., -8.],
        [ 0.,  0.,  0.,  0.],
        [99., 99., 99., 99.]])
```
For more advanced applications involving multi-dimensional data structures, refer to the official documentation and related resources such as the NumPy introduction[^1].