
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "test.py", line 213, in <module>
[rank0]:     main()
[rank0]:   File "test.py", line 209, in main
[rank0]:     eval_single_ckpt(model, test_loader, args, eval_output_dir, logger, epoch_id, dist_test=dist_test)
[rank0]:   File "test.py", line 72, in eval_single_ckpt
[rank0]:     eval_utils.eval_one_epoch(
[rank0]:   File "/home/mtr/MTR/tools/eval_utils/eval_utils.py", line 42, in eval_one_epoch
[rank0]:     for i, batch_dict in enumerate(dataloader):
[rank0]:   File "/root/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]:     data = self._next_data()
[rank0]:   File "/root/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:   File "/root/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/root/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/_utils.py", line 706, in reraise
[rank0]:     raise exception
[rank0]: AssertionError: Caught AssertionError in DataLoader worker process 2.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/root/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:   File "/root/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:   File "/root/anaconda3/envs/mtr/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:   File "/home/mtr/MTR/tools/../mtr/datasets/waymo/waymo_dataset.py", line 68, in __getitem__
[rank0]:
```

Posted: 2025-07-12 12:42:15 · Views: 12
<think>OK, I need to resolve the AssertionError the user is getting from a Python DataLoader. The reported message is "AssertionError: Caught AssertionError in DataLoader worker process 0", and the user cites a few related examples, such as one where changing the image format and path fixed the problem, so first I need to understand the common causes and fixes. An assertion error while the DataLoader is loading data is usually related to preprocessing, data paths, or data format. In the example from reference [3], switching the image extension from jpg to png solved it, which points to an incorrect file path or format, so the first step is to check the data-loading code: do the paths exist, are the file names right, are any files corrupted. Next, consider assert statements in the preprocessing code: if a sample's shape or type does not match what an assert expects, it raises exactly this error, so every returned sample (image tensor shape, label type, and so on) should be verified. Multi-process loading can also be the culprit: the DataLoader uses worker processes, and operations that are not process-safe (shared resources without locks, for example) can fail there; setting num_workers to 0 rules that factor out. Corrupted files with the right extension but broken contents are another possibility, so the files should be validated, for instance by reading them with PIL or OpenCV. Finally, since the cited fix was changing a hard-coded extension, checking how the data paths are generated and whether they match the actual files is a key step. In summary, the checks are: data paths and formats, assert conditions in preprocessing, a num_workers=0 test, data integrity, and process safety.</think>

### Resolving the AssertionError raised by the DataLoader

When you hit `AssertionError: Caught AssertionError in DataLoader worker process 0`, the cause is usually in data loading, multi-process workers, or the data format. Work through the following checks:

---

#### **1. Check data paths and file formats**
- **Likely cause**: the data path is wrong, or the file extension does not match the actual format (e.g. the code reads `.jpg` but the files are `.png`).
- **What to do**:
  - Confirm the dataset path is correct and the file names are complete.
  - Check the code that builds file paths and make sure the extension matches the files on disk.
- **Example change** (see reference [3]):
```python
# The original code might be:
self._imgpath = os.path.join("%s", "JPEGImages", "%s.jpg")
# If the files are actually PNGs, change it to:
self._imgpath = os.path.join("%s", "JPEGImages", "%s.png")
```

---

#### **2. Verify the preprocessing logic**
- **Likely cause**: an `assert` inside the custom dataset fails, e.g. a tensor has the wrong shape or a label is out of range.
- **What to do**:
  - Review the processing steps in `__getitem__`, especially the `assert` statements.
  - Make sure every returned sample (image tensor, label, etc.) matches what the model expects.
- **Debugging example**:
```python
def __getitem__(self, index):
    img_path = self.img_list[index]
    label = self.label_list[index]
    img = Image.open(img_path).convert("RGB")
    # Check that the image actually loaded
    assert img is not None, f"Failed to load image: {img_path}"
    return img, label
```

---

#### **3. Adjust the DataLoader's num_workers**
- **Likely cause**: with multi-process loading (`num_workers > 0`), some operations are not safe across worker processes (shared file handles, mutated globals).
- **What to do**:
  - Set `num_workers` to 0 and rerun. If the error disappears, the problem is related to the worker processes.
  - Check the loading code for operations that are not process-safe (e.g. modifying globals on the fly).
- **Example change**:
```python
dataloader = DataLoader(dataset, batch_size=32, num_workers=0)  # temporarily disable workers
```

---

#### **4. Check data integrity**
- **Likely cause**: corrupted or malformed data files (e.g. an image whose contents are damaged even though the extension is correct).
- **What to do**:
  - Walk the dataset with `PIL.Image.open()` or `cv2.imread()` and confirm every file can be read.
  - Add exception handling so corrupted files are skipped.
- **Example code**:
```python
from PIL import Image

def safe_load_image(path):
    try:
        img = Image.open(path)
        img.verify()  # validate file integrity
        return img
    except Exception as e:
        print(f"Corrupted file: {path}, error: {e}")
        return None
```

---

#### **5. Check third-party library versions**
- **Likely cause**: incompatible PyTorch or related library versions can make internal assertions fail.
- **What to do**:
  - Update PyTorch and Torchvision to the latest stable releases.
  - Check whether the code depends on behavior of a specific version.

---

### Summary
The steps above narrow the problem down to one of:
1. Wrong file path or format → fix the path or extension[^3].
2. Failed assertion in preprocessing → adjust the logic or clean the data.
3. Multi-process issue → reduce `num_workers` or fix the offending code.
4. Corrupted data → repair or skip the damaged files.

---
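For the MTR/Waymo traceback at the top, the assertion fires inside `waymo_dataset.py`'s `__getitem__`, and the worker re-raise hides the original assertion message. A minimal debugging sketch that indexes the dataset directly in the main process (equivalent to `num_workers=0`) so the real assertion and the failing index surface; `build_waymo_dataset` is a placeholder name, not MTR's actual API:

```python
import traceback

def find_bad_samples(dataset, limit=None):
    """Index the dataset directly (no worker processes) and collect indices
    whose __getitem__ raises AssertionError, printing the real traceback."""
    bad = []
    n = len(dataset) if limit is None else min(limit, len(dataset))
    for idx in range(n):
        try:
            _ = dataset[idx]  # same call the worker's fetcher makes
        except AssertionError:
            bad.append(idx)
            traceback.print_exc()
    return bad

# Usage sketch -- how the dataset is built depends on your MTR config:
# dataset = build_waymo_dataset(cfg)   # placeholder, not MTR's real API
# print(find_bad_samples(dataset, limit=200))
```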

Related recommendations

Traceback (most recent call last): File "heyuanjumin.py", line 7, in <module> File "<frozen importlib._bootstrap>", line 1007, in _find_and_load File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 680, in _load_unlocked File "PyInstaller\loader\pyimod02_importers.py", line 450, in exec_module File "ultralytics\__init__.py", line 5, in <module> from ultralytics.data.explorer.explorer import Explorer File "<frozen importlib._bootstrap>", line 1007, in _find_and_load File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 680, in _load_unlocked File "PyInstaller\loader\pyimod02_importers.py", line 450, in exec_module File "ultralytics\data\__init__.py", line 3, in <module> from .base import BaseDataset File "<frozen importlib._bootstrap>", line 1007, in _find_and_load File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 680, in _load_unlocked File "PyInstaller\loader\pyimod02_importers.py", line 450, in exec_module File "ultralytics\data\base.py", line 17, in <module> from ultralytics.utils import DEFAULT_CFG, LOCAL_RANK, LOGGER, NUM_THREADS, TQDM File "<frozen importlib._bootstrap>", line 1007, in _find_and_load File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 680, in _load_unlocked File "PyInstaller\loader\pyimod02_importers.py", line 450, in exec_module File "ultralytics\utils\__init__.py", line 271, in <module> LOGGER = set_logging(LOGGING_NAME, verbose=VERBOSE) # define globally (used in train.py, val.py, predict.py, etc.) File "ultralytics\utils\__init__.py", line 238, in set_logging if WINDOWS and sys.stdout.encoding != "utf-8": AttributeError: 'NoneType' object has no attribute 'encoding'
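A note on this related traceback: it comes from a PyInstaller build (typically a windowed/`--noconsole` build) in which `sys.stdout` is `None`, so ultralytics' `set_logging` fails on `sys.stdout.encoding`. A minimal sketch of one common workaround, assuming you can edit the entry script and that redirecting to `os.devnull` is acceptable (rebuilding with a console window also avoids the problem):

```python
# Guard for frozen (PyInstaller) windowed apps where sys.stdout/sys.stderr are None.
# Must run before `from ultralytics import YOLO`.
import os
import sys

if sys.stdout is None:
    sys.stdout = open(os.devnull, "w", encoding="utf-8")
if sys.stderr is None:
    sys.stderr = open(os.devnull, "w", encoding="utf-8")

from ultralytics import YOLO  # imported only after the guard
```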

[rank0]: Traceback (most recent call last): [rank0]: File "train.py", line 113, in <module> [rank0]: main(args) [rank0]: File "train.py", line 75, in main [rank0]: solver.fit() [rank0]: File "/root/D-FINE-master/src/solver/det_solver.py", line 76, in fit [rank0]: train_stats = train_one_epoch( [rank0]: File "/root/D-FINE-master/src/solver/det_engine.py", line 59, in train_one_epoch [rank0]: for i, (samples, targets) in enumerate( [rank0]: File "/root/D-FINE-master/src/misc/logger.py", line 217, in log_every [rank0]: for obj in iterable: [rank0]: File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in __next__ [rank0]: data = self._next_data() [rank0]: File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data [rank0]: return self._process_data(data) [rank0]: File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data [rank0]: data.reraise() [rank0]: File "/root/miniconda3/lib/python3.8/site-packages/torch/_utils.py", line 706, in reraise [rank0]: raise exception [rank0]: NameError: Caught NameError in DataLoader worker process 0. [rank0]: Original Traceback (most recent call last): [rank0]: File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop [rank0]: data = fetcher.fetch(index) # type: ignore[possibly-undefined] [rank0]: File "/root/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch [rank0]: return self.collate_fn(data) [rank0]: File "/root/D-FINE-master/src/data/dataloader.py", line 111, in __call__ [rank0]: if isinstance(img, PIL.Image.Image): [rank0]: NameError: name 'PIL' is not defined E0716 17:34:01.237514 139855299663040 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 799816) of binary: /root/miniconda3/bin/python Traceback (most recent call last): File "/root/miniconda3/bin/torchrun", line 8, in <module> sys.exit(main()) File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper return f(*args, **kwargs) File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main run(args) File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run elastic_launch( File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ train.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-07-16_17:34:01 host : autodl-container-9efe41855c-0b570216 rank : 0 (local_rank: 0) exitcode : 1 (pid: 799816) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================
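In this related case the custom collate function in `src/data/dataloader.py` references `PIL.Image.Image` without importing `PIL`. A minimal sketch of the missing import, assuming the rest of the file stays as the traceback shows:

```python
# At the top of src/data/dataloader.py: make the PIL names the collate
# function uses actually importable.
import PIL
import PIL.Image  # ensures PIL.Image.Image is resolvable

# The existing check inside __call__ can then run unchanged:
# if isinstance(img, PIL.Image.Image):
#     ...
```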

D:\Anaconda_envs\envs\cpd\python.exe E:\CarPlateDetection\CarPlateDetection\MainProgram.py RuntimeError: module compiled against ABI version 0x1000009 but this version of numpy is 0x2000000 Traceback (most recent call last): File "E:\CarPlateDetection\CarPlateDetection\MainProgram.py", line 8, in <module> from ultralytics import YOLO File "D:\Anaconda_envs\envs\cpd\lib\site-packages\ultralytics\__init__.py", line 5, in <module> from ultralytics.models import RTDETR, SAM, YOLO File "D:\Anaconda_envs\envs\cpd\lib\site-packages\ultralytics\models\__init__.py", line 3, in <module> from .rtdetr import RTDETR File "D:\Anaconda_envs\envs\cpd\lib\site-packages\ultralytics\models\rtdetr\__init__.py", line 3, in <module> from .model import RTDETR File "D:\Anaconda_envs\envs\cpd\lib\site-packages\ultralytics\models\rtdetr\model.py", line 10, in <module> from ultralytics.engine.model import Model File "D:\Anaconda_envs\envs\cpd\lib\site-packages\ultralytics\engine\model.py", line 8, in <module> from ultralytics.cfg import TASK2DATA, get_cfg, get_save_dir File "D:\Anaconda_envs\envs\cpd\lib\site-packages\ultralytics\cfg\__init__.py", line 10, in <module> from ultralytics.utils import (ASSETS, DEFAULT_CFG, DEFAULT_CFG_DICT, DEFAULT_CFG_PATH, LOGGER, RANK, ROOT, RUNS_DIR, File "D:\Anaconda_envs\envs\cpd\lib\site-packages\ultralytics\utils\__init__.py", line 18, in <module> import cv2 File "D:\Anaconda_envs\envs\cpd\lib\site-packages\cv2\__init__.py", line 181, in <module> bootstrap() File "D:\Anaconda_envs\envs\cpd\lib\site-packages\cv2\__init__.py", line 153, in bootstrap native_module = importlib.import_module("cv2") File "D:\Anaconda_envs\envs\cpd\lib\importlib\__init__.py", line 127, in import_module return _bootstrap._gcd_import(name[level:], package, level) ImportError: numpy.core.multiarray failed to import
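This related failure is a binary-compatibility mismatch: the `0x2000000` ABI tag means NumPy 2.x is installed while the compiled `cv2` extension was built against NumPy 1.x. A small diagnostic sketch, with the usual remedies noted as comments (the package pins are suggestions, not the only valid combination):

```python
# The 0x2000000 ABI tag in the error corresponds to NumPy 2.x; a compiled
# extension (here cv2) built against NumPy 1.x cannot import against it.
import numpy as np

print("numpy", np.__version__)  # a 2.x version here confirms the mismatch

# Typical remedies (shell commands, shown as comments because they are not Python):
#   pip install "numpy<2"                  # downgrade numpy, or
#   pip install --upgrade opencv-python    # install a cv2 wheel built for numpy 2
```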

Traceback (most recent call last): File "/group/40156/ruidonghan/one_rec/generative-recommenders/generative_recommenders/dlrm_v3/train/train_ranker.py", line 27, in <module> from generative_recommenders.dlrm_v3.train.utils import ( File "/group/40156/ruidonghan/one_rec/generative-recommenders/generative_recommenders/dlrm_v3/train/utils.py", line 43, in <module> from generative_recommenders.dlrm_v3.utils import get_dataset, MetricsLogger, Profiler File "/group/40156/ruidonghan/one_rec/generative-recommenders/generative_recommenders/dlrm_v3/utils.py", line 39, in <module> from torchrec.metrics.auc import AUCMetricComputation File "/data/miniconda3/envs/openr1/lib/python3.11/site-packages/torchrec/metrics/auc.py", line 15, in <module> from torchmetrics.utilities.distributed import gather_all_tensors File "/data/miniconda3/envs/openr1/lib/python3.11/site-packages/torchmetrics/__init__.py", line 14, in <module> from torchmetrics import functional # noqa: E402 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/miniconda3/envs/openr1/lib/python3.11/site-packages/torchmetrics/functional/__init__.py", line 14, in <module> from torchmetrics.functional.audio._deprecated import _permutation_invariant_training as permutation_invariant_training File "/data/miniconda3/envs/openr1/lib/python3.11/site-packages/torchmetrics/functional/audio/__init__.py", line 14, in <module> from torchmetrics.functional.audio.pit import permutation_invariant_training, pit_permutate File "/data/miniconda3/envs/openr1/lib/python3.11/site-packages/torchmetrics/functional/audio/pit.py", line 23, in <module> from torchmetrics.utilities import rank_zero_warn File "/data/miniconda3/envs/openr1/lib/python3.11/site-packages/torchmetrics/utilities/__init__.py", line 14, in <module> from torchmetrics.utilities.checks import check_forward_full_state_property File "/data/miniconda3/envs/openr1/lib/python3.11/site-packages/torchmetrics/utilities/checks.py",

The easiest method is to set these DeepSpeed config values to 'auto'. [rank1]: Traceback (most recent call last): [rank1]: File "/data1/users/heyu/find_size_and_weight/train711.py", line 625, in <module> [rank1]: train() [rank1]: File "/data1/users/heyu/find_size_and_weight/train711.py", line 617, in train [rank1]: train_result = trainer.train() [rank1]: File "/data1/users/heyu/uv_env/pyhy/lib/python3.10/site-packages/transformers/trainer.py", line 2240, in train [rank1]: return inner_training_loop( [rank1]: File "/data1/users/heyu/uv_env/pyhy/lib/python3.10/site-packages/transformers/trainer.py", line 2322, in _inner_training_loop [rank1]: self.optimizer, self.lr_scheduler = deepspeed_init(self, num_training_steps=max_steps) [rank1]: File "/data1/users/heyu/uv_env/pyhy/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 444, in deepspeed_init [rank1]: hf_deepspeed_config.trainer_config_finalize(args, model, num_training_steps) [rank1]: File "/data1/users/heyu/uv_env/pyhy/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 268, in trainer_config_finalize [rank1]: raise ValueError( [rank1]: ValueError: Please correct the following DeepSpeed config values that mismatch TrainingArguments values: [rank1]: - ds scheduler.params.warmup_max_lr=0.0001 vs hf learning_rate=1e-05 [rank1]: The easiest method is to set these DeepSpeed config values to 'auto'. 2025-07-11 16:02:32,818 - ERROR - Training failed: Please correct the following DeepSpeed config values that mismatch TrainingArguments values: - ds scheduler.params.warmup_max_lr=0.0001 vs hf learning_rate=1e-05 The easiest method is to set these DeepSpeed config values to 'auto'.上述代码报错,修改
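(The trailing note in the original asks how to fix this error.) The message itself states the fix: the DeepSpeed scheduler's `warmup_max_lr` conflicts with the Hugging Face `learning_rate`, and the mismatched values should be set to `"auto"` so the Trainer fills them in. A minimal sketch of the relevant scheduler block, assuming a `WarmupLR` scheduler and that the JSON file is whatever `--deepspeed` points at:

```python
# Sketch of the scheduler section of the DeepSpeed config, with the values
# that must track TrainingArguments set to "auto" (HF substitutes the real
# numbers at deepspeed_init time).
ds_config = {
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",   # was 0.0001, conflicting with learning_rate=1e-05
            "warmup_num_steps": "auto",
        },
    },
}

import json
print(json.dumps(ds_config, indent=2))  # merge this into the JSON file passed via --deepspeed
```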

Traceback (most recent call last): File "C:\Anaconda\envs\pytorch\lib\code.py", line 90, in runcode exec(code, self.locals) File "<input>", line 1, in <module> File "D:\DevTools\PyCharm 2021.3.1\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile pydev_imports.execfile(filename, global_vars, local_vars) # execute the script File "D:\DevTools\PyCharm 2021.3.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile exec(compile(contents+"\n", file, 'exec'), glob, loc) File "C:/Users/31035/PycharmProjects/pythonProject1/testModels.py", line 32, in <module> dataset = dataset.map(preprocess_function, batched=True) File "C:\Anaconda\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) File "C:\Anaconda\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py", line 3074, in map for rank, done, content in Dataset._map_single(**dataset_kwargs): File "C:\Anaconda\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py", line 3516, in _map_single for i, batch in iter_outputs(shard_iterable): File "C:\Anaconda\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py", line 3466, in iter_outputs yield i, apply_function(example, i, offset=offset) File "C:\Anaconda\envs\pytorch\lib\site-packages\datasets\arrow_dataset.py", line 3389, in apply_function processed_inputs = function(*fn_args, *additional_args, **fn_kwargs) File "C:/Users/31035/PycharmProjects/pythonProject1/testModels.py", line 10, in preprocess_function targets = [str(label) for label in examples["label"]] # 标签需转换为字符串 File "C:\Anaconda\envs\pytorch\lib\site-packages\datasets\formatting\formatting.py", line 280, in __getitem__ value = self.data[key] KeyError: 'label'
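Here `datasets.map` fails because the split has no column literally named `label`. A minimal sketch of how to inspect the real column names and reference the right one; the `"labels"` and `"text"` names and the returned fields are placeholders:

```python
# The KeyError means this split has no column literally named "label".
# Print the real column names, then reference the right ones in the map function.
print(dataset.column_names)   # e.g. ['text', 'labels'] -- adapt the names below

LABEL_COLUMN = "labels"       # placeholder: replace with the name printed above

def preprocess_function(examples):
    targets = [str(label) for label in examples[LABEL_COLUMN]]  # labels as strings
    # ... tokenize inputs/targets here as in the original script ...
    return {"targets": targets}

dataset = dataset.map(preprocess_function, batched=True)
```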

home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/numba/types/__init__.py:110: FutureWarning: In the future np.long will be defined as the corresponding NumPy scalar. long_ = _make_signed(np.long) Traceback (most recent call last): File "tools/train.py", line 24, in <module> from mmdet3d.datasets import build_dataset File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/mmdet3d/datasets/__init__.py", line 4, in <module> from .custom_3d import Custom3DDataset File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/mmdet3d/datasets/custom_3d.py", line 10, in <module> from ..core.bbox import get_box_type File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/mmdet3d/core/__init__.py", line 4, in <module> from .evaluation import * # noqa: F401, F403 File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/mmdet3d/core/evaluation/__init__.py", line 3, in <module> from .kitti_utils import kitti_eval, kitti_eval_coco_style File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/mmdet3d/core/evaluation/kitti_utils/__init__.py", line 2, in <module> from .eval import kitti_eval, kitti_eval_coco_style File "/media/wangbaihui/1ecf654b-afad-4dab-af7b-e34b00dda87a/mmdetection3d/mmdet3d/core/evaluation/kitti_utils/eval.py", line 4, in <module> import numba File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/numba/__init__.py", line 15, in <module> from . import config, errors, _runtests as runtests, types File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/numba/types/__init__.py", line 110, in <module> long_ = _make_signed(np.long) File "/home/wangbaihui/.local/lib/python3.8/site-packages/numpy/__init__.py", line 320, in __getattr__ raise AttributeError("module {!r} has no attribute " AttributeError: module 'numpy' has no attribute 'long' ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2520454) of binary: /home/wangbaihui/anaconda3/envs/vad/bin/python /home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning: ********************************************************************** CHILD PROCESS FAILED WITH NO ERROR_FILE ********************************************************************** CHILD PROCESS FAILED WITH NO ERROR_FILE Child process 2520454 (local_rank 0) FAILED (exitcode 1) Error msg: Process failed with exitcode 1 Without writing an error file to <N/A>. While this DOES NOT affect the correctness of your application, no trace information about the error will be available for inspection. Consider decorating your top level entrypoint function with torch.distributed.elastic.multiprocessing.errors.record. 
Example: from torch.distributed.elastic.multiprocessing.errors import record @record def trainer_main(args): # do train ********************************************************************** warnings.warn(_no_error_file_warning_msg(rank, failure)) Traceback (most recent call last): File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module> main() File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper return f(*args, **kwargs) File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main run(args) File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run elastic_launch( File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/wangbaihui/anaconda3/envs/vad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: *************************************** tools/train.py FAILED ======================================= Root Cause: [0]: time: 2025-06-26_15:09:12 rank: 0 (local_rank: 0) exitcode: 1 (pid: 2520454) error_file: <N/A> msg: "Process failed with exitcode 1" ======================================= Other Failures: <NO_OTHER_FAILURES> ***************************************
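This related error comes from NumPy ≥ 1.24, which removed the `np.long` alias that this older numba release still references (note the traceback resolves NumPy from `~/.local`, which may be shadowing the conda environment's copy). Pinning `numpy<1.24`, removing the `~/.local` copy, or upgrading numba is the cleaner fix; a stop-gap sketch, assuming it can go at the very top of `tools/train.py` before mmdet3d (and hence numba) is imported:

```python
# Stop-gap: restore the removed alias before anything imports numba.
import numpy as np

if not hasattr(np, "long"):
    np.long = int  # np.long was an alias of Python int before its removal

# mmdet3d / numba imports can follow as usual.
```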

(/home/ubuntu/WorkSpace/env1/xunlian) (xunlian) ubuntu@ubun:~/WorkSpace/xqs/Open-GroundingDino$ pip install yapf==0.32.0 Collecting yapf==0.32.0 Using cached yapf-0.32.0-py2.py3-none-any.whl.metadata (34 kB) Using cached yapf-0.32.0-py2.py3-none-any.whl (190 kB) Installing collected packages: yapf Successfully installed yapf-0.32.0 (/home/ubuntu/WorkSpace/env1/xunlian) (xunlian) ubuntu@ubun:~/WorkSpace/xqs/Open-GroundingDino$ bash /home/ubuntu/WorkSpace/xqs/Open-GroundingDino/scripts/train_dist.sh Traceback (most recent call last): File "/home/ubuntu/WorkSpace/xqs/Open-GroundingDino/main.py", line 19, in <module> from util.slconfig import DictAction, SLConfig File "/home/ubuntu/WorkSpace/xqs/Open-GroundingDino/util/slconfig.py", line 16, in <module> from yapf.yapflib.code_formatting import FormatCode ModuleNotFoundError: No module named 'yapf.yapflib.code_formatting' E0610 11:10:05.385000 952921 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 952958) of binary: /home/ubuntu/WorkSpace/env1/xunlian/bin/python3.11 Traceback (most recent call last): File "/home/ubuntu/WorkSpace/env1/xunlian/bin/torchrun", line 8, in <module> sys.exit(main()) ^^^^^^ File "/home/ubuntu/WorkSpace/env1/xunlian/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) ^^^^^^^^^^^^^^^^^^ File "/home/ubuntu/WorkSpace/env1/xunlian/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/ubuntu/WorkSpace/env1/xunlian/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/ubuntu/WorkSpace/env1/xunlian/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ubuntu/WorkSpace/env1/xunlian/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ main.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-06-10_11:10:05 host : UBUN rank : 0 (local_rank: 0) exitcode : 1 (pid: 952958) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
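In this related case `util/slconfig.py` imports `FormatCode` from a module path that yapf 0.32.0 does not provide; in that release `FormatCode` lives in `yapf.yapflib.yapf_api`. A minimal sketch of a tolerant import, assuming editing `slconfig.py` is acceptable:

```python
# In util/slconfig.py: try the layout the repo expected, then fall back to
# the location used by yapf 0.32.0.
try:
    from yapf.yapflib.code_formatting import FormatCode
except ImportError:
    from yapf.yapflib.yapf_api import FormatCode

# Usage stays the same, e.g.:
# text, _ = FormatCode(text, style_config="pep8")
```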

Traceback (most recent call last): File "<string>", line 1, in <module> File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main exitcode = _main(fd, parent_sentinel) File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 125, in _main prepare(preparation_data) File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 236, in prepare _fixup_main_from_path(data['init_main_from_path']) File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path main_content = runpy.run_path(main_path, File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 265, in run_path return _run_module_code(code, init_globals, run_name, File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 97, in _run_module_code _run_code(code, mod_globals, init_globals, File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\runpy.py", line 87, in _run_code exec(code, run_globals) File "e:\JZKJ-WORKSPACE\JZKJ-MARKIDENT\v8_train.py", line 8, in <module> results = model.train( File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\ultralytics\engine\model.py", line 791, in train self.trainer.train() File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\ultralytics\engine\trainer.py", line 211, in train self._do_train(world_size) File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\ultralytics\engine\trainer.py", line 326, in _do_train self._setup_train(world_size) File "D:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\ultralytics\engine\trainer.py", line 290, in _setup_train self.train_loader = self.get_dataloader(self.trainset, batch_size=batch_size, rank=LOCAL_RANK, mode="train") File "D:\Users\us
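This (truncated) related traceback is the classic Windows `spawn` re-import problem: `v8_train.py` calls `model.train()` at module level, so every DataLoader worker that re-imports the script starts training again. A minimal sketch of the usual `__main__` guard; the weights and training arguments are placeholders:

```python
# v8_train.py -- on Windows, worker processes are started with "spawn",
# which re-imports this script; the guard keeps the children from
# calling model.train() again.
from ultralytics import YOLO

def main():
    model = YOLO("yolov8n.pt")  # placeholder weights
    model.train(data="data.yaml", epochs=100, workers=4)  # placeholder args

if __name__ == "__main__":
    main()
```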

(VScode) zhaohy@workspace2:~/VSCode-main$ python train_test_eval.py --Training True --Testing True --Evaluation True /home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) /home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) W0311 15:54:21.788752 136232691705664 torch/multiprocessing/spawn.py:145] Terminating process 2342884 via signal SIGTERM Traceback (most recent call last): File "train_test_eval.py", line 67, in <module> Training.train_net(num_gpus=num_gpus, args=args) File "/home/zhaohy/VSCode-main/Training.py", line 64, in train_net mp.spawn(main, nprocs=num_gpus, args=(num_gpus, args)) File "/home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn return start_processes(fn, args, nprocs, join, daemon, start_method="spawn") File "/home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes while not context.join(): File "/home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in join raise ProcessRaisedException(msg, error_index, failed_process.pid) torch.multiprocessing.spawn.ProcessRaisedException: -- Process 1 terminated with the following error: Traceback (most recent call last): File "/home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap fn(i, *args) File "/home/zhaohy/VSCode-main/Training.py", line 71, in main dist.init_process_group(backend='nccl', init_method='env://') # 自动从环境变量读取 RANK/WORLD_SIZE File "/home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper return func(*args, **kwargs) File "/home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper func_return = func(*args, **kwargs) File "/home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group store, rank, world_size = next(rendezvous_iterator) File "/home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 234, in _env_rendezvous_handler rank = int(_get_env_or_raise("RANK")) File "/home/zhaohy/anaconda3/envs/VScode/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 219, in _get_env_or_raise raise _env_error(env_var) ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
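Here `init_process_group(init_method='env://')` is called from processes created by `mp.spawn`, which, unlike `torchrun`, does not set `RANK`/`WORLD_SIZE`. A minimal sketch of setting the required environment variables inside the spawned function, assuming a single node where the rank equals the local rank (launching with `torchrun` instead is the other common fix):

```python
# Sketch for Training.py's spawned main(): mp.spawn passes the process index
# as the first argument; env:// rendezvous needs these variables set before
# init_process_group is called.
import os
import torch.distributed as dist

def main(local_rank, num_gpus, args):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")  # any free port
    os.environ["RANK"] = str(local_rank)           # single node: rank == local_rank
    os.environ["WORLD_SIZE"] = str(num_gpus)
    dist.init_process_group(backend="nccl", init_method="env://")
    # ... rest of the training setup ...
```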

INFO 07-06 22:09:23 __init__.py:207] Automatically detected platform cuda. (VllmWorkerProcess pid=10542) INFO 07-06 22:09:23 multiproc_worker_utils.py:229] Worker ready; awaiting tasks (VllmWorkerProcess pid=10542) INFO 07-06 22:09:24 cuda.py:178] Cannot use FlashAttention-2 backend for Volta and Turing GPUs. (VllmWorkerProcess pid=10542) INFO 07-06 22:09:24 cuda.py:226] Using XFormers backend. (VllmWorkerProcess pid=10542) INFO 07-06 22:09:25 utils.py:916] Found nccl from library libnccl.so.2 (VllmWorkerProcess pid=10542) INFO 07-06 22:09:25 pynccl.py:69] vLLM is using nccl==2.21.5 INFO 07-06 22:09:25 utils.py:916] Found nccl from library libnccl.so.2 INFO 07-06 22:09:25 pynccl.py:69] vLLM is using nccl==2.21.5 INFO 07-06 22:09:25 custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json INFO 07-06 22:09:39 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json (VllmWorkerProcess pid=10542) INFO 07-06 22:09:39 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json WARNING 07-06 22:09:39 custom_all_reduce.py:145] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (VllmWorkerProcess pid=10542) WARNING 07-06 22:09:39 custom_all_reduce.py:145] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. INFO 07-06 22:09:39 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_b99a0acb'), local_subscribe_port=57279, remote_subscribe_port=None) INFO 07-06 22:09:39 model_runner.py:1110] Starting to load model /home/user/Desktop/DeepSeek-R1-Distill-Qwen-32B... (VllmWorkerProcess pid=10542) INFO 07-06 22:09:39 model_runner.py:1110] Starting to load model /home/user/Desktop/DeepSeek-R1-Distill-Qwen-32B... ERROR 07-06 22:09:39 engine.py:400] CUDA out of memory. Tried to allocate 136.00 MiB. GPU 0 has a total capacity of 21.49 GiB of which 43.50 MiB is free. Including non-PyTorch memory, this process has 21.44 GiB memory in use. Of the allocated memory 21.19 GiB is allocated by PyTorch, and 20.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) ERROR 07-06 22:09:39 engine.py:400] Traceback (most recent call last): ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine ERROR 07-06 22:09:39 engine.py:400] engine = MQLLMEngine.from_engine_args(engine_args=engine_args, ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 124, in from_engine_args ERROR 07-06 22:09:39 engine.py:400] return cls(ipc_path=ipc_path, ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 76, in __init__ ERROR 07-06 22:09:39 engine.py:400] self.engine = LLMEngine(*args, **kwargs) ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 273, in __init__ ERROR 07-06 22:09:39 engine.py:400] self.model_executor = executor_class(vllm_config=vllm_config, ) ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 271, in __init__ ERROR 07-06 22:09:39 engine.py:400] super().__init__(*args, **kwargs) ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in __init__ ERROR 07-06 22:09:39 engine.py:400] self._init_executor() ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 125, in _init_executor ERROR 07-06 22:09:39 engine.py:400] self._run_workers("load_model", ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers ERROR 07-06 22:09:39 engine.py:400] driver_worker_output = run_method(self.driver_worker, sent_method, ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/utils.py", line 2196, in run_method ERROR 07-06 22:09:39 engine.py:400] return func(*args, **kwargs) ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model ERROR 07-06 22:09:39 engine.py:400] self.model_runner.load_model() ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1112, in load_model ERROR 07-06 22:09:39 engine.py:400] self.model = get_model(vllm_config=self.vllm_config) ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", 
line 14, in get_model ERROR 07-06 22:09:39 engine.py:400] return loader.load_model(vllm_config=vllm_config) ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 406, in load_model ERROR 07-06 22:09:39 engine.py:400] model = _initialize_model(vllm_config=vllm_config) ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model ERROR 07-06 22:09:39 engine.py:400] return model_class(vllm_config=vllm_config, prefix=prefix) ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 453, in __init__ ERROR 07-06 22:09:39 engine.py:400] self.model = Qwen2Model(vllm_config=vllm_config, ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 151, in __init__ ERROR 07-06 22:09:39 engine.py:400] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs) ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 307, in __init__ ERROR 07-06 22:09:39 engine.py:400] self.start_layer, self.end_layer, self.layers = make_layers( ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 557, in make_layers ERROR 07-06 22:09:39 engine.py:400] [PPMissingLayer() for _ in range(start_layer)] + [ ERROR 07-06 22:09:39 engine.py:400] ^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 558, in ERROR 07-06 22:09:39 engine.py:400] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}")) ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 309, in <lambda> ERROR 07-06 22:09:39 engine.py:400] lambda prefix: Qwen2DecoderLayer(config=config, ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 220, in __init__ ERROR 07-06 22:09:39 engine.py:400] self.mlp = Qwen2MLP( ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 82, in __init__ ERROR 07-06 22:09:39 engine.py:400] self.down_proj = RowParallelLinear( ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 1062, in __init__ ERROR 07-06 22:09:39 engine.py:400] self.quant_method.create_weights( ERROR 07-06 22:09:39 
engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 129, in create_weights ERROR 07-06 22:09:39 engine.py:400] weight = Parameter(torch.empty(sum(output_partition_sizes), ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/torch/utils/_device.py", line 106, in __torch_function__ ERROR 07-06 22:09:39 engine.py:400] return func(*args, **kwargs) ERROR 07-06 22:09:39 engine.py:400] ^^^^^^^^^^^^^^^^^^^^^ ERROR 07-06 22:09:39 engine.py:400] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB. GPU 0 has a total capacity of 21.49 GiB of which 43.50 MiB is free. Including non-PyTorch memory, this process has 21.44 GiB memory in use. Of the allocated memory 21.19 GiB is allocated by PyTorch, and 20.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) Process SpawnProcess-1: INFO 07-06 22:09:39 multiproc_worker_utils.py:128] Killing local vLLM worker processes Traceback (most recent call last): File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 402, in run_mp_engine raise e File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine engine = MQLLMEngine.from_engine_args(engine_args=engine_args, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 124, in from_engine_args return cls(ipc_path=ipc_path, ^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 76, in __init__ self.engine = LLMEngine(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 273, in __init__ self.model_executor = executor_class(vllm_config=vllm_config, ) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 271, in __init__ super().__init__(*args, **kwargs) File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in __init__ self._init_executor() File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 125, in _init_executor self._run_workers("load_model", File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers driver_worker_output = run_method(self.driver_worker, sent_method, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/utils.py", line 2196, in run_method return func(*args, **kwargs) 
^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/worker/worker.py", line 183, in load_model self.model_runner.load_model() File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1112, in load_model self.model = get_model(vllm_config=self.vllm_config) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model return loader.load_model(vllm_config=vllm_config) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 406, in load_model model = _initialize_model(vllm_config=vllm_config) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model return model_class(vllm_config=vllm_config, prefix=prefix) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 453, in __init__ self.model = Qwen2Model(vllm_config=vllm_config, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 151, in __init__ old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs) File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 307, in __init__ self.start_layer, self.end_layer, self.layers = make_layers( ^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 557, in make_layers [PPMissingLayer() for _ in range(start_layer)] + [ ^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 558, in maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}")) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 309, in <lambda> lambda prefix: Qwen2DecoderLayer(config=config, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 220, in __init__ self.mlp = Qwen2MLP( ^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/models/qwen2.py", line 82, in __init__ self.down_proj = RowParallelLinear( ^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 1062, in __init__ self.quant_method.create_weights( File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 129, in create_weights weight = Parameter(torch.empty(sum(output_partition_sizes), ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/torch/utils/_device.py", line 106, in __torch_function__ return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 136.00 MiB. GPU 0 has a total capacity of 21.49 GiB of which 43.50 MiB is free. Including non-PyTorch memory, this process has 21.44 GiB memory in use. 
Of the allocated memory 21.19 GiB is allocated by PyTorch, and 20.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) [rank0]:[W706 22:09:40.345098611 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) Traceback (most recent call last): File "/root/miniconda3/envs/deepseek_vllm/bin/vllm", line 8, in <module> sys.exit(main()) ^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 73, in main args.dispatch_function(args) File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 34, in cmd uvloop.run(run_server(args)) File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run return runner.run(wrapper()) ^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper return await main ^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 947, in run_server async with build_async_engine_client(args) as engine_client: File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/contextlib.py", line 210, in __aenter__ return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/root/miniconda3/envs/deepseek_vllm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 233, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start. See stack trace for the root cause. (deepseek_vllm) root@user-X99:/home/user/Desktop# /root/miniconda3/envs/deepseek_vllm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' (deepseek_vllm) root@user-X99:/home/user/Desktop#
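For this vLLM case the OOM happens while loading DeepSeek-R1-Distill-Qwen-32B: the bf16 weights alone are roughly 64 GB, which two ~21.5 GB GPUs cannot hold, so trimming the KV cache is not enough on its own. A minimal sketch of the memory-related knobs via the Python API (the same options exist as `vllm serve` flags); the quantized checkpoint path is a placeholder:

```python
# Memory-related options when the model barely fits (or does not fit) on the GPUs.
from vllm import LLM

llm = LLM(
    model="/path/to/DeepSeek-R1-Distill-Qwen-32B-AWQ",  # placeholder quantized checkpoint
    tensor_parallel_size=2,        # split weights across both GPUs
    gpu_memory_utilization=0.90,   # fraction of each GPU vLLM may claim
    max_model_len=8192,            # shorter context -> smaller KV cache
    enforce_eager=True,            # skip CUDA graphs to save some memory
)
```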
