```
NPUDoublePsEmbFunctor Init Done. Table: UnirecEmbedding
[E compiler_depend.ts:280] call aclnnMatmul failed, detail:EZ9903: [PID: 543476] 2025-06-16-10:46:41.752.975
rtKernelLaunchWithFlagV2 failed: 107003
Solution: In this scenario, collect the plog when the fault occurs and locate the fault based on the plog.
TraceBack (most recent call last):
Kernel launch failed, stream is not in current ctx, stream_id=3.[FUNC:KernelLaunch][FILE:api_impl.cc][LINE:465]
rtKernelLaunchWithFlagV2 execute failed, reason=[stream not in current context][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
rtKernelLaunchWithFlagV2 failed: 107003
#### KernelLaunch failed: /usr/local/Ascend/ascend-toolkit/8.0.T60/opp/built-in/op_impl/ai_core/tbe//kernel/ascend910b/mem_set/MemSet_1a6864193b99ef93ef38616f04a712ab_high_performance.o
Memset Output Tensor failed
Kernel Run failed. opType: 3, MatMulV2
launch failed for MatMulV2, errno:361001.
[ERROR] 2025-06-16-10:46:41 (PID:543476, Device:0, RankID:-1) ERR01100 OPS call acl api failed
Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:84 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff5a3392617 in /usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7ff5a334d98d in /usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0xed1162 (0x7ff4fcad1162 in /usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: <unknown function> + 0x14c27bc (0x7ff4fd0c27bc in /usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: <unknown function> + 0x7365b7 (0x7ff4fc3365b7 in /usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: <unknown function> + 0x736a76 (0x7ff4fc336a76 in /usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: <unknown function> + 0x73468f (0x7ff4fc33468f in /usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: <unknown function> + 0xdbbf4 (0x7ff5736c7bf4 in /usr/local/miniconda3/lib/libstdc++.so.6)
frame #8: <unknown function> + 0x8937a (0x7ff5bcc8b37a in /lib64/libc.so.6)
frame #9: <unknown function> + 0x109e4c (0x7ff5bcd0be4c in /lib64/libc.so.6)
[W compiler_depend.ts:286] Warning: (function ExecFunc)
Traceback (most recent call last):
  File "/workspace/doupleps_emb_test.py", line 258, in <module>
    test_simple_trainning()
  File "/workspace/doupleps_emb_test.py", line 158, in test_simple_trainning
    my_out0, my_loss, my_out1 = _training_loop(my_model, lr, "unirec")
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/doupleps_emb_test.py", line 119, in _training_loop
    out = model(index)
          ^^^^^^^^^^^^
  File "/usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/doupleps_emb_test.py", line 44, in forward
    x0 = self.linear1(emb_out)
         ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/miniconda3/envs/torchrec/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnMatmul.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[ERROR] 2025-06-16-10:46:41 (PID:543476, Device:0, RankID:-1) ERR00100 PTA call acl api failed
[W compiler_depend.ts:305] Warning: NPU warning, error code is 107003[Error]:
[Error]: The stream is not in the current context.
        Check whether the context where the stream is located is the same as the current context.
EE9999: Inner Error!
EE9999: [PID: 543476] 2025-06-16-10:46:41.760.771 Stream destroy failed, stream is not in current ctx, stream_id=3.[FUNC:StreamDestroy][FILE:api_impl.cc][LINE:1104]
TraceBack (most recent call last):
rtStreamDestroyForce execute failed, reason=[stream not in current context][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
destroy stream force failed, runtime result = 107003[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
(function operator())
I0616 10:46:42.707563 543476 hbm_table_manager.h:75] NpuHbmSparseTableManager destructor called
Segmentation fault (core dumped)
```
### Resolving the aclnnMatmul call failure
The error codes `EZ9903` and `107003` are usually related to NPU environment configuration, context management, and the parameters passed to the matrix multiplication operator. In the log above the decisive hint is `stream is not in current ctx, stream_id=3`, which points at context/stream management. The analysis and fixes below cover both codes.
#### Causes of error code EZ9903
Error code `EZ9903` means the matrix multiplication did not complete successfully when `aclnnMatmul` was called. Possible causes include:
- The input tensor shapes do not match or the data types are not supported[^1].
- Device memory is insufficient, so the resources needed for the matrix computation cannot be allocated.
- The current context is not set up correctly, so the NPU stream is not in the current context[^2].
To narrow this down, check the following (a minimal device/context sanity check is sketched after this list):
- Make sure the input tensor dimensions satisfy the matrix multiplication contract (the inner dimensions must match).
- Monitor device memory with `npu-smi info` (the Ascend counterpart of `nvidia-smi`) and confirm there is enough free memory.
- Verify that the device and context are set correctly, for example via `acl.rt.set_device` and `acl.rt.set_context`[^1].
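The snippet below is a minimal sketch of that device/context check using the pyACL runtime bindings (the `acl` package shipped with CANN). The device id `0` and the `check` helper are assumptions for illustration, not taken from the failing script.
```python
import acl

DEVICE_ID = 0  # assumed device id; use the same device as the training job


def check(ret, msg):
    # pyACL calls return an integer status code; 0 means success
    if ret != 0:
        raise RuntimeError(f"{msg} failed, ret={ret}")


check(acl.init(), "acl.init")
check(acl.rt.set_device(DEVICE_ID), "acl.rt.set_device")

# Create an explicit context and make it current before creating any stream,
# so the stream is bound to the same context that later launches kernels.
context, ret = acl.rt.create_context(DEVICE_ID)
check(ret, "acl.rt.create_context")
check(acl.rt.set_context(context), "acl.rt.set_context")

stream, ret = acl.rt.create_stream()
check(ret, "acl.rt.create_stream")

# Tear down in reverse order: stream first, then context, then device.
check(acl.rt.destroy_stream(stream), "acl.rt.destroy_stream")
check(acl.rt.destroy_context(context), "acl.rt.destroy_context")
check(acl.rt.reset_device(DEVICE_ID), "acl.rt.reset_device")
check(acl.finalize(), "acl.finalize")
print("context/stream setup and teardown succeeded on device", DEVICE_ID)
```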
#### Causes of error code 107003
Error code `107003` is usually related to context management of NPU streams. Specific causes include:
- The stream being used was created under a different context than the one that is current when the kernel is launched.
- The context failed to initialize or was not released correctly.
- The parameters passed at kernel launch exceed hardware limits.
To resolve this, try the following (a sketch for obtaining an accurate stack trace follows this list):
- Make sure the context and stream are initialized correctly, and in the same context, before `aclnnMatmul` is called.
- Check that context resources are released correctly to avoid leaks; in the log above the forced stream destroy fails with the same 107003 for the same reason.
- As suggested in reference [^2], review the environment variable settings in `.bashrc` so that only the required libraries are loaded, avoiding conflicts.
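Because the operator is launched asynchronously, the Python stack trace above stops at `F.linear` and may not show the real origin of the fault. The runtime message itself recommends setting `ASCEND_LAUNCH_BLOCKING=1`; a minimal way to apply it for a debugging run is sketched below (placing it at the top of the failing script is just one convenient option, not a requirement):
```python
# Add at the very top of doupleps_emb_test.py, before torch / torch_npu are
# imported, so every kernel launch runs synchronously and the Python stack
# trace points at the operator that actually failed. Debugging only: this
# slows training down noticeably.
import os

os.environ["ASCEND_LAUNCH_BLOCKING"] = "1"
```
Exporting the variable in the shell before launching the job works just as well.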
#### Example: verifying the context setup for MatMulV2
Below is a simple check, written with torch_npu (the stack the failing script runs on, per the trace above), to verify that the device, context, and stream used for the matmul path are consistent. On this setup `torch.matmul` is executed through `aclnnMatmul` / `MatMulV2`, the operators named in the log; the device id `npu:0` is an assumption.
```python
import torch
import torch_npu  # registers the NPU device and the ACL-backed operators

# Assumed device id; use the same device as the failing training job.
device = "npu:0"

# Bind this process to the device so torch_npu creates a single context and
# the default stream belongs to that context.
torch.npu.set_device(device)

# Inputs with the same shapes as the original example
input_a = torch.rand(3, 4, dtype=torch.float32, device=device)
input_b = torch.rand(4, 5, dtype=torch.float32, device=device)

try:
    # On this setup the matmul is dispatched to aclnnMatmul / MatMulV2
    output = torch.matmul(input_a, input_b)
    torch.npu.synchronize()  # force the asynchronous launch to finish here
    print("matmul on NPU succeeded:", tuple(output.shape))
except RuntimeError as e:
    print(f"matmul on NPU failed: {e}")
```
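If this standalone check passes while the training script still fails, a likely cause is that some component of the script (for example the functor that logs `NPUDoublePsEmbFunctor Init Done`) creates its own stream or context before the framework's device context is current; creating every stream only after `torch.npu.set_device` has run in the same process usually avoids the `stream not in current context` failure.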
#### Driver version and library compatibility
If the problem persists, the driver or runtime libraries may be incompatible with the hardware or with the installed torch_npu. As suggested in reference [^3], upgrade the NPU driver to the latest version and rebuild the environment so that the driver, CANN toolkit, and torch_npu versions match. A quick way to collect the installed versions for comparison is sketched below.
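A minimal sketch for gathering the version information: the toolkit root comes from the kernel path in the error log, and a standard CANN install layout is assumed.
```python
import os

import torch
import torch_npu

# Framework side
print("torch:", torch.__version__)
print("torch_npu:", torch_npu.__version__)

# CANN toolkit side: list the toolkit versions installed under the root that
# appears in the kernel path of the error log (adjust if installed elsewhere).
toolkit_root = "/usr/local/Ascend/ascend-toolkit"
if os.path.isdir(toolkit_root):
    print("installed CANN toolkits:", sorted(os.listdir(toolkit_root)))
```
The driver version itself can be read from the output of `npu-smi info`.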