CANN: Multi-stream support #19284

Draft
hipudding wants to merge 7 commits into ggml-org:master from hipudding:mul_stream

@hipudding
Contributor

Make sure to read the contributing guidelines before submitting a PR

@hipudding self-assigned this on Feb 3, 2026
@hipudding added the "Ascend NPU" label (issues specific to Ascend NPUs) on Feb 3, 2026
@github-actions (bot) added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Feb 3, 2026
@hipudding closed this on Feb 3, 2026
@hipudding reopened this on Feb 3, 2026
Implement the ggml_backend_cann_graph_optimize function for the CANN backend,
ported from the Vulkan backend (PRs ggml-org#15489 and ggml-org#15850).

Key changes:
- Add graph optimization to reorder nodes based on dependency analysis
- Group non-dependent nodes together for potential parallel execution
- Preserve fusion patterns (RMS_NORM+MUL, MUL_MAT+ADD, ADD+RMS_NORM)
- Add GGML_CANN_DISABLE_GRAPH_OPTIMIZE env var to disable optimization

This is the first step toward multi-stream parallel execution on Ascend NPU.
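The reordering idea in this commit can be illustrated with a small sketch. It is not the PR's implementation (which is ported from the Vulkan backend); it is a simplified, level-based grouping. The `Node` type and the function name `group_independent_nodes` are hypothetical, and the sketch assumes the input is already in topological order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a graph node (not a ggml type):
// `srcs` holds the indices of the nodes whose outputs this node reads.
struct Node {
    std::vector<int> srcs;
};

// Returns an execution order (indices into `nodes`) in which mutually
// independent nodes sit next to each other. Assumes `nodes` is already in
// topological order, i.e. every src index is smaller than the node's index.
std::vector<int> group_independent_nodes(const std::vector<Node>& nodes) {
    // level[i] = length of the longest dependency chain ending at node i.
    std::vector<int> level(nodes.size(), 0);
    int max_level = 0;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        for (int s : nodes[i].srcs) {
            assert(s >= 0 && static_cast<std::size_t>(s) < i);  // topological input
            if (level[s] + 1 > level[i]) {
                level[i] = level[s] + 1;
            }
        }
        if (level[i] > max_level) {
            max_level = level[i];
        }
    }

    // Nodes on the same level cannot depend on each other, so emitting the
    // graph level by level groups them while keeping every dependency satisfied.
    std::vector<int> order;
    order.reserve(nodes.size());
    for (int l = 0; l <= max_level; ++l) {
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            if (level[i] == l) {
                order.push_back(static_cast<int>(i));
            }
        }
    }
    return order;
}
```

For a graph where node 1 reads node 0, node 3 reads node 2, and node 4 reads nodes 1 and 3, this yields the order 0, 2, 1, 3, 4: the two independent chains are interleaved so that adjacent nodes could, in principle, be issued in parallel.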

- Replace tensor-pointer-based dependency tracking with memory-address-based tracking
- Use std::map<void*, int> to track pending writes per stream
- Implement smart stream selection:
  - No dependencies: round-robin distribution
  - Single dependency: execute on same stream (avoid sync overhead)
  - Multiple dependencies: sync all streams
- Add WAW (Write-After-Write) hazard detection
- Fix output corruption issue when using multi-stream execution

Enable with: GGML_CANN_MULTI_STREAM=1
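The stream-selection rules listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the PR: `StreamPicker` and its members are hypothetical names, streams are plain integer ids, and the actual synchronization call is left to the caller (signalled through `need_sync`). Running a multi-dependency node on the lowest-numbered dependent stream is also an assumption of this sketch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical helper: chooses a stream for each operation based on which
// streams still have pending writes to the buffers the operation touches.
struct StreamPicker {
    int n_streams;
    int next_rr = 0;                      // round-robin cursor
    std::map<void*, int> pending_writes;  // buffer address -> stream of last write

    explicit StreamPicker(int n) : n_streams(n) { assert(n > 0); }

    // `srcs` are the buffers the op reads, `dst` is the buffer it writes.
    // Sets `need_sync` when the caller must synchronize all streams first.
    int pick(const std::vector<void*>& srcs, void* dst, bool& need_sync) {
        std::set<int> dep_streams;
        for (void* p : srcs) {  // read-after-write dependencies
            auto it = pending_writes.find(p);
            if (it != pending_writes.end()) dep_streams.insert(it->second);
        }
        auto w = pending_writes.find(dst);  // write-after-write hazard on dst
        if (w != pending_writes.end()) dep_streams.insert(w->second);

        int stream = 0;
        need_sync = false;
        if (dep_streams.empty()) {
            // No dependencies: distribute round-robin.
            stream = next_rr;
            next_rr = (next_rr + 1) % n_streams;
        } else if (dep_streams.size() == 1) {
            // Single dependency: stay on that stream, no sync needed.
            stream = *dep_streams.begin();
        } else {
            // Multiple dependencies: caller syncs all streams, after which
            // no write is pending any more.
            stream = *dep_streams.begin();
            need_sync = true;
            pending_writes.clear();
        }
        pending_writes[dst] = stream;
        return stream;
    }
};
```

Keying the map on the buffer address rather than the tensor pointer means that two tensors aliasing the same memory (for example, a view and its parent) are seen as the same dependency.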

When GGML_CANN_MULTI_STREAM=1 is set, ACL graph capture and execution must
be disabled because the two modes are incompatible. The previous code had a
bug: the prefill_use_graph check overwrote use_cann_graph after it had
already been set to false for multi-stream mode.

Fix by wrapping the prefill_use_graph check inside if (use_cann_graph)
so that it only runs when ACL graph has not already been disabled.
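The overwrite pattern described in this commit can be reconstructed in isolation. The sketch below is an assumption-laden illustration, not the backend's code: the function names and the `is_decode` parameter are hypothetical, standing in for whatever condition the prefill check actually evaluates.

```cpp
#include <cassert>

// Pattern described as buggy: the prefill check assigns unconditionally, so it
// can switch ACL graph back on after multi-stream mode switched it off.
bool use_cann_graph_buggy(bool multi_stream, bool prefill_use_graph, bool is_decode) {
    bool use_cann_graph = true;
    if (multi_stream) {
        use_cann_graph = false;
    }
    if (!prefill_use_graph) {
        use_cann_graph = is_decode;  // overwrites the `false` set above
    }
    return use_cann_graph;
}

// Fixed pattern: the prefill check only runs while ACL graph is still enabled.
bool use_cann_graph_fixed(bool multi_stream, bool prefill_use_graph, bool is_decode) {
    bool use_cann_graph = true;
    if (multi_stream) {
        use_cann_graph = false;
    }
    if (use_cann_graph) {
        if (!prefill_use_graph) {
            use_cann_graph = is_decode;
        }
    }
    return use_cann_graph;
}

// Usage: multi-stream on, graphs for prefill off, currently decoding.
void self_check() {
    assert(use_cann_graph_buggy(true, false, true) == true);   // graph wrongly re-enabled
    assert(use_cann_graph_fixed(true, false, true) == false);  // stays disabled
    assert(use_cann_graph_fixed(false, false, true) == true);  // single-stream unchanged
    assert(use_cann_graph_fixed(false, false, false) == false);
}
```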

- Use parse_bool() for GGML_CANN_MULTI_STREAM environment variable
  parsing, consistent with other env var handling
- Only synchronize dependent streams instead of all streams when
  a node has multiple dependencies, reducing sync overhead
- Performance improvement: ~9% faster prompt processing on 0.5B model
  (1838 t/s vs 1688 t/s with ACL graph disabled)
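Synchronizing only the dependent streams amounts to computing, per node, the set of streams that hold a pending write to one of its inputs. A minimal sketch, assuming integer stream ids and a hypothetical helper name `streams_to_wait_on` (neither is taken from the PR):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Returns the streams the target stream has to wait on before running a node
// that reads `srcs`. Streams without a pending write to any input are left
// alone, and the target stream itself never needs to wait on itself.
std::set<int> streams_to_wait_on(const std::map<void*, int>& pending_writes,
                                 const std::vector<void*>& srcs,
                                 int target_stream) {
    assert(target_stream >= 0);
    std::set<int> deps;
    for (void* p : srcs) {
        auto it = pending_writes.find(p);
        if (it != pending_writes.end() && it->second != target_stream) {
            deps.insert(it->second);
        }
    }
    return deps;
}
```

With pending writes on streams 0, 2, and 3 and a node that reads only the buffers written on streams 0 and 2, a node placed on stream 0 waits on stream 2 alone; stream 3 keeps running undisturbed.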

- Add operator_fusion_enabled flag to ggml_backend_cann_context
- Implement conflict detection in constructor:
  * ACL graph mode disables multi-stream (higher performance)
  * Multi-stream mode disables operator fusion (low benefit)
- Remove multi-stream fusion code (fusion disabled in multi-stream)
- Keep fusion functionality in single-stream mode
- Remove redundant multi_stream_enabled check in graph_compute
- Fix unused variable warning (sync_all_to_stream)
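The conflict rules in this commit form a small precedence chain, which can be sketched as below. The `CannModes` struct and `resolve_conflicts` function are hypothetical; the PR performs the equivalent checks in the `ggml_backend_cann_context` constructor, whose exact code is not shown here.

```cpp
#include <cassert>

// Hypothetical bundle of the three feature switches discussed above.
struct CannModes {
    bool acl_graph;
    bool multi_stream;
    bool operator_fusion;
};

// Applies the stated precedence: ACL graph wins over multi-stream, and
// multi-stream (if it survives) turns operator fusion off.
CannModes resolve_conflicts(CannModes requested) {
    if (requested.acl_graph) {
        requested.multi_stream = false;
    }
    if (requested.multi_stream) {
        requested.operator_fusion = false;
    }
    return requested;
}

// Usage: everything requested at once resolves to ACL graph plus fusion.
void self_check() {
    CannModes all = resolve_conflicts({true, true, true});
    assert(all.acl_graph && !all.multi_stream && all.operator_fusion);

    CannModes ms = resolve_conflicts({false, true, true});
    assert(!ms.acl_graph && ms.multi_stream && !ms.operator_fusion);
}
```

The order of the two checks matters: evaluating the fusion rule first would disable fusion even in the case where ACL graph mode goes on to disable multi-stream.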
hipudding and others added 2 commits February 6, 2026 02:51
Remove all operator fusion pattern detection logic from graph optimization
to focus on reducing dependencies between operators in multi-stream scenarios.

Key changes:
- Remove fusion pattern matching for RMS_NORM+MUL, MUL_MAT+ADD, etc.
- Remove match_pattern and keep_pattern helper functions
- Simplify to two-pass approach: real nodes first, then view nodes
- Focus on dependency analysis for better parallelism
- Reduce code size by ~47% (235 lines -> 125 lines)

This approach is inspired by the Vulkan backend implementation and
prioritizes multi-stream parallelism over operator fusion, as fusion
provides minimal performance benefits in the CANN backend.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
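The "real nodes first, then view nodes" idea can be sketched as a round-based scheduler. This is only an illustration under assumptions, not the simplified function from the commit: `GraphNode`, `is_view`, and `two_pass_order` are hypothetical names, and the input is assumed to be in topological order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical node: `srcs` are indices of its inputs, `is_view` marks nodes
// that only alias existing memory and launch no computation.
struct GraphNode {
    std::vector<int> srcs;
    bool is_view;
};

// Builds the order round by round. In each round, pass 0 collects every ready
// non-view node and pass 1 collects every ready view node, where "ready" means
// all inputs were scheduled in an earlier round.
std::vector<int> two_pass_order(const std::vector<GraphNode>& nodes) {
    const int n = static_cast<int>(nodes.size());
    std::vector<bool> done(n, false);
    std::vector<int> order;
    while (static_cast<int>(order.size()) < n) {
        std::vector<int> group;
        for (int pass = 0; pass < 2; ++pass) {
            for (int i = 0; i < n; ++i) {
                if (done[i] || nodes[i].is_view != (pass == 1)) continue;
                bool ready = true;
                for (int s : nodes[i].srcs) {
                    assert(s >= 0 && s < n);
                    if (!done[s]) { ready = false; break; }
                }
                if (ready) group.push_back(i);
            }
        }
        if (group.empty()) break;  // guards against cyclic or malformed input
        for (int i : group) {
            done[i] = true;
            order.push_back(i);
        }
    }
    return order;
}
```

Nodes within one round have no dependencies on each other, so each round is a candidate set for distribution across streams.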