CANN: Multi-stream support #19284
Draft
hipudding wants to merge 7 commits into
Implement the ggml_backend_cann_graph_optimize function for the CANN backend, ported from the Vulkan backend (PR ggml-org#15489 and ggml-org#15850).

Key changes:
- Add graph optimization to reorder nodes based on dependency analysis
- Group non-dependent nodes together for potential parallel execution
- Preserve fusion patterns (RMS_NORM+MUL, MUL_MAT+ADD, ADD+RMS_NORM)
- Add GGML_CANN_DISABLE_GRAPH_OPTIMIZE env var to disable optimization

This is the first step toward multi-stream parallel execution on Ascend NPU.
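The reordering idea in this commit can be illustrated with a minimal sketch. It is not the code from the PR: `Node`, `srcs`, and `reorder_by_dependency` are hypothetical names, and the real function works on ggml graph nodes rather than index lists. The sketch emits, round by round, every node whose inputs were all scheduled in earlier rounds, so nodes within one round do not depend on each other and end up adjacent.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a graph node: `srcs` holds the indices of the
// nodes whose outputs this node reads.
struct Node {
    std::vector<int> srcs;
};

// Returns a new execution order (indices into `nodes`). Each round collects
// the nodes whose inputs were all scheduled in previous rounds, so the nodes
// inside one round are mutually independent.
std::vector<int> reorder_by_dependency(const std::vector<Node>& nodes) {
    const std::size_t n = nodes.size();
    std::vector<bool> done(n, false);
    std::vector<int> order;
    order.reserve(n);

    while (order.size() < n) {
        std::vector<int> group;
        for (std::size_t i = 0; i < n; ++i) {
            if (done[i]) continue;
            bool ready = true;
            for (int s : nodes[i].srcs) {
                if (!done[static_cast<std::size_t>(s)]) { ready = false; break; }
            }
            if (ready) group.push_back(static_cast<int>(i));
        }
        if (group.empty()) break;  // cyclic or malformed input: stop instead of spinning
        // Mark the group as done only after the scan, so members of the same
        // group never count as each other's satisfied inputs.
        for (int i : group) {
            done[static_cast<std::size_t>(i)] = true;
            order.push_back(i);
        }
    }
    return order;
}
```

For a graph where node 1 reads node 0, node 3 reads node 2, and node 4 reads nodes 1 and 3, the independent pairs (0, 2) and (1, 3) are placed next to each other, which is what makes them candidates for parallel streams later.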
- Replace tensor-pointer-based dependency tracking with memory-address-based tracking
- Use std::map<void*, int> to track pending writes per stream
- Implement smart stream selection:
  - No dependencies: round-robin distribution
  - Single dependency: execute on same stream (avoid sync overhead)
  - Multiple dependencies: sync all streams
- Add WAW (Write-After-Write) hazard detection
- Fix output corruption issue when using multi-stream execution

Enable with: GGML_CANN_MULTI_STREAM=1
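The stream selection rules listed above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `StreamScheduler` and `select` are hypothetical names, streams are plain integers, and the actual synchronization call is left to the caller via `sync_needed`. Only the `std::map<void*, int>` tracking structure is taken from the commit message. Falling back to stream 0 after a full sync is an arbitrary choice made for this sketch.

```cpp
#include <map>
#include <set>
#include <vector>

// Hypothetical scheduler: maps each written address to the stream that still
// has a write to it in flight.
struct StreamScheduler {
    int num_streams;
    int next_rr = 0;                      // next stream for round-robin placement
    std::map<void*, int> pending_writes;  // address -> stream with a pending write

    explicit StreamScheduler(int n) : num_streams(n) {}

    // Picks a stream for a node that reads `inputs` and writes `output`.
    // Sets `sync_needed` when the caller must synchronize all streams first.
    int select(const std::vector<void*>& inputs, void* output, bool& sync_needed) {
        std::set<int> deps;
        for (void* p : inputs) {  // read-after-write hazards
            auto it = pending_writes.find(p);
            if (it != pending_writes.end()) deps.insert(it->second);
        }
        auto w = pending_writes.find(output);  // write-after-write hazard
        if (w != pending_writes.end()) deps.insert(w->second);

        int stream = 0;
        sync_needed = false;
        if (deps.empty()) {
            // No dependencies: distribute round-robin.
            stream = next_rr;
            next_rr = (next_rr + 1) % num_streams;
        } else if (deps.size() == 1) {
            // Single dependency: stay on that stream, in-order execution
            // makes an explicit sync unnecessary.
            stream = *deps.begin();
        } else {
            // Multiple dependencies: sync everything, after which no write
            // is pending any more. Stream 0 is an arbitrary pick here.
            sync_needed = true;
            pending_writes.clear();
            stream = 0;
        }
        pending_writes[output] = stream;
        return stream;
    }
};
```

Tracking by address rather than by tensor pointer means two tensors that alias the same buffer (for example a view and its parent) are treated as the same resource, which is presumably what the output-corruption fix relies on.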
When GGML_CANN_MULTI_STREAM=1 is set, ACL graph capture/execution must be disabled, since the two modes are incompatible. The previous code had a bug where the prefill_use_graph check would overwrite use_cann_graph after it had been set to false for multi-stream mode. Fix by wrapping the prefill_use_graph check inside if (use_cann_graph), so that it only runs when ACL graph is not already disabled.
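The shape of this fix can be shown with a small sketch. The surrounding code is not part of this page, so everything except the names use_cann_graph and prefill_use_graph is an assumption: `GraphModeInputs`, `is_prefill`, and `resolve_use_cann_graph` are hypothetical, and the exact condition the real prefill check evaluates may differ.

```cpp
// Hypothetical inputs to the decision; only `prefill_use_graph` is a name
// taken from the commit message.
struct GraphModeInputs {
    bool multi_stream;       // GGML_CANN_MULTI_STREAM is enabled
    bool is_prefill;         // current graph is a prompt-processing graph
    bool prefill_use_graph;  // ACL graph is allowed during prefill
};

// Decides whether ACL graph capture/execution is used for this graph.
bool resolve_use_cann_graph(const GraphModeInputs& in) {
    bool use_cann_graph = true;

    // Multi-stream and ACL graph are incompatible, so multi-stream wins here.
    if (in.multi_stream) {
        use_cann_graph = false;
    }

    // The guard is the fix: without it, the assignment below could set
    // use_cann_graph back to true even though multi-stream disabled it.
    if (use_cann_graph) {
        if (in.is_prefill) {
            use_cann_graph = in.prefill_use_graph;
        }
    }
    return use_cann_graph;
}
```

With the guard in place, the multi-stream decision is final, and the prefill setting can only narrow the result, never re-enable ACL graph.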
- Use parse_bool() for GGML_CANN_MULTI_STREAM environment variable parsing, consistent with other env var handling
- Only synchronize dependent streams instead of all streams when a node has multiple dependencies, reducing sync overhead
- Performance improvement: ~9% faster prompt processing on 0.5B model (1838 t/s vs 1688 t/s with ACL graph disabled)
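Both changes in this commit are small enough to sketch. The helper below is a stand-in written for illustration, not the backend's parse_bool(): the set of accepted spellings and the case-insensitive matching are assumptions, as are the names `parse_bool_sketch`, `env_flag`, and `streams_to_wait_on`.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <set>
#include <string>

// Stand-in for a boolean env parser. The accepted spellings are an assumption.
bool parse_bool_sketch(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return value == "1" || value == "on" || value == "yes" || value == "true";
}

// Reads an environment variable as a flag; an unset variable means "off".
bool env_flag(const char* name) {
    const char* raw = std::getenv(name);
    return raw != nullptr && parse_bool_sketch(raw);
}

// Given the streams that hold a node's pending inputs and the stream the node
// will run on, returns only the streams that actually need to be waited on.
// Work already queued on `target` is ordered before the node by the stream itself.
std::set<int> streams_to_wait_on(const std::set<int>& dep_streams, int target) {
    std::set<int> out = dep_streams;
    out.erase(target);
    return out;
}
```

A node with inputs pending on streams 0, 2, and 3 that is placed on stream 2 would wait on streams 0 and 3 only, leaving any unrelated streams free to keep running.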
- Add operator_fusion_enabled flag to ggml_backend_cann_context
- Implement conflict detection in constructor:
  * ACL graph mode disables multi-stream (higher performance)
  * Multi-stream mode disables operator fusion (low benefit)
- Remove multi-stream fusion code (fusion disabled in multi-stream)
- Keep fusion functionality in single-stream mode
- Remove redundant multi_stream_enabled check in graph_compute
- Fix unused variable warning (sync_all_to_stream)
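The precedence rules this commit describes can be sketched as a constructor that resolves the three flags once. `ContextFlags` is a hypothetical stand-in for the relevant part of ggml_backend_cann_context; only the member names operator_fusion_enabled and multi_stream_enabled come from the commit message, and `acl_graph_enabled` is an assumed name.

```cpp
// Hypothetical holder for the mode flags; the real context has many more members.
struct ContextFlags {
    bool acl_graph_enabled;
    bool multi_stream_enabled;
    bool operator_fusion_enabled;

    ContextFlags(bool want_acl_graph, bool want_multi_stream, bool want_fusion)
        : acl_graph_enabled(want_acl_graph),
          multi_stream_enabled(want_multi_stream),
          operator_fusion_enabled(want_fusion) {
        // ACL graph mode takes priority over multi-stream.
        if (acl_graph_enabled) {
            multi_stream_enabled = false;
        }
        // Fusion is kept only for single-stream execution.
        if (multi_stream_enabled) {
            operator_fusion_enabled = false;
        }
    }
};
```

Resolving conflicts in one place means later code, such as graph_compute, can read the flags directly without re-checking combinations, which matches the removal of the redundant multi_stream_enabled check.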
Remove all operator fusion pattern detection logic from graph optimization to focus on reducing dependencies between operators in multi-stream scenarios.

Key changes:
- Remove fusion pattern matching for RMS_NORM+MUL, MUL_MAT+ADD, etc.
- Remove match_pattern and keep_pattern helper functions
- Simplify to two-pass approach: real nodes first, then view nodes
- Focus on dependency analysis for better parallelism
- Reduce code complexity by ~47% (235 lines -> 125 lines)

This approach is inspired by the Vulkan backend implementation and prioritizes multi-stream parallelism over operator fusion, as fusion provides minimal performance benefits in the CANN backend.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
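One possible reading of the two-pass approach is sketched below. How the PR's loop is actually organized is not shown on this page, so this is an assumption throughout: `GNode`, `is_view`, and `two_pass_order` are hypothetical names. In each round, the first pass collects real (compute) nodes whose producers were scheduled in earlier rounds, and the second pass attaches view nodes whose producers are now scheduled.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical graph node: views do no work but still depend on their source.
struct GNode {
    bool is_view = false;
    std::vector<int> srcs;  // indices of producer nodes
};

std::vector<int> two_pass_order(const std::vector<GNode>& nodes) {
    const std::size_t n = nodes.size();
    std::vector<bool> used(n, false);
    std::vector<int> order;
    order.reserve(n);

    // True when every producer of node i has already been scheduled.
    auto ready = [&](std::size_t i) {
        for (int s : nodes[i].srcs) {
            if (!used[static_cast<std::size_t>(s)]) return false;
        }
        return true;
    };

    while (order.size() < n) {
        const std::size_t before = order.size();

        // Pass 1: real nodes that are ready. They are marked only after the
        // scan, so nodes picked in the same round stay independent.
        std::vector<std::size_t> real;
        for (std::size_t i = 0; i < n; ++i) {
            if (!used[i] && !nodes[i].is_view && ready(i)) real.push_back(i);
        }
        for (std::size_t i : real) {
            used[i] = true;
            order.push_back(static_cast<int>(i));
        }

        // Pass 2: view nodes whose producers are scheduled, including real
        // nodes from this round. Marking immediately lets chained views follow.
        for (std::size_t i = 0; i < n; ++i) {
            if (!used[i] && nodes[i].is_view && ready(i)) {
                used[i] = true;
                order.push_back(static_cast<int>(i));
            }
        }

        if (order.size() == before) break;  // no progress: malformed input
    }
    return order;
}
```

Handling views in a separate pass keeps them from splitting a group of independent compute nodes, since a view launches no work of its own.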