UPSTREAM PR #17584: Add support for CUMSUM and TRI for CUDA. #355
base: main
Conversation
Performance Analysis Summary: PR #355

Overview

PR #355 adds CUDA backend support for CUMSUM and TRI operations through new kernel implementations. The changes introduce 315 lines of new code across 7 files without modifying existing functionality.

Performance Impact

No performance changes were detected. All 16 analyzed binaries show a 0.0% change in power consumption, and no functions exhibit measurable Response Time or Throughput Time variations between versions.

Inference Performance: The core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. Token throughput is unaffected, as the new operations are not yet exercised in current workloads.

Power Consumption: All binaries maintain identical energy profiles.
Code Implementation

The PR implements two new CUDA operations:

CUMSUM: Computes the cumulative sum using a two-phase algorithm: warp-level prefix sums followed by inter-warp accumulation. Supports F32, F16, and BF16 types.

TRI: Applies triangular matrix masking for attention mechanisms, using a simple per-element comparison with configurable mask types (lower/upper, with an optional diagonal).

Integration: Both operations are registered in the CUDA backend dispatch system and declared as supported operations. The implementation follows existing GGML CUDA patterns, with proper type handling and bounds checking.

Compilation Issue: The half2 warp_prefix_inclusive_sum specialization contains a type declaration error that will prevent compilation when FP16 is enabled.

Conclusion

This PR extends the CUDA backend without impacting existing performance. The new operations enable future model architectures that require cumulative-sum and triangular-masking operations to run fully on CUDA without CPU fallback.
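To make the two operations concrete, here is a minimal CPU-side C++ sketch of the semantics described above: CUMSUM as a two-phase scheme (a prefix sum inside each fixed-size "warp" group, then propagation of the accumulated group totals), and TRI as a per-element triangular mask. This is an illustrative model of the behavior, not the PR's CUDA kernel code; the function and parameter names (cumsum_two_phase, tri_mask, keep_diag) are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the two-phase CUMSUM scheme: phase 1 computes an inclusive
// prefix sum within each "warp"-sized group; phase 2 adds the running
// total (carry) of all preceding groups. On the GPU, phase 1 would use
// warp-level shuffles and phase 2 inter-warp accumulation.
std::vector<float> cumsum_two_phase(const std::vector<float>& x, std::size_t warp = 32) {
    std::vector<float> out(x.size());
    float carry = 0.0f;                         // total of all preceding groups
    for (std::size_t base = 0; base < x.size(); base += warp) {
        const std::size_t end = std::min(base + warp, x.size());
        float acc = 0.0f;                       // intra-group inclusive prefix sum
        for (std::size_t i = base; i < end; ++i) {
            acc += x[i];
            out[i] = acc + carry;
        }
        carry += acc;                           // phase 2: propagate group total
    }
    return out;
}

// Sketch of TRI masking: keep elements on one side of the diagonal and
// zero the rest. `lower` selects lower vs. upper triangle; `keep_diag`
// mirrors the optional-diagonal mask variant mentioned in the PR.
void tri_mask(std::vector<float>& m, int rows, int cols, bool lower, bool keep_diag) {
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            const bool keep = lower ? (c < r || (keep_diag && c == r))
                                    : (c > r || (keep_diag && c == r));
            if (!keep) m[static_cast<std::size_t>(r) * cols + c] = 0.0f;
        }
    }
}
```

For a causal-attention mask, tri_mask with lower = true and keep_diag = true zeroes every element above the diagonal, which is the typical pattern these kernels would serve without a CPU fallback.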
Mirrored from ggml-org/llama.cpp#17584
Extracted and adapted kernels by @gabe-l-hart from ggml-org/llama.cpp#16623