
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17584

Extracted and adapted kernels by @gabe-l-hart from ggml-org/llama.cpp#16623

@loci-agentic-ai

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #355 - CUDA CUMSUM and TRI Operations

Overview

PR #355 introduces CUDA kernel implementations for cumulative sum (CUMSUM) and triangular matrix (TRI) operations, adding 315 lines across 7 files. Analysis shows no measurable performance impact on existing inference paths, with 0% power consumption change across all 16 binaries.

Key Findings

Performance Impact on Inference:
No functions in the core inference path (llama_decode, llama_encode, llama_tokenize) show response time or throughput changes. The new code paths are purely additive CUDA operations that remain inactive in standard CPU-based inference workloads. Tokens-per-second throughput is unaffected, as the tokenization and decode functions retain identical execution characteristics.

Implementation Details:
The PR adds two new CUDA operations with supporting warp-level primitives. The TRI operation implements element-wise triangular matrix extraction with coalesced memory access patterns. The CUMSUM operation uses a two-phase algorithm: warp-level prefix sums followed by cross-warp accumulation via shared memory. Both operations support F32, F16, and BF16 data types.
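The two-phase scheme can be illustrated with a minimal kernel sketch (a single thread block, F32 only; the kernel name, launch configuration, and variable names are assumptions for illustration, not the PR's actual code):

```cuda
// Hypothetical single-block CUMSUM sketch: phase 1 does an inclusive
// prefix sum within each 32-lane warp via shuffle intrinsics; phase 2
// adds the totals of all preceding warps through shared memory.
__global__ void cumsum_f32_sketch(const float * src, float * dst, int n) {
    __shared__ float warp_sums[32];   // one partial sum per warp
    const int tid  = threadIdx.x;
    const int lane = tid % 32;
    const int warp = tid / 32;

    float x = (tid < n) ? src[tid] : 0.0f;

    // Phase 1: warp-level inclusive prefix sum.
    #pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1) {
        const float y = __shfl_up_sync(0xffffffff, x, offset);
        if (lane >= offset) x += y;
    }
    if (lane == 31) warp_sums[warp] = x;   // warp total = last lane's value
    __syncthreads();

    // Phase 2: accumulate the sums of all preceding warps.
    float base = 0.0f;
    for (int w = 0; w < warp; ++w) base += warp_sums[w];

    if (tid < n) dst[tid] = x + base;
}
```

The real implementation additionally handles F16/BF16 and multi-row tensors; this sketch only conveys the two-phase structure.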

Code Integration:
Changes integrate cleanly into the existing CUDA backend dispatch system in ggml-cuda.cu. Three new warp_prefix_inclusive_sum template functions were added to common.cuh as reusable primitives. The implementation includes test coverage for both operations with various tensor dimensions.
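A warp-level inclusive prefix sum built on shuffle intrinsics typically takes the following shape (a sketch only; the actual warp_prefix_inclusive_sum signatures and specializations in common.cuh may differ):

```cuda
// Hypothetical reusable primitive: each lane ends up with the sum of its
// own value plus all lower-numbered lanes in the warp. Runs in log2(32)
// = 5 shuffle steps with no shared memory.
template <typename T>
__device__ __forceinline__ T warp_prefix_inclusive_sum_sketch(T x) {
    const int lane = threadIdx.x % 32;
    #pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1) {
        const T y = __shfl_up_sync(0xffffffff, x, offset);
        if (lane >= offset) x += y;   // lanes below 'offset' keep their value
    }
    return x;
}
```

Half-precision specializations (e.g. for half2) follow the same pattern but must declare the shuffled temporary with the correct vector type, which is where the compilation issue noted below arises.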

Power Consumption:
All binaries show 0 nJ change in estimated power consumption. The most power-intensive components remain llama-tts (224,623 nJ), llama-cvector-generator (220,236 nJ), and libllama.so (193,066 nJ). The new operations do not alter the computational workload of existing execution paths.

Technical Note:
The half2 specialization in common.cuh contains a compilation error (missing type declaration on line 40) that will prevent builds when FP16 is enabled. This affects CUDA devices with half-precision support but does not impact current CPU inference paths or performance metrics.

@loci-agentic-ai
Copy link

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #355

Overview

PR #355 adds CUDA backend support for CUMSUM and TRI operations through new kernel implementations. The changes introduce 315 lines of new code across 7 files without modifying existing functionality.

Performance Impact

No performance changes detected. All 16 analyzed binaries show 0.0% change in power consumption. No functions exhibit measurable Response Time or Throughput Time variations between versions.

Inference Performance: The core inference functions (llama_decode, llama_encode, llama_tokenize) remain unmodified. Token throughput is unaffected as these new operations are not yet exercised in current workloads.

Power Consumption: All binaries maintain identical energy profiles:

  • libllama.so: 193,066 nJ (0% change)
  • llama-tts: 224,623 nJ (0% change)
  • llama-cvector-generator: 220,237 nJ (0% change)
  • llama-run: 191,888 nJ (0% change)
  • Remaining 12 binaries: 0% change

Code Implementation

The PR implements two new CUDA operations:

CUMSUM: Computes the cumulative sum using a two-phase algorithm: warp-level prefix sums followed by inter-warp accumulation. Supports F32, F16, and BF16 types.

TRI: Applies triangular matrix masking for attention mechanisms. Uses a simple per-element comparison with configurable mask types (lower/upper, with optional diagonal).
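The per-element comparison reduces to one branch per output element, as in this sketch for the lower-triangular case (the kernel name, parameter layout, and zero fill value are assumptions for illustration):

```cuda
// Hypothetical lower-triangular masking kernel: one thread per element,
// contiguous columns per thread block for coalesced loads and stores.
// with_diag selects whether the main diagonal is kept (col <= row)
// or excluded (col < row).
__global__ void tri_lower_f32_sketch(const float * src, float * dst,
                                     int ncols, int nrows, bool with_diag) {
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y;
    if (col >= ncols || row >= nrows) return;

    const int  i    = row * ncols + col;
    const bool keep = with_diag ? (col <= row) : (col < row);
    dst[i] = keep ? src[i] : 0.0f;
}
```

The upper-triangular variant only flips the comparison; no reduction or synchronization is needed, which is why TRI is considerably simpler than CUMSUM.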

Integration: Both operations are registered in the CUDA backend dispatch system and declared as supported operations. The implementation follows existing GGML CUDA patterns with proper type handling and bounds checking.

Compilation Issue: The half2 warp_prefix_inclusive_sum specialization contains a type declaration error that will prevent compilation when FP16 is enabled.

Conclusion

This PR extends CUDA backend capabilities without impacting existing performance. The new operations enable future model architectures requiring cumulative and triangular masking operations to run fully on CUDA without CPU fallback.

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from e4a4e1d to d0b408b on November 30, 2025 02:46