
Conversation

pwilkin (Collaborator) commented Nov 28, 2025

Extracted and adapted kernels by @gabe-l-hart from #16623

pwilkin requested a review from ggerganov as a code owner on November 28, 2025 23:15
github-actions bot added the testing, Nvidia GPU, and ggml labels on Nov 28, 2025
am17an (Collaborator) commented Nov 29, 2025

For cumsum we should use cub::DeviceScan (https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html) and use this kernel as a fallback.
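For reference, a minimal sketch of the two-call cub::DeviceScan pattern (size query first, then the actual scan). The wrapper name and the use of cudaMallocAsync are illustrative; in ggml-cuda the temporary storage would presumably come from the pool allocator instead:

```cpp
#include <cub/device/device_scan.cuh>

// Sketch: inclusive prefix sum over num_items floats using cub::DeviceScan.
// cumsum_f32_cub is a hypothetical helper, not an existing ggml-cuda function.
static void cumsum_f32_cub(const float * d_in, float * d_out, int num_items, cudaStream_t stream) {
    void * d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call with a null temp pointer only reports the required scratch size.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);

    // Second call performs the scan.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    cudaFreeAsync(d_temp_storage, stream);
}
```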

wsbagnsv1 commented Nov 29, 2025

I have a small optimization for the tri kernel (;
Since it's memory-bandwidth bound there isn't much room, but I think these should be real improvements, and the Nsight numbers show real gains (+18% scheduler utilization). The improved kernel also seems to have less jitter (~56% decrease), though I'm not 100% sure that's real; it could be run-to-run variation. It's not a big change anyway (;

Benchmark Results

1. llama.cpp benchmark (50 runs each)

| Device | Dataset | Metric | Old Kernel | New Kernel | Delta |
|---|---|---|---|---|---|
| Device 0 (RTX 4070 Ti) | Large (1024) | Bandwidth | 476.54 GB/s (±17.79) | 490.05 GB/s (±7.82) | +2.84% |
| Device 0 (RTX 4070 Ti) | Large (1024) | Time | 527.44 μs | 512.26 μs | -2.88% |
| Device 0 (RTX 4070 Ti) | Small (256) | Bandwidth | 1282.55 GB/s (±53.22) | 1333.17 GB/s (±29.37) | +3.95% |
| Device 0 (RTX 4070 Ti) | Small (256) | Time | 6.10 μs | 5.86 μs | -3.93% |
| Device 1 (RTX 2070) | Large (1024) | Bandwidth | 490.77 GB/s (±0.15) | 490.52 GB/s (±0.22) | -0.05% |
| Device 1 (RTX 2070) | Large (1024) | Time | 511.37 μs | 511.64 μs | +0.05% |
| Device 1 (RTX 2070) | Small (256) | Bandwidth | 356.65 GB/s (±4.47) | 361.48 GB/s (±7.81) | +1.35% |
| Device 1 (RTX 2070) | Small (256) | Time | 21.91 μs | 21.63 μs | -1.28% |

2. Profiler statistics, RTX 2070 (Nsight)

| Metric | Old Kernel | New Kernel | Delta |
|---|---|---|---|
| Eligible Warps / Scheduler | 0.390 | 0.460 | +17.95% |
| Warp Cycles / Instruction | 26.87 | 24.92 | -7.24% |
| Physical DRAM Speed | 406.65 GB/s | 406.42 GB/s | -0.05% |
| Executed Instructions | 24.6 M | 26.5 M | +7.44% |
@@ -1,16 +1,7 @@
 #include "tri.cuh"
 #include "ggml.h"
 
-// Triangle type comparison - determines which elements to keep
-__device__ static inline bool tri_compare(const int i, const int r, const ggml_tri_type type) {
-    switch (type) {
-        case GGML_TRI_TYPE_LOWER:      return i < r;
-        case GGML_TRI_TYPE_LOWER_DIAG: return i <= r;
-        case GGML_TRI_TYPE_UPPER:      return i > r;
-        case GGML_TRI_TYPE_UPPER_DIAG: return i >= r;
-        default: return false;
-    }
-}
+
 
 template<typename T>
 static __global__ void tri_kernel(
@@ -31,10 +22,22 @@ static __global__ void tri_kernel(
     const T * src_row = (const T *) ((const char *) src + i1*nb01 + i2*nb02 + i3*nb03);
     T       * dst_row = (T       *) ((      char *) dst + i1*nb1  + i2*nb2  + i3*nb3);
 
+    // Optimization: Avoid control flow (switch) inside the hot loop.
+    // Map the 4 triangle types to a generic "split point" and "keep direction" logic.
+    // LOWER / UPPER_DIAG: Split at 'r' (i1). LOWER_DIAG / UPPER: Split at 'r + 1'.
+    int add_to_split = 0;
+    if (ttype == GGML_TRI_TYPE_LOWER_DIAG || ttype == GGML_TRI_TYPE_UPPER) {
+        add_to_split = 1;
+    }
+    int64_t split_point = i1 + add_to_split;
+    bool prefix_keep = (ttype == GGML_TRI_TYPE_LOWER || ttype == GGML_TRI_TYPE_LOWER_DIAG);
+
     // Each thread processes elements at stride blockDim.x
     for (int64_t i0 = threadIdx.x; i0 < ne00; i0 += blockDim.x) {
-        dst_row[i0] = tri_compare(i0, i1, ttype)
-            ? src_row[i0] : static_cast<T>(0.f);
+        // If prefix_keep is true, keep (i0 < split_point). Else, keep (i0 >= split_point).
+        bool keep = ((i0 < split_point) == prefix_keep);
+        dst_row[i0] = keep ? src_row[i0] : T(0);
     }
 }

Comment on lines +29 to +30
const T * src_row = (const T *) ((const char *) src + i1*nb01 + i2*nb02 + i3*nb03);
T       * dst_row = (T       *) ((      char *) dst + i1*nb1  + i2*nb2  + i3*nb3);
Collaborator:

As with the other kernel, preferably calculate strides in units of float in host code and pass those.

Collaborator Author:

This is generic though, should I still be calculating in units of float even though T itself might be half?

// Load value and compute prefix sum within warp
float val = static_cast<float>(src_row[i0]);
val = warp_prefix_inclusive_sum(val);
dst_row[i0] = static_cast<T>(val);
Collaborator:

It would be much preferable to store the temporary results in registers or shared memory rather than global memory.
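As a rough illustration of that idea (warp partials stay in registers, per-warp totals are staged in shared memory, and a running carry is kept in a register), assuming warp_prefix_inclusive_sum from the PR, WARP_SIZE (32) from the ggml-cuda headers, and a block size that is a multiple of WARP_SIZE; this is a sketch, not the kernel in this PR:

```cpp
// Sketch: single-block inclusive scan of one row without writing partial sums
// to global memory. Each tile of blockDim.x elements is scanned in registers,
// warp totals go through shared memory, and `carry` holds the sum of all
// previous tiles.
template <typename T>
static __global__ void cumsum_row_kernel(const T * src_row, T * dst_row, const int64_t ne00) {
    __shared__ float warp_sums[32];            // one slot per warp (blockDim.x <= 1024)
    const int warp_id = threadIdx.x / WARP_SIZE;
    const int lane    = threadIdx.x % WARP_SIZE;

    float carry = 0.0f;
    for (int64_t base = 0; base < ne00; base += blockDim.x) {
        const int64_t i0 = base + threadIdx.x;
        float val = i0 < ne00 ? static_cast<float>(src_row[i0]) : 0.0f;
        val = warp_prefix_inclusive_sum(val);  // register-only warp scan

        if (lane == WARP_SIZE - 1) {
            warp_sums[warp_id] = val;          // stage this warp's total
        }
        __syncthreads();

        float offset = carry;                  // add totals of preceding warps
        for (int w = 0; w < warp_id; ++w) {
            offset += warp_sums[w];
        }
        if (i0 < ne00) {
            dst_row[i0] = static_cast<T>(val + offset);
        }

        for (int w = 0; w < blockDim.x / WARP_SIZE; ++w) {
            carry += warp_sums[w];             // advance the carry by this tile's total
        }
        __syncthreads();                       // warp_sums is reused for the next tile
    }
}
```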

Comment on lines +5 to +13
__device__ static inline bool tri_compare(const int i, const int r, const ggml_tri_type type) {
switch (type) {
case GGML_TRI_TYPE_LOWER: return i < r;
case GGML_TRI_TYPE_LOWER_DIAG: return i <= r;
case GGML_TRI_TYPE_UPPER: return i > r;
case GGML_TRI_TYPE_UPPER_DIAG: return i >= r;
default: return false;
}
}
Collaborator:

This is going to be very slow in GPU code. Preferably make this a constexpr function and provide the ggml_tri_type at compile time as a template parameter.
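A sketch of that suggestion, with the host-side dispatch only hinted at in a comment (the template plumbing into the existing launcher is omitted):

```cpp
// Sketch: triangle type fixed at compile time, so the switch disappears from
// the generated code and the comparison compiles to a single predicate.
template <ggml_tri_type type>
static __device__ constexpr bool tri_compare(const int64_t i, const int64_t r) {
    if constexpr (type == GGML_TRI_TYPE_LOWER)      { return i <  r; }
    if constexpr (type == GGML_TRI_TYPE_LOWER_DIAG) { return i <= r; }
    if constexpr (type == GGML_TRI_TYPE_UPPER)      { return i >  r; }
    return i >= r; // GGML_TRI_TYPE_UPPER_DIAG
}

// The kernel would then also take the type as a template parameter:
//   template <typename T, ggml_tri_type type> static __global__ void tri_kernel(...);
// and the host code would do one switch over ttype to pick the instantiation, e.g.
//   case GGML_TRI_TYPE_LOWER: tri_kernel<T, GGML_TRI_TYPE_LOWER><<<grid, block, 0, stream>>>(...); break;
```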

Comment on lines +31 to +32
const T * src_row = (const T *) ((const char *) src + i1*nb01 + i2*nb02 + i3*nb03);
T       * dst_row = (T       *) ((      char *) dst + i1*nb1  + i2*nb2  + i3*nb3);
Collaborator:

Preferably calculate the stride in host code.
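One possible shape of that suggestion (parameter names and grid mapping are illustrative only; strides here are in units of elements, i.e. the byte strides divided by ggml_element_size, which would also cover the half case asked about above):

```cpp
// Sketch: the host converts byte strides to element strides once, so the
// kernel indexes src/dst directly instead of casting through char *.
template <typename T>
static __global__ void row_kernel_elem_strides(
        const T * src, T * dst, const int64_t ne00,
        const int64_t s01, const int64_t s02, const int64_t s03,
        const int64_t s1,  const int64_t s2,  const int64_t s3) {
    const int64_t i1 = blockIdx.x;
    const int64_t i2 = blockIdx.y;
    const int64_t i3 = blockIdx.z;

    const T * src_row = src + i1*s01 + i2*s02 + i3*s03;
    T       * dst_row = dst + i1*s1  + i2*s2  + i3*s3;

    for (int64_t i0 = threadIdx.x; i0 < ne00; i0 += blockDim.x) {
        dst_row[i0] = src_row[i0]; // triangle/cumsum logic elided
    }
}

// Host side, before the launch:
//   const int64_t s01 = src->nb[1] / ggml_element_size(src);
//   const int64_t s1  = dst->nb[1] / ggml_element_size(dst);
```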

JohannesGaessler (Collaborator) commented:

Regarding the implementation proposed by @wsbagnsv1: if one were to do something like that, the (in my opinion) correct way to do it would be to calculate start and end points for copying and for zeroing, and then simply do two loops over those areas; if at all possible, a conditional statement inside the loop should be avoided. But that would potentially make the kernel less flexible if other patterns for ggml_tri_type are ever implemented (I don't know what the intended use cases are). That is why I did not suggest this change: I very much doubt that GGML_TRI is going to have a meaningful impact on end-to-end performance unless it's very poorly implemented.
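For what it's worth, a minimal sketch of that two-loop structure, reusing split_point / prefix_keep from the diff above (variable names illustrative); this body would replace the single loop in the proposed kernel:

```cpp
// Sketch: split each row into a contiguous "copy" range and a "zero" range
// and loop over them separately, so neither hot loop carries a conditional.
const int64_t split      = split_point < ne00 ? split_point : ne00; // clamp to row length
const int64_t copy_start = prefix_keep ? 0     : split;
const int64_t copy_end   = prefix_keep ? split : ne00;
const int64_t zero_start = prefix_keep ? split : 0;
const int64_t zero_end   = prefix_keep ? ne00  : split;

for (int64_t i0 = copy_start + threadIdx.x; i0 < copy_end; i0 += blockDim.x) {
    dst_row[i0] = src_row[i0];
}
for (int64_t i0 = zero_start + threadIdx.x; i0 < zero_end; i0 += blockDim.x) {
    dst_row[i0] = static_cast<T>(0.f);
}
```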
