
Conversation

pwilkin (Collaborator) commented Nov 28, 2025

Extracted and adapted kernels by @gabe-l-hart from #16623

pwilkin requested a review from ggerganov as a code owner on November 28, 2025 23:15
github-actions bot added the testing, Nvidia GPU, and ggml labels on Nov 28, 2025
am17an (Collaborator) commented Nov 29, 2025

For cumsum we should use cub::DeviceScan (https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html) and use this kernel as a fallback.
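For reference, a minimal sketch of the two-call cub::DeviceScan pattern (size query first, then the actual scan). The wrapper name and the use of cudaMallocAsync are illustrative; in ggml-cuda the temporary storage would presumably come from the pool allocator instead:

```cpp
#include <cub/device/device_scan.cuh>

// Sketch: inclusive prefix sum over num_items floats using cub::DeviceScan.
// cumsum_f32_cub is a hypothetical helper, not an existing ggml-cuda function.
static void cumsum_f32_cub(const float * d_in, float * d_out, int num_items, cudaStream_t stream) {
    void * d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call with a null temp pointer only reports the required scratch size.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);

    // Second call performs the scan.
    cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items, stream);

    cudaFreeAsync(d_temp_storage, stream);
}
```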

wsbagnsv1 commented Nov 29, 2025

I have a small optimization for the tri kernel (;
Since it's memory-bandwidth bound there isn't much room, but I think these should be real improvements, and the Nsight numbers show real gains (+18% scheduler utilization). The improved kernel also seems to have less jitter (~56% decrease), though I'm not 100% sure that's real; it could be run-to-run variation. It's not a big change anyway (;

Benchmark Results

1. llama.cpp benchmark (50 runs each)

| Device | Dataset | Metric | Old Kernel | New Kernel | Delta |
|---|---|---|---|---|---|
| Device 0 (RTX 4070 Ti) | Large (1024) | Bandwidth | 476.54 GB/s (±17.79) | 490.05 GB/s (±7.82) | +2.84% |
| Device 0 (RTX 4070 Ti) | Large (1024) | Time | 527.44 μs | 512.26 μs | -2.88% |
| Device 0 (RTX 4070 Ti) | Small (256) | Bandwidth | 1282.55 GB/s (±53.22) | 1333.17 GB/s (±29.37) | +3.95% |
| Device 0 (RTX 4070 Ti) | Small (256) | Time | 6.10 μs | 5.86 μs | -3.93% |
| Device 1 (RTX 2070) | Large (1024) | Bandwidth | 490.77 GB/s (±0.15) | 490.52 GB/s (±0.22) | -0.05% |
| Device 1 (RTX 2070) | Large (1024) | Time | 511.37 μs | 511.64 μs | +0.05% |
| Device 1 (RTX 2070) | Small (256) | Bandwidth | 356.65 GB/s (±4.47) | 361.48 GB/s (±7.81) | +1.35% |
| Device 1 (RTX 2070) | Small (256) | Time | 21.91 μs | 21.63 μs | -1.28% |

2. Profiler statistics, RTX 2070 (Nsight)

| Metric | Old Kernel | New Kernel | Delta |
|---|---|---|---|
| Eligible Warps / Scheduler | 0.390 | 0.460 | +17.95% |
| Warp Cycles / Instruction | 26.87 | 24.92 | -7.24% |
| Physical DRAM Speed | 406.65 GB/s | 406.42 GB/s | -0.05% |
| Executed Instructions | 24.6 M | 26.5 M | +7.44% |
@@ -1,16 +1,7 @@
 #include "tri.cuh"
 #include "ggml.h"
 
-// Triangle type comparison - determines which elements to keep
-__device__ static inline bool tri_compare(const int i, const int r, const ggml_tri_type type) {
-    switch (type) {
-        case GGML_TRI_TYPE_LOWER:      return i < r;
-        case GGML_TRI_TYPE_LOWER_DIAG: return i <= r;
-        case GGML_TRI_TYPE_UPPER:      return i > r;
-        case GGML_TRI_TYPE_UPPER_DIAG: return i >= r;
-        default: return false;
-    }
-}
+
 
 template<typename T>
 static __global__ void tri_kernel(
@@ -31,10 +22,22 @@ static __global__ void tri_kernel(
     const T * src_row = (const T *) ((const char *) src + i1*nb01 + i2*nb02 + i3*nb03);
     T       * dst_row = (T       *) ((      char *) dst + i1*nb1  + i2*nb2  + i3*nb3);
 
+    // Optimization: Avoid control flow (switch) inside the hot loop.
+    // Map the 4 triangle types to a generic "split point" and "keep direction" logic.
+    // LOWER / UPPER_DIAG: Split at 'r' (i1). LOWER_DIAG / UPPER: Split at 'r + 1'.
+    int add_to_split = 0;
+    if (ttype == GGML_TRI_TYPE_LOWER_DIAG || ttype == GGML_TRI_TYPE_UPPER) {
+        add_to_split = 1;
+    }
+    int64_t split_point = i1 + add_to_split;
+    bool prefix_keep = (ttype == GGML_TRI_TYPE_LOWER || ttype == GGML_TRI_TYPE_LOWER_DIAG);
+
     // Each thread processes elements at stride blockDim.x
     for (int64_t i0 = threadIdx.x; i0 < ne00; i0 += blockDim.x) {
-        dst_row[i0] = tri_compare(i0, i1, ttype)
-            ? src_row[i0] : static_cast<T>(0.f);
+        // If prefix_keep is true, keep (i0 < split_point). Else, keep (i0 >= split_point).
+        bool keep = ((i0 < split_point) == prefix_keep);
+        dst_row[i0] = keep ? src_row[i0] : T(0);
     }
 }

Comment on lines +29 to +30
const T * src_row = (const T *) ((const char *) src + i1*nb01 + i2*nb02 + i3*nb03);
T       * dst_row = (T       *) ((      char *) dst + i1*nb1  + i2*nb2  + i3*nb3);
Collaborator:

As with the other kernel, preferably calculate strides in units of float in host code and pass those.

Collaborator Author:

This is generic though, should I still be calculating in units of float even though T itself might be half?

// Load value and compute prefix sum within warp
float val = static_cast<float>(src_row[i0]);
val = warp_prefix_inclusive_sum(val);
dst_row[i0] = static_cast<T>(val);
Collaborator:

It would be much preferable to store the temporary results in registers or shared memory rather than global memory.
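As a rough illustration of that idea (warp partials stay in registers, per-warp totals are staged in shared memory, and a running carry is kept in a register), assuming warp_prefix_inclusive_sum from the PR, WARP_SIZE (32) from the ggml-cuda headers, and a block size that is a multiple of WARP_SIZE; this is a sketch, not the kernel in this PR:

```cpp
// Sketch: single-block inclusive scan of one row without writing partial sums
// to global memory. Each tile of blockDim.x elements is scanned in registers,
// warp totals go through shared memory, and `carry` holds the sum of all
// previous tiles.
template <typename T>
static __global__ void cumsum_row_kernel(const T * src_row, T * dst_row, const int64_t ne00) {
    __shared__ float warp_sums[32];            // one slot per warp (blockDim.x <= 1024)
    const int warp_id = threadIdx.x / WARP_SIZE;
    const int lane    = threadIdx.x % WARP_SIZE;

    float carry = 0.0f;
    for (int64_t base = 0; base < ne00; base += blockDim.x) {
        const int64_t i0 = base + threadIdx.x;
        float val = i0 < ne00 ? static_cast<float>(src_row[i0]) : 0.0f;
        val = warp_prefix_inclusive_sum(val);  // register-only warp scan

        if (lane == WARP_SIZE - 1) {
            warp_sums[warp_id] = val;          // stage this warp's total
        }
        __syncthreads();

        float offset = carry;                  // add totals of preceding warps
        for (int w = 0; w < warp_id; ++w) {
            offset += warp_sums[w];
        }
        if (i0 < ne00) {
            dst_row[i0] = static_cast<T>(val + offset);
        }

        for (int w = 0; w < blockDim.x / WARP_SIZE; ++w) {
            carry += warp_sums[w];             // advance the carry by this tile's total
        }
        __syncthreads();                       // warp_sums is reused for the next tile
    }
}
```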

Comment on lines +5 to +13
__device__ static inline bool tri_compare(const int i, const int r, const ggml_tri_type type) {
switch (type) {
case GGML_TRI_TYPE_LOWER: return i < r;
case GGML_TRI_TYPE_LOWER_DIAG: return i <= r;
case GGML_TRI_TYPE_UPPER: return i > r;
case GGML_TRI_TYPE_UPPER_DIAG: return i >= r;
default: return false;
}
}
Collaborator:

This is going to be very slow in GPU code. Preferably make this a constexpr function and provide the ggml_tri_type at compile time as a template parameter.
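A sketch of that suggestion, with the host-side dispatch only hinted at in a comment (the template plumbing into the existing launcher is omitted):

```cpp
// Sketch: triangle type fixed at compile time, so the switch disappears from
// the generated code and the comparison compiles to a single predicate.
template <ggml_tri_type type>
static __device__ constexpr bool tri_compare(const int64_t i, const int64_t r) {
    if constexpr (type == GGML_TRI_TYPE_LOWER)      { return i <  r; }
    if constexpr (type == GGML_TRI_TYPE_LOWER_DIAG) { return i <= r; }
    if constexpr (type == GGML_TRI_TYPE_UPPER)      { return i >  r; }
    return i >= r; // GGML_TRI_TYPE_UPPER_DIAG
}

// The kernel would then also take the type as a template parameter:
//   template <typename T, ggml_tri_type type> static __global__ void tri_kernel(...);
// and the host code would do one switch over ttype to pick the instantiation, e.g.
//   case GGML_TRI_TYPE_LOWER: tri_kernel<T, GGML_TRI_TYPE_LOWER><<<grid, block, 0, stream>>>(...); break;
```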

Comment on lines +31 to +32
const T * src_row = (const T *) ((const char *) src + i1*nb01 + i2*nb02 + i3*nb03);
T       * dst_row = (T       *) ((      char *) dst + i1*nb1  + i2*nb2  + i3*nb3);
Collaborator:

Preferably calculate the stride in host code.
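One possible shape of that suggestion (parameter names and grid mapping are illustrative only; strides here are in units of elements, i.e. the byte strides divided by ggml_element_size, which would also cover the half case asked about above):

```cpp
// Sketch: the host converts byte strides to element strides once, so the
// kernel indexes src/dst directly instead of casting through char *.
template <typename T>
static __global__ void row_kernel_elem_strides(
        const T * src, T * dst, const int64_t ne00,
        const int64_t s01, const int64_t s02, const int64_t s03,
        const int64_t s1,  const int64_t s2,  const int64_t s3) {
    const int64_t i1 = blockIdx.x;
    const int64_t i2 = blockIdx.y;
    const int64_t i3 = blockIdx.z;

    const T * src_row = src + i1*s01 + i2*s02 + i3*s03;
    T       * dst_row = dst + i1*s1  + i2*s2  + i3*s3;

    for (int64_t i0 = threadIdx.x; i0 < ne00; i0 += blockDim.x) {
        dst_row[i0] = src_row[i0]; // triangle/cumsum logic elided
    }
}

// Host side, before the launch:
//   const int64_t s01 = src->nb[1] / ggml_element_size(src);
//   const int64_t s1  = dst->nb[1] / ggml_element_size(dst);
```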

JohannesGaessler (Collaborator) commented:

Regarding the implementation proposed by @wsbagnsv1: if one were to do something like that, the (in my opinion) correct way to do it would be to calculate start and end points for copying and for zeroing, and then simply do two loops over those areas; if at all possible, a conditional statement inside the loop should be avoided. But that would potentially make the kernel less flexible if other patterns for ggml_tri_type are ever implemented (I don't know what the intended use cases are). That is why I did not suggest this change: I very much doubt that GGML_TRI is going to have a meaningful impact on end-to-end performance unless it's very poorly implemented.
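For what it's worth, a minimal sketch of that two-loop structure, reusing split_point / prefix_keep from the diff above (variable names illustrative); this body would replace the single loop in the proposed kernel:

```cpp
// Sketch: split each row into a contiguous "copy" range and a "zero" range
// and loop over them separately, so neither hot loop carries a conditional.
const int64_t split      = split_point < ne00 ? split_point : ne00; // clamp to row length
const int64_t copy_start = prefix_keep ? 0     : split;
const int64_t copy_end   = prefix_keep ? split : ne00;
const int64_t zero_start = prefix_keep ? split : 0;
const int64_t zero_end   = prefix_keep ? ne00  : split;

for (int64_t i0 = copy_start + threadIdx.x; i0 < copy_end; i0 += blockDim.x) {
    dst_row[i0] = src_row[i0];
}
for (int64_t i0 = zero_start + threadIdx.x; i0 < zero_end; i0 += blockDim.x) {
    dst_row[i0] = static_cast<T>(0.f);
}
```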
