@zhang-hui-yulo
Contributor

zhang-hui-yulo commented Nov 7, 2025

Add RDNA4 tensor core support for MMF. Honestly, the performance is lower than expected. The model is at https://huggingface.co/Mungert/DeepSeek-R1-0528-Qwen3-8B-GGUF

| Model | Microbatch size | Test | t/s master | t/s 672492fc | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| qwen3 8B Q8_0 | 1 | pp512 | 46.48 | 54.61 | 1.18 |
| qwen3 8B Q8_0 | 2 | pp512 | 89.96 | 85.92 | 0.96 |
| qwen3 8B Q8_0 | 3 | pp512 | 132.92 | 126.23 | 0.95 |
| qwen3 8B Q8_0 | 4 | pp512 | 176.06 | 166.12 | 0.94 |
| qwen3 8B Q8_0 | 5 | pp512 | 212.00 | 197.77 | 0.93 |
| qwen3 8B Q8_0 | 6 | pp512 | 252.54 | 233.83 | 0.93 |
| qwen3 8B Q8_0 | 7 | pp512 | 289.87 | 266.58 | 0.92 |
| qwen3 8B Q8_0 | 8 | pp512 | 318.56 | 290.63 | 0.91 |
| qwen3 8B Q8_0 | 9 | pp512 | 344.41 | 314.93 | 0.91 |
| qwen3 8B Q8_0 | 10 | pp512 | 377.97 | 342.75 | 0.91 |
| qwen3 8B Q8_0 | 11 | pp512 | 416.42 | 373.85 | 0.90 |
| qwen3 8B Q8_0 | 12 | pp512 | 447.61 | 398.83 | 0.89 |
| qwen3 8B Q8_0 | 13 | pp512 | 486.83 | 429.74 | 0.88 |
| qwen3 8B Q8_0 | 14 | pp512 | 525.24 | 458.88 | 0.87 |
| qwen3 8B Q8_0 | 15 | pp512 | 555.91 | 482.08 | 0.87 |
| qwen3 8B Q8_0 | 16 | pp512 | 580.07 | 512.47 | 0.88 |

@JohannesGaessler, it looks like #16988 changes mmf.cu to

for (size_t i = 0; i < GGML_MAX_DIMS; ++i) {
    if (src0_nb[i] % (2*ts) != 0) {
        return false;
    }
}

With this check, the native MMF path is never exercised on my RDNA4 card; it always falls back to the hipBLAS path.
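A minimal sketch of why a check of this shape can reject every tensor. It rests on an assumption that is not confirmed in this thread: that `src0_nb[0]` equals the element size `ts`, as it does for a contiguous, non-quantized tensor. In that case `ts % (2*ts)` is never zero, so the loop returns `false` at `i = 0`. The names `kMaxDims`, `Dims`, `strides_aligned` and `contiguous_strides` are hypothetical stand-ins, and starting the loop at dimension 1 is only one possible remedy, not necessarily the one that was merged.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-in for GGML_MAX_DIMS (hypothetical name, value assumed to be 4).
constexpr std::size_t kMaxDims = 4;

using Dims = std::array<std::size_t, kMaxDims>;

// True if every byte stride nb[i] with i >= first_dim is a multiple of 2*ts.
// first_dim = 0 mirrors the loop quoted above.
inline bool strides_aligned(const Dims & nb, std::size_t ts, std::size_t first_dim) {
    assert(ts > 0);
    for (std::size_t i = first_dim; i < kMaxDims; ++i) {
        if (nb[i] % (2 * ts) != 0) {
            return false;
        }
    }
    return true;
}

// Byte strides of a contiguous tensor: nb[0] = ts, nb[i] = nb[i-1] * ne[i-1].
// This layout is an assumption made for the illustration.
inline Dims contiguous_strides(const Dims & ne, std::size_t ts) {
    Dims nb{};
    nb[0] = ts;
    for (std::size_t i = 1; i < kMaxDims; ++i) {
        nb[i] = nb[i - 1] * ne[i - 1];
    }
    return nb;
}
```

For a contiguous 2-byte tensor with `ne = {14336, 4096, 1, 1}`, `strides_aligned(nb, 2, 0)` is false because `nb[0] = 2` is not a multiple of 4, while `strides_aligned(nb, 2, 1)` is true.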

github-actions bot added the "Nvidia GPU" and "ggml" labels Nov 7, 2025
@JohannesGaessler
Collaborator

> honestly the performance is lower than expectation.

To my knowledge, the WMMA instructions on RDNA do not increase peak FLOPS; they only reduce I/O and register usage.

> then native mmf won't be exercised on my RDNA4, it always uses hipblas path.

Yes, sorry, that was a bug that I introduced.

@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Nov 8, 2025

> > honestly the performance is lower than expectation.
>
> On RDNA the WMMA instructions do to my knowledge not increase peak FLOPS, they only reduce I/O and register usage.
>
> > then native mmf won't be exercised on my RDNA4, it always uses hipblas path.
>
> Yes sorry, that was a bug that I introduced.

Thank you for the tip. AFAIK, the tensor cores on RDNA3 use the same silicon as the vector instructions, while RDNA4 redesigns the tensor cores to be more like CDNA.

Still, it should at least not be slower than hipBLAS. I will spend some time finding the root cause; one thing I already know is that the HIP compiler does not allocate registers very well.

@JohannesGaessler
Collaborator

Looking at the data layout I suspect the biggest problem has to do with shared memory bank conflicts or whatever you would call it for AMD. For NVIDIA I chose the shared memory layout to be padded with 16 bytes because the dedicated ldmatrix instruction can be used to load 4 bytes per thread with groups of 4 threads making a single 16 byte load from shared memory. If you just load 4 byte chunks with the regular indices provided by get_i and get_j you end up with the memory accesses going to only 16 out of (I think) the 32 shared memory banks and you only get 50% of the memory bandwidth.
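The effect of padding on bank usage can be illustrated with a small host-side model. This is a sketch under stated assumptions, not the actual `get_i`/`get_j` indexing: 32 threads each load one 4-byte word, thread `t` reads word `t % 4` of row `t / 4`, and banks are 4 bytes wide with consecutive words mapped to consecutive banks. The function name `distinct_banks` and the 64-byte row stride are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts how many distinct banks are touched when 32 threads each load one
// 4-byte word. Thread t reads word (t % 4) of row (t / 4); this access
// pattern is illustrative only and is not taken from mmf.cu.
inline std::size_t distinct_banks(std::size_t row_stride_bytes, std::size_t n_banks) {
    assert(n_banks > 0);
    std::set<std::size_t> banks;
    for (std::size_t t = 0; t < 32; ++t) {
        const std::size_t row       = t / 4;
        const std::size_t col       = t % 4;
        const std::size_t byte_addr = row * row_stride_bytes + col * 4;
        // Assumed mapping: 4-byte-wide banks, consecutive words in consecutive banks.
        banks.insert((byte_addr / 4) % n_banks);
    }
    return banks.size();
}
```

In this model, a 64-byte row stride with 32 banks touches only 8 distinct banks, so several threads collide on each one. Padding every row by 16 bytes (an 80-byte stride) spreads the same 32 loads over all 32 banks.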

@zhang-hui-yulo
Contributor Author

> Looking at the data layout I suspect the biggest problem has to do with shared memory bank conflicts or whatever you would call it for AMD. For NVIDIA I chose the shared memory layout to be padded with 16 bytes because the dedicated ldmatrix instruction can be used to load 4 bytes per thread with groups of 4 threads making a single 16 byte load from shared memory. If you just load 4 byte chunks with the regular indices provided by get_i and get_j you end up with the memory accesses going to only 16 out of (I think) the 32 shared memory banks and you only get 50% of the memory bandwidth.

Thank you for the tips. There is little information on the AMD bank layout. Based on the limited documentation I have, RDNA3 has 32 banks in CU mode and 64 banks in WGP mode (WGP mode is the default), and the bank width is one DWORD. I don't have any documentation for RDNA4, so I assume it is similar. I therefore did not change any code logic in mmf and only adapted the WMMA instruction.

Based on the WMMA layout of RDNA4, I kept the old ldmatrix logic and use vectorized loads in load_generic. Honestly, I'm not sure whether shared memory bank conflicts are the root cause.

[Image: wmma_f16_16x16x16_f16_w32_gfx12]

@JohannesGaessler
Collaborator

Are you aware of the AMD ISA documentation?

@zhang-hui-yulo
Contributor Author

> Are you aware of the AMD ISA documentation?

Honestly, not very well, as it isn't friendly to software developers. Based on GEMM benchmarks with my modified NVIDIA CuTe for RDNA3, the bank layout is the same as on NVIDIA Ampere.

I just checked the RDNA4 ISA documentation:

> 1.2.2.1. Local Data Share (LDS)
>
> ...
> Each work-group processor (WGP) has a 128kB memory space that enables low-latency communication between work-items within a work-group, or the work-items within a wave; this is the local data share (LDS). This memory is configured with 64 banks, each with 512 entries of 4 bytes.
>
> 12.1. Overview
>
> There are 128kB of memory per work-group processor split up into 64 banks of DWORD-wide RAMs. These 64 banks are further sub-divided into two sets of 32-banks each where 32 of the banks are affiliated with a pair of SIMD32’s, and the other 32 banks are affiliated with the other pair of SIMD32’s within the WGP. Each bank is a 512x32 two-port RAM (1R/1W per clock cycle). DWORDs are placed in the banks serially, but all banks can execute a store or load simultaneously. One work-group can request up to 64kB memory.

So, it's 32 banks in CU mode and 64 banks in WGP mode, the bank width is 4 bytes.
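The figures quoted from the ISA document can be checked with a little arithmetic: 64 banks of 512 entries of 4 bytes give exactly 128 kB. The sketch below also maps a byte address to a bank under the assumption that "DWORDs are placed in the banks serially" means consecutive DWORDs go to consecutive banks. All names (`kBankWidthBytes`, `lds_bytes`, `lds_bank`, and so on) are hypothetical, and applying the same modulo mapping with 32 banks in CU mode is an assumption, not something the excerpt states.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBankWidthBytes = 4;    // one DWORD per bank entry
constexpr std::size_t kEntriesPerBank = 512;  // "512 entries of 4 bytes"
constexpr std::size_t kBanksWgp       = 64;   // banks per WGP
constexpr std::size_t kBanksCu        = 32;   // banks seen in CU mode (per the comment above)

// Total LDS capacity in bytes for a given number of banks.
constexpr std::size_t lds_bytes(std::size_t n_banks) {
    return n_banks * kEntriesPerBank * kBankWidthBytes;
}

// 64 banks * 512 entries * 4 bytes = 128 kB, matching the quoted text.
static_assert(lds_bytes(kBanksWgp) == 128 * 1024, "LDS size should be 128 kB");

// Bank index of a byte address, assuming consecutive DWORDs map to
// consecutive banks and wrap around after n_banks.
inline std::size_t lds_bank(std::size_t byte_addr, std::size_t n_banks) {
    assert(n_banks > 0);
    return (byte_addr / kBankWidthBytes) % n_banks;
}
```

Under this model, byte address 128 lands in bank 32 with 64 banks but wraps to bank 0 with 32 banks, so a row stride that is conflict-free in one mode may conflict in the other.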

@zhang-hui-yulo
Contributor Author

I think I've found the root cause: mat_mmf_f is slower than hipBLAS, while mat_mmf_ids is faster than hipBLAS.

MUL_MAT

Backend GGML op Op parameters TFLOPS master TFLOPS mmf_wmma_rdna4 Speedup
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.61 0.61 1.00
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.20 1.20 1.00
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.65 1.65 1.00
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.98 1.79 0.90
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.50 2.18 0.87
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 92.19 93.01 1.01
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.92 3.22 0.82
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],k_v=32832,o=1 1.37 1.37 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],k_v=0,o=1 0.34 0.33 0.98
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.61 0.61 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.21 1.21 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.77 1.77 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.27 2.27 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.67 2.67 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 95.79 95.64 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.03 3.25 0.81
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.31 0.31 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.63 0.63 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.94 0.94 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.25 1.25 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.54 1.54 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.47 3.47 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.30 2.30 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.64 3.65 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.94 5.93 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.00 6.97 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.84 7.83 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.19 8.26 1.01
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.72 74.74 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.37 9.34 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.13 4.13 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.92 6.90 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.43 7.40 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.22 8.21 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.84 8.91 1.01
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 75.09 75.23 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.03 8.91 0.99
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.61 1.60 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.85 2.84 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.97 3.97 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.80 4.80 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.09 5.08 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.15 74.19 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.53 6.52 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.20 2.20 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.85 3.84 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.05 5.05 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.88 5.87 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.62 6.61 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.19 74.25 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.79 7.79 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.66 1.66 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.99 2.99 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.16 4.16 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.04 5.04 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.82 5.82 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.48 74.52 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.90 6.90 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.56 1.56 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.88 2.87 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.04 4.04 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.02 5.01 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.86 5.84 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 73.34 73.21 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.06 7.04 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.14 2.13 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.79 3.79 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.08 5.08 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.81 5.80 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.85 6.84 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 73.37 73.46 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.69 7.68 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.79 3.79 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.48 5.48 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.24 7.21 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.77 8.70 0.99
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.54 8.47 0.99
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.52 72.51 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.25 9.22 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.76 3.76 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.54 6.52 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.94 8.93 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.69 9.66 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 10.25 10.21 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.60 72.56 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 10.13 10.11 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.56 3.56 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.08 5.05 0.99
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.15 7.12 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.53 8.54 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.72 8.70 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 73.92 73.95 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.13 9.13 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.86 2.85 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.69 3.68 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.02 4.00 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.20 4.19 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.32 4.30 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.18 72.15 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.36 4.36 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.71 1.71 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.82 2.81 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.41 3.41 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.84 3.83 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.10 4.09 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 68.74 68.69 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.41 4.41 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.02 4.02 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.47 5.47 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.35 7.33 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.74 8.73 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.64 8.60 0.99
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.55 72.71 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.50 9.44 0.99
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.95 3.95 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.88 6.86 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.59 7.57 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.02 9.00 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.84 8.68 0.98
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.18 72.13 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.76 9.74 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.70 2.70 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.69 3.68 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.05 4.05 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.24 4.23 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.37 4.35 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.00 71.93 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.57 4.57 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.11 3.10 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.22 5.21 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.46 6.46 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.80 7.79 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.18 8.08 0.99
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 62.94 62.84 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.89 8.88 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.36 3.35 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.59 5.59 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.04 7.02 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.05 8.05 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.77 8.75 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 63.41 63.49 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.54 9.54 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.53 2.53 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.53 3.53 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.90 3.89 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.13 4.12 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.29 4.29 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 69.41 69.38 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.50 4.49 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.84 1.83 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.91 2.90 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.66 3.66 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.15 4.15 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.51 4.50 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 69.82 69.80 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.16 5.16 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.52 2.53 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.63 4.63 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.44 6.43 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.39 7.38 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.80 7.77 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 68.38 68.34 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.03 8.04 1.00

MUL_MAT_ID

Backend GGML op Op parameters TFLOPS master TFLOPS 0ec241d Speedup
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 0.77 0.77 1.00
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.46 4.32 2.96
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 2.75 4.59 1.67
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 0.38 1.37 3.60
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 0.13 0.66 4.90
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 5.19 5.34 1.03
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 0.73 2.41 3.29
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 0.16 0.80 4.88
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 0.78 0.77 0.99
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 4.68 5.85 1.25
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 8.87 8.80 0.99
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 1.29 2.45 1.89
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 0.36 0.75 2.08
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 13.96 13.94 1.00
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.51 4.32 1.72
ROCm0 MUL_MAT_ID type_a=f16,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 0.44 0.93 2.13
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 0.67 0.68 1.01
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 0.66 0.65 0.99
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 0.90 0.90 0.99
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 0.29 0.30 1.03
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 0.13 0.13 0.97
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 1.65 1.65 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 0.47 0.47 0.99
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 0.16 0.16 1.04
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 0.73 0.73 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 1.62 1.61 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 2.73 2.70 0.99
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 0.90 0.90 1.00
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 0.25 0.27 1.06
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 2.48 2.57 1.04
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 1.09 1.05 0.96
ROCm0 MUL_MAT_ID type_a=f32,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 0.35 0.38 1.09
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 1.49 1.48 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.30 1.25 0.96
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 1.59 1.57 0.99
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 1.48 1.53 1.04
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 0.96 0.97 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 2.40 2.39 0.99
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 1.18 1.20 1.02
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 1.24 1.32 1.06
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 1.60 1.60 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 3.70 3.73 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 5.57 5.60 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 2.62 2.64 1.01
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 1.30 1.14 0.88
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 9.29 9.31 1.00
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.70 2.76 1.02
ROCm0 MUL_MAT_ID type_a=iq2_xs,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 1.46 1.61 1.10
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=1,k=2880,o=1 2.87 2.86 1.00
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=4,k=2880,o=1 1.80 1.38 0.77
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=512,k=2880,o=1 15.45 15.13 0.98
ROCm0 MUL_MAT_ID type_a=mxfp4,type_b=f32,n_mats=32,n_used=4,b=0,m=2880,n=8,k=2880,o=1 1.92 2.02 1.05
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 1.13 1.13 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.51 1.44 0.95
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 1.85 1.85 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 2.00 2.03 1.02
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 1.54 1.76 1.14
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 2.88 2.86 0.99
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 1.23 1.27 1.03
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 2.02 2.23 1.10
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 2.15 2.15 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 4.03 3.95 0.98
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 6.40 6.37 1.00
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 3.61 3.68 1.02
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 1.79 1.69 0.94
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 9.89 10.01 1.01
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.95 2.83 0.96
ROCm0 MUL_MAT_ID type_a=q4_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 2.85 2.60 0.91
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 1.02 1.02 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.25 1.23 0.98
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 1.73 1.73 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 1.63 1.72 1.06
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 1.64 1.50 0.92
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 2.25 2.30 1.02
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 0.98 0.99 1.01
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 1.90 1.98 1.04
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 1.80 1.80 1.00
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 3.71 3.77 1.02
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 5.07 4.91 0.97
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 2.87 2.92 1.02
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 1.96 1.90 0.97
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 8.40 7.68 0.91
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.46 2.40 0.98
ROCm0 MUL_MAT_ID type_a=q4_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 2.23 2.54 1.14
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 1.29 1.29 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.21 1.25 1.03
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 1.43 1.41 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 1.21 1.20 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 1.11 1.12 1.01
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 1.73 1.71 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 1.05 1.08 1.03
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 1.42 1.40 0.98
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 1.40 1.39 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 3.22 3.21 1.00
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 3.86 3.89 1.01
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 2.18 2.15 0.99
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 1.30 1.35 1.04
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 5.08 5.00 0.98
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.33 2.18 0.94
ROCm0 MUL_MAT_ID type_a=q6_K,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 1.67 1.89 1.13
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=1,k=2048,o=1 1.06 1.06 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=128,k=2048,o=1 1.40 1.48 1.05
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=256,k=2048,o=1 1.83 1.80 0.98
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1 1.75 1.91 1.09
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1 1.39 1.32 0.95
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=512,k=2048,o=1 2.75 2.75 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=64,k=2048,o=1 1.13 1.09 0.97
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=8,k=2048,o=1 1.24 1.21 0.98
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=1,k=2048,o=1 1.81 1.81 1.00
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=128,k=2048,o=1 3.74 3.69 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=256,k=2048,o=1 5.90 5.82 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=32,k=2048,o=1 3.06 3.10 1.01
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=4,k=2048,o=1 1.36 1.55 1.13
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=512,k=2048,o=1 9.40 9.28 0.99
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=64,k=2048,o=1 2.77 2.66 0.96
ROCm0 MUL_MAT_ID type_a=q8_0,type_b=f32,n_mats=32,n_used=4,b=0,m=1792,n=8,k=2048,o=1 1.31 2.01 1.53

@JohannesGaessler
Collaborator

How about this: for now we move towards merging this PR but only enable it for MUL_MAT_ID where it's already faster, if in the future it also becomes faster for MUL_MAT we can then enable it for that as well.

@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Nov 9, 2025

How about this: for now we move towards merging this PR but only enable it for MUL_MAT_ID where it's already faster, if in the future it also becomes faster for MUL_MAT we can then enable it for that as well.

Thank you for the support; this is what I was thinking too. Let's disable mul_mat_f on RDNA4 for now, and I will try to write an RDNA4-optimized version in the future.

I also presume that the HIP compiler generates better code for RDNA3 than for RDNA4; I will test this on my 7900XTX next week.

Anyway, could you please review it first? One thing to note: the HIP compiler does not skip code after an early return, so it still compiles everything that follows the return:

if constexpr (rdna_not_supported) {
    NO_DEVICE_CODE;
    return;
}

// the HIP compiler will still compile what follows, e.g. RDNA-unsupported code such as:
tile<16, 8, float> tile_A;
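
The behaviour described above is standard C++ rather than something specific to HIP: inside a template, `if constexpr` only discards the statements in its own branches, so code that follows an early `return` is still instantiated. A minimal sketch of how moving the unsupported path into the `else` branch avoids this; the types `tile_ok` and `tile_unsupported` and all function names are hypothetical and not part of ggml:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for a supported and an unsupported tile type.
struct tile_ok          { static constexpr int ne = 8; };
struct tile_unsupported { /* deliberately has no `ne` member */ };

template <typename T>
constexpr bool is_supported = std::is_same_v<T, tile_ok>;

// Early-return form: everything after the `if constexpr` block is still
// instantiated, so sum_ne_early<tile_unsupported>() would fail to compile.
template <typename T>
int sum_ne_early() {
    if constexpr (!is_supported<T>) {
        return -1;
    }
    return T::ne * 2; // instantiated for every T
}

// if/else form: for an unsupported T the else branch is a discarded
// statement and is never instantiated.
template <typename T>
int sum_ne_guarded() {
    if constexpr (!is_supported<T>) {
        return -1;
    } else {
        return T::ne * 2;
    }
}

// Small self-check; note that only the guarded form is used with tile_unsupported.
inline void check_guarded() {
    assert(sum_ne_early<tile_ok>() == 16);
    assert(sum_ne_guarded<tile_ok>() == 16);
    assert(sum_ne_guarded<tile_unsupported>() == -1);
}
```

Whether this pattern fits the kernel in question depends on how much of the body would have to move into the `else` branch.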

Also, I don't see a performance improvement with real models such as llama3-8b-fp16 and deepseek-r1-8b-fp16 at batch sizes 1 to 16. It looks like I need to test with batch size 512, right?

@zhang-hui-yulo zhang-hui-yulo marked this pull request as ready for review November 9, 2025 10:58
@JohannesGaessler
Collaborator

@zhang-hui-yulo can you tell me if and when you intend to work on FA support or better MMF performance? That would make it easier for me to schedule my own concurrent work to avoid conflicts.

@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Nov 10, 2025

@zhang-hui-yulo can you tell me if and when you intend to work on FA support or better MMF performance? That would make it easier for me to schedule my own concurrent work to avoid conflicts.

Hello @JohannesGaessler, as I'm still not very familiar with the llama.cpp internals, I think my schedule should be:

  1. Port MMF to RDNA3, keeping the original logic, to see whether the performance is good enough.
  2. Port FA to RDNA4, keeping the original logic, to see whether the performance is good enough.
  3. Improve MMF or FA for RDNA4 or RDNA3.

I will start on them once this PR is approved.

I also suggest giving FA on RDNA3 low priority: RDNA3 WMMA isn't well suited to GEMM fusion, because you need shared memory to rearrange the layout of the D matrix of QK.

@JohannesGaessler
Collaborator

@zhang-hui-yulo as it turns out I'll need to touch the MMA FA kernel in the near future regardless of additional hardware support so I'd suggest we do it like this: first I make some changes to the MMA FA kernel during which I'll also add Volta support. Afterwards you can add AMD WMMA support, with the Volta implementation serving as a checklist where in the code it's necessary to make changes due to the different data layout.

@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Nov 11, 2025

@zhang-hui-yulo as it turns out I'll need to touch the MMA FA kernel in the near future regardless of additional hardware support so I'd suggest we do it like this: first I make some changes to the MMA FA kernel during which I'll also add Volta support. Afterwards you can add AMD WMMA support, with the Volta implementation serving as a checklist where in the code it's necessary to make changes due to the different data layout.

I agree, please go ahead first and I will build on your changes. I still need some time to find a good way to support C = B * A on RDNA4 and CDNA3 anyway; maybe adding a new tile class is enough:

template <int I, int J, typename T, bool trans>
class tle : public tile<I, J, T> {
public:
    // return the row index, or the column index when the tile is transposed
    int get_i() {
        if constexpr (!trans) {
            return tile<I, J, T>::get_i();
        } else {
            return tile<I, J, T>::get_j();
        }
    }
};
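
To make the idea of a transposed tile class concrete, here is a self-contained sketch. It assumes a toy tile type whose `get_i`/`get_j` map a per-thread element index to a row and column; `toy_tile`, `toy_tile_view` and that index mapping are illustrative assumptions and do not reflect the actual ggml `tile` interface or the RDNA4 data layout.

```cpp
#include <cassert>

// Toy stand-in for a tile type; the real ggml tile has a different interface.
template <int I_, int J_, typename T>
struct toy_tile {
    static constexpr int I = I_;
    static constexpr int J = J_;
    // Map an element index l to a row / column inside the tile (row-major here).
    static constexpr int get_i(int l) { return l / J; }
    static constexpr int get_j(int l) { return l % J; }
};

// View over a tile that optionally swaps the roles of rows and columns,
// resolved at compile time through the `trans` template parameter.
template <int I_, int J_, typename T, bool trans>
struct toy_tile_view : toy_tile<I_, J_, T> {
    using base = toy_tile<I_, J_, T>;
    static constexpr int get_i(int l) {
        if constexpr (trans) { return base::get_j(l); } else { return base::get_i(l); }
    }
    static constexpr int get_j(int l) {
        if constexpr (trans) { return base::get_i(l); } else { return base::get_j(l); }
    }
};

// Small self-check: element 10 of a 16x8 tile sits at row 1, column 2,
// and the transposed view reports the swapped coordinates.
inline void check_toy_tile_view() {
    using plain   = toy_tile_view<16, 8, float, false>;
    using swapped = toy_tile_view<16, 8, float, true>;
    assert(plain::get_i(10) == 1 && plain::get_j(10) == 2);
    assert(swapped::get_i(10) == 2 && swapped::get_j(10) == 1);
}
```

Because the swap is decided at compile time, the transposed view adds no runtime branch in this sketch.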

Collaborator

@JohannesGaessler JohannesGaessler left a comment

After these two nitpicks are fixed and the CI passes I will approve.

@zhang-hui-yulo
Contributor Author

Hello @JohannesGaessler, may I ask whether there is anything more I need to do? It looks like the Vulkan test failed, but I didn't modify any Vulkan code.

Collaborator

@JohannesGaessler JohannesGaessler left a comment

Sorry, I forgot to press the submit button for my review. It's really just a minor nitpick. I'll test the performance on my RX 9060 XT and then merge.

I will soon make a PR that refactors the MMA FlashAttention kernel to allow for more flexible tile sizes and adds Volta support. The MMA FA kernel on master has some bugs where the wrong variable is being used (but this randomly doesn't matter because both variables have the same value). So I would recommend you wait for that PR until you start working on MMA.

@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Nov 17, 2025

Hello @JohannesGaessler

Please go ahead with your refactor first. I need some time to find the root cause of why MMF is so slow on RDNA4; your code extends the k dimension across warps, and I don't think it has any issue.

I suspect that the ROCm compiler cannot generate high-performance code for RDNA4: I just added the debug code below at line 156 of mul_mat_f, and MMF became much faster than before. I'm trying to find the difference in the generated assembly, and then I will file a bug against the ROCm compiler.

before

#pragma unroll
            for (int k0 = 0; k0 < warp_size; k0 += tile_A::J) {
                load_ldmatrix(A[itA][k0/tile_A::J], tile_xy + k0, tile_k_padded);
            }

after

#pragma unroll
            for (int k0 = 0; k0 < warp_size; k0 += tile_A::J) {
                load_ldmatrix(A[itA][k0/tile_A::J], tile_xy + k0, tile_k_padded);
                if (*(tile_xy + k0) != *(x + itA*tile_A::I*stride_row + col - threadIdx.x + k0) && threadIdx.y == 0 && blockIdx.x == 0 && blockIdx.y == 0) {
                    printf("%i, %i, %i\n", itA, col, k0);
                }
            }

Performance results on the 9070XT with DeepSeek R1 in F16:
Model Microbatch size Test t/s mmf_wmma_rdna4 t/s 07ea6906 Speedup
qwen3 8B Q8_0 1 pp512 54.10 54.08 1.00
qwen3 8B Q8_0 2 pp512 102.56 102.32 1.00
qwen3 8B Q8_0 3 pp512 142.78 141.93 0.99
qwen3 8B Q8_0 4 pp512 183.68 183.27 1.00
qwen3 8B Q8_0 5 pp512 220.52 219.08 0.99
qwen3 8B Q8_0 6 pp512 250.75 284.89 1.14
qwen3 8B Q8_0 7 pp512 288.30 327.21 1.13
qwen3 8B Q8_0 8 pp512 317.77 357.67 1.13
qwen3 8B Q8_0 9 pp512 347.09 389.06 1.12
qwen3 8B Q8_0 10 pp512 381.14 429.97 1.13
qwen3 8B Q8_0 11 pp512 418.73 476.18 1.14
qwen3 8B Q8_0 12 pp512 450.18 511.36 1.14
qwen3 8B Q8_0 13 pp512 491.17 554.36 1.13
qwen3 8B Q8_0 14 pp512 528.75 595.68 1.13
qwen3 8B Q8_0 15 pp512 560.97 633.52 1.13
qwen3 8B Q8_0 16 pp512 587.80 683.19 1.16

Best Regards
Hui

@JohannesGaessler
Collaborator

I'm very sorry but I'm currently traveling and I can't get my machine with the RDNA 4 GPU to start remotely using wake-on-lan. So I currently don't have a way to test performance. Merging this PR will either have to wait until Saturday when I'm back home or you'll have to run the test yourself. What I'd ask you to do is run llama-bench with the following arguments:

-r 1 -fa 1 -n 0 -ub "1-512*2" --progress -o sql|sqlite3 llama-bench.sqlite

both for a small MoE model (I suggest Granite MoE) and for any small dense model using FP16, BF16, and FP32 precision for each model. After that create a table with

python3 scripts/compare-llama-bench.py -s gpu_info,model_type,n_ubatch -i llama-bench.sqlite
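
For readers unfamiliar with the `-ub "1-512*2"` argument in the command above: assuming it follows llama-bench's `first-last*mult` range syntax, it would sweep the micro-batch size over powers of two from 1 to 512. A minimal sketch of that expansion; `expand_mult_range` is a hypothetical helper written for illustration, not llama-bench's actual parser, and it assumes well-formed input with a multiplier greater than 1:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Expand a "first-last*mult" range, e.g. "1-512*2" -> 1, 2, 4, ..., 512.
// Assumes well-formed input, first >= 1 and mult > 1; no error handling.
inline std::vector<int> expand_mult_range(const std::string & spec) {
    const std::size_t dash = spec.find('-');
    const std::size_t star = spec.find('*');
    const int first = std::stoi(spec.substr(0, dash));
    const int last  = std::stoi(spec.substr(dash + 1, star - dash - 1));
    const int mult  = std::stoi(spec.substr(star + 1));

    std::vector<int> values;
    for (int v = first; v <= last; v *= mult) {
        values.push_back(v);
    }
    return values;
}
```

Under that assumption, `expand_mult_range("1-512*2")` yields ten micro-batch sizes, so each model and precision combination produces ten benchmark rows.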

@zhang-hui-yulo
Contributor Author

I'm very sorry but I'm currently traveling and I can't get my machine with the RDNA 4 GPU to start remotely using wake-on-lan. So I currently don't have a way to test performance. Merging this PR will either have to wait until Saturday when I'm back home or you'll have to run the test yourself. What I'd ask you to do is run llama-bench with the following arguments:

-r 1 -fa 1 -n 0 -ub "1-512*2" --progress -o sql|sqlite3 llama-bench.sqlite

both for a small MoE model (I suggest Granite MoE) and for any small dense model using FP16, BF16, and FP32 precision for each model. After that create a table with

python3 scripts/compare-llama-bench.py -s gpu_info,model_type,n_ubatch -i llama-bench.sqlite

Don't worry about this, I will find some time to run the test on my 9070XT (gfx1201). Since the ROCm compiler generates different code for different architectures in the same series (e.g. gfx1100 code differs from gfx1101), I suggest you also run the test on the 9060XT (gfx1200).

Also, I'm still tuning the MMF optimization based on the printf finding, to make the compiler use more registers. Performance increases by about 50%, but the code is ugly, and this also needs your 9060 to evaluate. Please give me 2 to 3 days.

@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Nov 19, 2025

Hello @JohannesGaessler

I just wrote an ugly macro to force the ROCm compiler to use more registers for mul_mat_f. Could you please check on your 9060 whether the result is similar to that on my 9070XT?

I think it's a bug in the ROCm compiler for gfx1201. mul_mat_f_ids seems to be fine, so I only modified mul_mat_f. If you think this change is acceptable, I suggest the following steps:

  • Submit this change into this PR and merge the PR into the main branch.
  • I will submit a bug to ROCm; having the code in the main branch will give ROCm more motivation to fix this compiler issue.
  • I will create another PR that references the submitted ticket in a comment on MMF_REGISTER_UNROLL_FOR_RDNA and tries to move some workload from mmvf to mmf.

For now I'm attaching the changed file from the mmf_wmma_rdna4 branch for review:
mmf.zip

Compile command on Ubuntu 24.04.3 with ROCm 7.1.0:

# remove the build folder first
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1201 -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=OFF -DGGML_HIP_ROCWMMA_FATTN=ON
cmake --build build -j

MUL_MAT performance before and after MMF_REGISTER_UNROLL_FOR_RDNA on the mmf_wmma_rdna4 branch:
Backend GGML op Op parameters TFLOPS 6802fbf TFLOPS mmf_wmma_rdna4 Speedup
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.61 0.61 1.00
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.20 1.20 1.00
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.66 1.65 1.00
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.98 2.49 1.26
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.49 3.11 1.25
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 91.97 93.38 1.02
ROCm0 MUL_MAT type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.92 4.95 1.26
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=128,n=1,k=16416,bs=[8,1],nr=[4,1],per=[0,1,2,3],k_v=32832,o=1 1.38 1.37 0.99
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=16416,n=1,k=128,bs=[8,1],nr=[4,1],per=[0,2,1,3],k_v=0,o=1 0.34 0.34 0.99
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.61 0.61 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.21 1.21 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.77 1.77 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.28 2.27 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.68 2.68 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 96.23 95.96 1.00
ROCm0 MUL_MAT type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.03 4.95 1.23
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.31 0.31 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.63 0.63 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 0.94 0.94 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.25 1.25 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.55 1.54 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.48 3.47 1.00
ROCm0 MUL_MAT type_a=f32,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.31 2.30 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.67 3.66 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.98 5.95 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.07 7.03 0.99
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.89 7.86 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.23 8.30 1.01
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 75.00 74.90 1.00
ROCm0 MUL_MAT type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.41 9.36 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.18 4.15 0.99
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.93 6.91 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.44 7.39 0.99
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.28 8.25 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.89 8.95 1.01
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 75.37 75.36 1.00
ROCm0 MUL_MAT type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.10 8.90 0.98
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.62 1.61 0.99
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.87 2.86 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.98 3.97 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.83 4.81 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.12 5.09 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.38 74.34 1.00
ROCm0 MUL_MAT type_a=iq2_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.56 6.53 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.22 2.21 0.99
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.87 3.85 0.99
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.10 5.06 0.99
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.91 5.90 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.65 6.63 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.42 74.43 1.00
ROCm0 MUL_MAT type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.83 7.81 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.69 1.67 0.99
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.01 2.99 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.18 4.16 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.08 5.05 0.99
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.85 5.83 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.65 74.65 1.00
ROCm0 MUL_MAT type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.95 6.92 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.58 1.56 0.99
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.89 2.88 0.99
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.06 4.05 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.04 5.03 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.88 5.86 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 73.45 73.37 1.00
ROCm0 MUL_MAT type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.10 7.07 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.16 2.14 0.99
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.83 3.81 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.12 5.10 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.85 5.83 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.89 6.87 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 73.55 73.56 1.00
ROCm0 MUL_MAT type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.73 7.71 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.83 3.80 0.99
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.39 5.49 1.02
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.32 7.26 0.99
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.75 8.80 1.01
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.56 8.54 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.71 72.61 1.00
ROCm0 MUL_MAT type_a=iq4_nl,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.31 9.31 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.81 3.77 0.99
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.59 6.56 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.99 8.93 0.99
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.68 9.67 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 10.31 10.27 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.73 72.66 1.00
ROCm0 MUL_MAT type_a=iq4_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 10.17 10.15 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.61 3.58 0.99
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.07 5.07 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.20 7.18 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.61 8.57 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.72 8.77 1.01
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 74.17 74.08 1.00
ROCm0 MUL_MAT type_a=mxfp4,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.18 9.24 1.01
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.88 2.87 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.72 3.71 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.04 4.02 0.99
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.23 4.21 0.99
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.34 4.32 0.99
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.35 72.28 1.00
ROCm0 MUL_MAT type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.40 4.38 0.99
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.74 1.72 0.99
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.85 2.84 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.44 3.42 0.99
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.87 3.85 0.99
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.13 4.10 0.99
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 68.90 68.81 1.00
ROCm0 MUL_MAT type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.44 4.42 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.08 4.04 0.99
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.51 5.50 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.41 7.37 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.79 8.78 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.69 8.66 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.77 72.86 1.00
ROCm0 MUL_MAT type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.50 9.46 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.01 3.98 0.99
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.96 6.90 0.99
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.65 7.62 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.06 9.05 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.75 8.75 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.51 72.40 1.00
ROCm0 MUL_MAT type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.79 9.75 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.75 2.71 0.99
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.72 3.70 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.07 4.06 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.28 4.25 0.99
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.39 4.36 0.99
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 72.15 72.09 1.00
ROCm0 MUL_MAT type_a=q4_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.60 4.58 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.13 3.12 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.23 5.23 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.39 6.49 1.02
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.85 7.85 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.22 8.13 0.99
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 62.99 63.01 1.00
ROCm0 MUL_MAT type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.92 8.91 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.39 3.37 0.99
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.60 5.61 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.11 7.06 0.99
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.11 8.05 0.99
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.82 8.82 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 63.59 63.64 1.00
ROCm0 MUL_MAT type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 9.65 9.59 0.99
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.57 2.54 0.99
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.57 3.54 0.99
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.93 3.91 0.99
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.15 4.13 0.99
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.32 4.30 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 69.47 69.46 1.00
ROCm0 MUL_MAT type_a=q5_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.52 4.50 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 1.85 1.85 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.93 2.92 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 3.69 3.67 0.99
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.19 4.16 0.99
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.54 4.51 0.99
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 69.95 69.88 1.00
ROCm0 MUL_MAT type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 5.19 5.16 0.99
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 2.55 2.57 1.01
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 4.68 4.65 0.99
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 6.49 6.46 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.47 7.43 0.99
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 7.86 7.82 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 68.53 68.51 1.00
ROCm0 MUL_MAT type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1 8.11 8.09 1.00
deepseek r1 8b fp16 performance before and after MMF_REGISTER_UNROLL_FOR_RDNA on mmf_wmma_rdna4 branch

| Model | Microbatch size | Test | t/s 6802fbf | t/s mmf_wmma_rdna4 | Speedup |
|---|---|---|---|---|---|
| qwen3 8B Q8_0 | 1 | pp512 | 53.99 | 54.02 | 1.00 |
| qwen3 8B Q8_0 | 2 | pp512 | 102.16 | 102.11 | 1.00 |
| qwen3 8B Q8_0 | 3 | pp512 | 141.44 | 141.83 | 1.00 |
| qwen3 8B Q8_0 | 4 | pp512 | 183.09 | 182.65 | 1.00 |
| qwen3 8B Q8_0 | 5 | pp512 | 218.77 | 218.75 | 1.00 |
| qwen3 8B Q8_0 | 6 | pp512 | 249.75 | 287.96 | 1.15 |
| qwen3 8B Q8_0 | 7 | pp512 | 286.17 | 330.31 | 1.15 |
| qwen3 8B Q8_0 | 8 | pp512 | 315.15 | 360.81 | 1.14 |
| qwen3 8B Q8_0 | 9 | pp512 | 346.79 | 395.68 | 1.14 |
| qwen3 8B Q8_0 | 10 | pp512 | 380.79 | 433.94 | 1.14 |
| qwen3 8B Q8_0 | 11 | pp512 | 418.59 | 479.33 | 1.15 |
| qwen3 8B Q8_0 | 12 | pp512 | 450.05 | 513.97 | 1.14 |
| qwen3 8B Q8_0 | 13 | pp512 | 489.72 | 559.07 | 1.14 |
| qwen3 8B Q8_0 | 14 | pp512 | 528.28 | 603.53 | 1.14 |
| qwen3 8B Q8_0 | 15 | pp512 | 559.31 | 639.76 | 1.14 |
| qwen3 8B Q8_0 | 16 | pp512 | 587.03 | 689.59 | 1.17 |

Best Regards
Hui
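
For readers checking the tables above: the Speedup column appears to be the new throughput divided by the baseline throughput, rounded to two decimals. The sketch below assumes that definition (the helper name `speedup` is hypothetical and not part of llama.cpp) and recomputes three rows of the qwen3 8B table.

```python
def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Assumed definition: candidate throughput over baseline, rounded to 2 decimals."""
    return round(candidate_tps / baseline_tps, 2)


# (microbatch size, t/s 6802fbf, t/s mmf_wmma_rdna4), copied from the table above
rows = [
    (5, 218.77, 218.75),
    (6, 249.75, 287.96),
    (16, 587.03, 689.59),
]

for ubatch, base, cand in rows:
    print(f"ubatch={ubatch:2d} speedup={speedup(base, cand):.2f}")
```

Under this assumption the recomputed values match the reported 1.00, 1.15 and 1.17, which is consistent with the gain only appearing from microbatch size 6 upward.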

@jammm
Contributor

jammm commented Nov 19, 2025

Honestly, not very much, as it isn't friendly for software developers

You can use tools like Claude, Gemini or ChatGPT to help summarize it for you. See the RDNA4 ISA at https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna4-instruction-set-architecture.pdf. There are also tools like https://github.com/ROCm/amd_matrix_instruction_calculator which can help you understand the low-level details better.

Speaking of RDNA4, I also recommend taking a look at section 11.6.2, "WMMA Load-Transpose Instructions", in the doc. It could help with the matrix loading logic in the transpose case.

I think it's a bug in the ROCm compiler for gfx1201

Please feel free to create an issue about it at https://github.com/llvm/llvm-project/issues. BTW, in case you haven't already, try playing around with __launch_bounds__ to adjust VGPR usage.

Re. benchmarks, it would be good to share benchmarks of master vs. mmf_wmma_rdna4 with and without MMF_REGISTER_UNROLL_FOR_RDNA, instead of using mmf_wmma_rdna4 as the baseline.

@zhang-hui-yulo
Contributor Author

zhang-hui-yulo commented Nov 19, 2025

Thank you for the tip. About the transpose load: I've read https://gpuopen.com/learn/accelerating_generative_ai_on_amd_radeon_gpus/. Honestly, I would suggest adding a transpose load for shared memory; it would be more useful than the global load.

Speaking of the transpose load, I'm looking for a movmatrix replacement on RDNA4 or later, as the mma layout isn't friendly to transposing in registers. The one I wrote is extremely slow because it loads data redundantly. NVIDIA's mma layout is friendly to transposing in registers, so it's easy to write a software version even without the movmatrix instruction.

About the master vs. mmf_wmma_rdna4 data for MMF_REGISTER_UNROLL_FOR_RDNA: I think using mmf_wmma_rdna4 as the baseline is enough. If you check mmf.cu in the commit, mul_mat_f is disabled by default on RDNA4 because of the performance drop, so the code submitted in this PR uses the hipBLAS path.

@zhang-hui-yulo
Contributor Author

Hello @JohannesGaessler

I just wrote an ugly macro to force the ROCm compiler to use more registers for mul_mat_f. Could you please check on your 9060 whether the result is similar to the one on my 9070 XT?

I think it's a bug in the ROCm compiler for gfx1201. mul_mat_f_ids seems to be fine, so I only modified mul_mat_f. If you think this change is acceptable, I would suggest the following steps:

  • Submit this change into this PR and merge the PR into the main branch.
  • I will submit a bug to ROCm; having the code in the main branch will give ROCm more motivation to fix this compiler issue.
  • I will create another PR that references the submitted ticket in a comment on MMF_REGISTER_UNROLL_FOR_RDNA and tries to move some workload from mmvf to mmf.

For now I'm attaching the changed file from the mmf_wmma_rdna4 branch for review: mmf.zip

Compile command on Ubuntu 24.04.3 with ROCm 7.1.0:

// remove build folder
cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1201 -DCMAKE_BUILD_TYPE=Release -DLLAMA_CURL=OFF -DGGML_HIP_ROCWMMA_FATTN=ON
cmake --build build -j

Hello @JohannesGaessler, any comments on this proposal? If you don't object, I would prefer to push this commit into this PR; then I can do the final test you requested before the merge. Thank you.

@JohannesGaessler
Collaborator

Let's keep this PR simple, please. From my end I've already reviewed it as-is; I just currently cannot test the performance. And it's easier for me to rebase my code on top of this PR once it's on master rather than to keep stacking changes on divergent branches. So I would prefer if the thing you're describing was done in a follow-up PR.

@zhang-hui-yulo
Contributor Author

Got it, I will test the model tomorrow. Once this PR is merged, I will create another PR to enable mul_mat_f for RDNA4.

@zhang-hui-yulo
Contributor Author

I'm very sorry but I'm currently traveling and I can't get my machine with the RDNA 4 GPU to start remotely using wake-on-lan. So I currently don't have a way to test performance. Merging this PR will either have to wait until Saturday when I'm back home or you'll have to run the test yourself. What I'd ask you to do is run llama-bench with the following arguments:

-r 1 -fa 1 -n 0 -ub "1-512*2" --progress -o sql|sqlite3 llama-bench.sqlite

both for a small MoE model (I suggest Granite MoE) and for any small dense model using FP16, BF16, and FP32 precision for each model. After that create a table with

python3 scripts/compare-llama-bench.py -s gpu_info,model_type,n_ubatch -i llama-bench.sqlite
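
As a reading aid for the `-ub "1-512*2"` argument in the llama-bench command above: to my understanding llama-bench accepts ranges written as `first-last`, `first-last+step` or `first-last*mult`, so this value should sweep the microbatch size over powers of two from 1 to 512. The sketch below is an illustrative re-implementation under that assumption, not llama-bench's own parser, and `expand_range` is a hypothetical helper name.

```python
def expand_range(spec: str) -> list[int]:
    """Expand 'first-last', 'first-last+step' or 'first-last*mult' into a list.

    Assumed syntax; this mirrors the documented behaviour only as far as
    non-negative integer ranges are concerned.
    """
    op, step = "+", 1
    for candidate in ("+", "*"):
        if candidate in spec:
            spec, step_text = spec.split(candidate, 1)
            op, step = candidate, int(step_text)
            break
    first_text, last_text = spec.split("-", 1)
    first, last = int(first_text), int(last_text)

    # Reject inputs that would never advance past `last`.
    if step < 1 or (op == "*" and (step < 2 or first < 1)):
        raise ValueError(f"range would not terminate: {spec!r}")

    values = []
    value = first
    while value <= last:
        values.append(value)
        value = value + step if op == "+" else value * step
    return values


print(expand_range("1-512*2"))
```

The expanded list, 1, 2, 4, and so on up to 512, lines up with the microbatch sizes that appear in the result tables that follow.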

Hello @JohannesGaessler,

I've finished the tests you asked for; the data is below. If it looks good, could you merge this PR? Then I will create a new PR to optimize and enable mul_mat_f for RDNA4. Thank you.

https://huggingface.co/ibm-granite/granite-3.1-1b-a400m-instruct

bf16

| GPU | Model | Microbatch size | Test | t/s master | t/s mmf_wmma_rdna4 | Speedup |
|---|---|---|---|---|---|---|
| RX 9070 XT | granitemoe ?B BF16 | 1 | pp512 | 219.36 | 219.96 | 1.00 |
| RX 9070 XT | granitemoe ?B BF16 | 2 | pp512 | 122.25 | 380.32 | 3.11 |
| RX 9070 XT | granitemoe ?B BF16 | 4 | pp512 | 211.77 | 696.45 | 3.29 |
| RX 9070 XT | granitemoe ?B BF16 | 8 | pp512 | 356.37 | 1249.76 | 3.51 |
| RX 9070 XT | granitemoe ?B BF16 | 16 | pp512 | 571.52 | 1852.40 | 3.24 |
| RX 9070 XT | granitemoe ?B BF16 | 32 | pp512 | 1075.71 | 3225.50 | 3.00 |
| RX 9070 XT | granitemoe ?B BF16 | 64 | pp512 | 1872.01 | 4540.26 | 2.43 |
| RX 9070 XT | granitemoe ?B BF16 | 128 | pp512 | 3219.74 | 6824.77 | 2.12 |
| RX 9070 XT | granitemoe ?B BF16 | 256 | pp512 | 5516.22 | 10610.57 | 1.92 |
| RX 9070 XT | granitemoe ?B BF16 | 512 | pp512 | 8743.02 | 13506.05 | 1.54 |

f16

| GPU | Model | Microbatch size | Test | t/s master | t/s mmf_wmma_rdna4 | Speedup |
|---|---|---|---|---|---|---|
| RX 9070 XT | granitemoe ?B F16 | 1 | pp512 | 220.13 | 222.33 | 1.01 |
| RX 9070 XT | granitemoe ?B F16 | 2 | pp512 | 136.71 | 385.15 | 2.82 |
| RX 9070 XT | granitemoe ?B F16 | 4 | pp512 | 235.22 | 677.37 | 2.88 |
| RX 9070 XT | granitemoe ?B F16 | 8 | pp512 | 428.82 | 1328.02 | 3.10 |
| RX 9070 XT | granitemoe ?B F16 | 16 | pp512 | 684.12 | 1965.15 | 2.87 |
| RX 9070 XT | granitemoe ?B F16 | 32 | pp512 | 1290.66 | 3388.88 | 2.63 |
| RX 9070 XT | granitemoe ?B F16 | 64 | pp512 | 2284.57 | 4891.27 | 2.14 |
| RX 9070 XT | granitemoe ?B F16 | 128 | pp512 | 3906.92 | 7383.18 | 1.89 |
| RX 9070 XT | granitemoe ?B F16 | 256 | pp512 | 6833.83 | 11318.59 | 1.66 |
| RX 9070 XT | granitemoe ?B F16 | 512 | pp512 | 9745.38 | 13906.40 | 1.43 |

f32

| GPU | Model | Microbatch size | Test | t/s master | t/s mmf_wmma_rdna4 | Speedup |
|---|---|---|---|---|---|---|
| RX 9070 XT | granitemoe ?B all F32 | 1 | pp512 | 233.70 | 233.54 | 1.00 |
| RX 9070 XT | granitemoe ?B all F32 | 2 | pp512 | 129.47 | 129.24 | 1.00 |
| RX 9070 XT | granitemoe ?B all F32 | 4 | pp512 | 190.47 | 189.91 | 1.00 |
| RX 9070 XT | granitemoe ?B all F32 | 8 | pp512 | 284.44 | 283.31 | 1.00 |
| RX 9070 XT | granitemoe ?B all F32 | 16 | pp512 | 394.93 | 396.71 | 1.00 |
| RX 9070 XT | granitemoe ?B all F32 | 32 | pp512 | 684.60 | 674.48 | 0.99 |
| RX 9070 XT | granitemoe ?B all F32 | 64 | pp512 | 1161.88 | 1155.93 | 0.99 |
| RX 9070 XT | granitemoe ?B all F32 | 128 | pp512 | 1520.65 | 1503.93 | 0.99 |
| RX 9070 XT | granitemoe ?B all F32 | 256 | pp512 | 2322.70 | 2302.76 | 0.99 |
| RX 9070 XT | granitemoe ?B all F32 | 512 | pp512 | 3070.30 | 3056.85 | 1.00 |

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

bf16

| GPU | Model | Microbatch size | Test | t/s master | t/s mmf_wmma_rdna4 | Speedup |
|---|---|---|---|---|---|---|
| RX 9070 XT | qwen2 1.5B BF16 | 1 | pp512 | 156.85 | 157.36 | 1.00 |
| RX 9070 XT | qwen2 1.5B BF16 | 2 | pp512 | 249.60 | 250.06 | 1.00 |
| RX 9070 XT | qwen2 1.5B BF16 | 4 | pp512 | 375.44 | 375.14 | 1.00 |
| RX 9070 XT | qwen2 1.5B BF16 | 8 | pp512 | 744.74 | 744.64 | 1.00 |
| RX 9070 XT | qwen2 1.5B BF16 | 16 | pp512 | 1466.33 | 1466.41 | 1.00 |
| RX 9070 XT | qwen2 1.5B BF16 | 32 | pp512 | 2800.40 | 2801.61 | 1.00 |
| RX 9070 XT | qwen2 1.5B BF16 | 64 | pp512 | 4756.91 | 4725.77 | 0.99 |
| RX 9070 XT | qwen2 1.5B BF16 | 128 | pp512 | 7874.19 | 7881.27 | 1.00 |
| RX 9070 XT | qwen2 1.5B BF16 | 256 | pp512 | 11280.40 | 11417.67 | 1.01 |
| RX 9070 XT | qwen2 1.5B BF16 | 512 | pp512 | 14550.38 | 14548.77 | 1.00 |

f16

| GPU | Model | Microbatch size | Test | t/s master | t/s mmf_wmma_rdna4 | Speedup |
|---|---|---|---|---|---|---|
| RX 9070 XT | qwen2 1.5B F16 | 1 | pp512 | 156.62 | 156.64 | 1.00 |
| RX 9070 XT | qwen2 1.5B F16 | 2 | pp512 | 244.89 | 245.62 | 1.00 |
| RX 9070 XT | qwen2 1.5B F16 | 4 | pp512 | 398.09 | 397.34 | 1.00 |
| RX 9070 XT | qwen2 1.5B F16 | 8 | pp512 | 797.27 | 794.64 | 1.00 |
| RX 9070 XT | qwen2 1.5B F16 | 16 | pp512 | 1593.95 | 1590.22 | 1.00 |
| RX 9070 XT | qwen2 1.5B F16 | 32 | pp512 | 3024.40 | 3024.65 | 1.00 |
| RX 9070 XT | qwen2 1.5B F16 | 64 | pp512 | 5148.45 | 5153.55 | 1.00 |
| RX 9070 XT | qwen2 1.5B F16 | 128 | pp512 | 8440.55 | 8480.39 | 1.00 |
| RX 9070 XT | qwen2 1.5B F16 | 256 | pp512 | 12806.48 | 12623.57 | 0.99 |
| RX 9070 XT | qwen2 1.5B F16 | 512 | pp512 | 18394.79 | 18357.23 | 1.00 |

f32

| GPU | Model | Microbatch size | Test | t/s master | t/s mmf_wmma_rdna4 | Speedup |
|---|---|---|---|---|---|---|
| RX 9070 XT | qwen2 1.5B all F32 | 1 | pp512 | 99.65 | 99.63 | 1.00 |
| RX 9070 XT | qwen2 1.5B all F32 | 2 | pp512 | 185.29 | 185.13 | 1.00 |
| RX 9070 XT | qwen2 1.5B all F32 | 4 | pp512 | 343.06 | 342.11 | 1.00 |
| RX 9070 XT | qwen2 1.5B all F32 | 8 | pp512 | 494.18 | 494.05 | 1.00 |
| RX 9070 XT | qwen2 1.5B all F32 | 16 | pp512 | 725.68 | 725.01 | 1.00 |
| RX 9070 XT | qwen2 1.5B all F32 | 32 | pp512 | 1146.47 | 1144.16 | 1.00 |
| RX 9070 XT | qwen2 1.5B all F32 | 64 | pp512 | 1404.70 | 1407.22 | 1.00 |
| RX 9070 XT | qwen2 1.5B all F32 | 128 | pp512 | 846.03 | 844.38 | 1.00 |
| RX 9070 XT | qwen2 1.5B all F32 | 256 | pp512 | 1263.08 | 1267.08 | 1.00 |
| RX 9070 XT | qwen2 1.5B all F32 | 512 | pp512 | 1367.25 | 1364.48 | 1.00 |

Best Regards
Hui

Collaborator

@JohannesGaessler JohannesGaessler left a comment

Thank you, I'll merge as soon as the CI passes. (Also sorry for forgetting that the code is currently not being used for dense models, you wouldn't actually have had to test those.)

@zhang-hui-yulo
Contributor Author

> Thank you, I'll merge as soon as the CI passes. (Also sorry for forgetting that the code is currently not being used for dense models, you wouldn't actually have had to test those.)

Thank you for the support. Don't worry about the dense models; I will submit another PR to enable them once this PR is merged.

@unverbraucht

I made a simple attempt at porting this great work to my RX 7900 XT (RDNA3). A quick test shows an amazing speed-up, though I'll have to do more tests to see how this pans out:

GPU Model Microbatch size Test t/s master t/s rdna4-wmma Speedup
RX 7900 XT granitemoe ?B F16 1 pp512 285.39 282.26 0.99
RX 7900 XT granitemoe ?B F16 2 pp512 124.73 700.05 5.61
RX 7900 XT granitemoe ?B F16 4 pp512 205.62 1164.44 5.66
RX 7900 XT granitemoe ?B F16 8 pp512 343.39 1956.24 5.70
RX 7900 XT granitemoe ?B F16 16 pp512 640.36 3289.75 5.14
RX 7900 XT granitemoe ?B F16 32 pp512 1062.56 5962.28 5.61
RX 7900 XT granitemoe ?B F16 64 pp512 2045.46 10442.43 5.11
RX 7900 XT granitemoe ?B F16 128 pp512 3620.09 16930.42 4.68
RX 7900 XT granitemoe ?B F16 256 pp512 6008.61 23723.27 3.95
RX 7900 XT granitemoe ?B F16 512 pp512 9207.02 29515.52 3.21
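
For reference, the Speedup column in the table above is the branch throughput divided by the master throughput, rounded to two decimals. A minimal C++ sketch of that arithmetic follows; the helper names `speedup` and `round2` are hypothetical and are not part of llama.cpp or its benchmarking tools.

```cpp
#include <cmath>

// Hypothetical helper (illustration only): ratio of branch throughput to
// master throughput, both measured in tokens per second.
double speedup(double tps_master, double tps_branch) {
    return tps_branch / tps_master;
}

// Round to two decimal places, matching the precision shown in the table.
double round2(double x) {
    return std::round(x * 100.0) / 100.0;
}
```

With the microbatch-size 2 row, `round2(speedup(124.73, 700.05))` gives 5.61, and with the microbatch-size 512 row, `round2(speedup(9207.02, 29515.52))` gives 3.21, matching the reported values.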

I don't want to hijack this PR, so please let me know if now is a good time to open a PR for RDNA3. In any case I'll wait until this is merged.

@JohannesGaessler
Collaborator

Currently I'm in the process of reducing technical debt in the FlashAttention code by reducing the number of kernels that I need to maintain. I want to drop the kernel in fattn-wmma-f16.cu since the design is fundamentally bad and I only wrote it like this because I didn't know better. However, as of right now it's still used for Volta, RDNA3/4, and CDNA. I'll soon make a PR that adds Volta support to the comparatively much better kernel in fattn-mma-f16.cuh. During that PR I will make some changes to the templates in mma.cuh. Thank you for the offer to help; I'll notify you when I make the PR.

@JohannesGaessler JohannesGaessler merged commit 028f93e into ggml-org:master Nov 21, 2025
74 checks passed
@jiachengjason
Contributor

> > @zhang-hui-yulo can you tell me if and when you intend to work on FA support or better MMF performance? That would make it easier for me to schedule my own concurrent work to avoid conflicts.
>
> Hello @JohannesGaessler, as I'm still not very familiar with llama.cpp internal code, I think my schedule shall be
>
>   1. porting MMF to RDNA3, keep the original logic to see if the performance is good enough.
>   2. porting FA to RDNA4, keep the original logic to see if the performance is good enough.
>   3. better MMF or FA for RDNA4 or RDNA3.
>
> I will start them once this PR is approved.
>
> Also I suggest you to put FA on RDNA3 to low priority as RDNA3 wmma isn't suitable for gemm fusion, you need shared memory to rearrange the layout for D matrix of QK.

Hi @zhang-hui-yulo, I am also in the middle of implementing WMMA instructions on RDNA3. Let's connect to prevent duplicated effort. Please connect at [email protected]

@zhang-hui-yulo
Contributor Author

> > > @zhang-hui-yulo can you tell me if and when you intend to work on FA support or better MMF performance? That would make it easier for me to schedule my own concurrent work to avoid conflicts.
> >
> > Hello @JohannesGaessler, as I'm still not very familiar with llama.cpp internal code, I think my schedule shall be
> >
> >   1. porting MMF to RDNA3, keep the original logic to see if the performance is good enough.
> >   2. porting FA to RDNA4, keep the original logic to see if the performance is good enough.
> >   3. better MMF or FA for RDNA4 or RDNA3.
> >
> > I will start them once this PR is approved.
> > Also I suggest you to put FA on RDNA3 to low priority as RDNA3 wmma isn't suitable for gemm fusion, you need shared memory to rearrange the layout for D matrix of QK.
>
> Hi @zhang-hui-yulo I am also in the middle of implementing WMMA instructions on RDNA3, let's connect to prevent duplicated efforts. Please connect at [email protected]

Hello @jiachengjason, I haven't started it yet as I'm still focusing on RDNA4. The person you should align with is @unverbraucht; it looks like he has already finished this work in his private repo.

@unverbraucht

@jiachengjason I did a naive implementation and in my testing I saw great performance, even exceeding what @zhang-hui-yulo saw here for MoE models. I'll open a PR and I'd be glad for help. I'm a novice when it comes to HIP.
