HIP: RDNA4 tensor core support for MMF #17077
Conversation
On RDNA, the WMMA instructions do not, to my knowledge, increase peak FLOPS; they only reduce I/O and register usage.
Yes, sorry, that was a bug that I introduced.
Thank you for the tip. AFAIK, the tensor cores on RDNA3 share the same silicon as the vector units, while RDNA4 redesigns them to be more CDNA-like. But at the very least it should not be slower than hipBLAS; I will spend some time finding the root cause. One thing I already know is that the HIP compiler does not allocate registers very well.
Looking at the data layout, I suspect the biggest problem is shared memory bank conflicts, or whatever the AMD term for that is. For NVIDIA I chose the shared memory layout to be padded with 16 bytes because the dedicated
Thank you for the tips. There is little public information on AMD's bank layout. Based on the limited documentation I have, RDNA3 has 32 banks in CU mode and 64 banks in WGP mode (WGP mode is the default), and the bank width is a DWORD. I don't have any documentation for RDNA4, so I assume it is similar to before; I therefore didn't change any code logic in mmf, just adapted the WMMA instructions. Based on the WMMA layout of RDNA4, I kept the old ldmatrix logic and used vectorized loads in load_generic. Honestly, I'm not sure whether shared memory bank conflicts are the root cause.
Are you aware of the AMD ISA documentation?
Honestly, not very much, as it isn't friendly to software developers. Based on the GEMM benchmarks from my modified NVIDIA CUTE for RDNA3, the bank layout is the same as NVIDIA Ampere. I just checked the RDNA4 ISA: it's 32 banks in CU mode and 64 banks in WGP mode, with a bank width of 4 bytes.
I think I've found the root cause: mul_mat_f is slower than hipBLAS, while mul_mat_f_ids is better than hipBLAS.
MUL_MAT
MUL_MAT_ID
How about this: for now we move towards merging this PR but only enable it for
Thank you for the support, this is also what I'm thinking: disable mul_mat_f on RDNA4 first and try to write an RDNA4-optimized version in the future. I also presume that the HIP compiler generates better code on RDNA3 than on RDNA4; I will test on my 7900XTX next week. Anyway, could you please review it first? One thing to note is that the HIP compiler cannot handle early-return code: it will still compile the code after the return. Also, I don't see a performance improvement with real models like llama3-8b-fp16 and deepseek-r1-8b-fp16 at batch sizes 1~16; it looks like I need to test with batch 512, right?
@zhang-hui-yulo can you tell me if and when you intend to work on FA support or better MMF performance? That would make it easier for me to schedule my own concurrent work to avoid conflicts.
Hello @JohannesGaessler, as I'm still not very familiar with llama.cpp's internal code, I think my schedule would be
I will start on them once this PR is approved. Also, I suggest putting FA on RDNA3 at low priority, as RDNA3 WMMA isn't suitable for GEMM fusion; you need shared memory to rearrange the layout of the D matrix of QK.
@zhang-hui-yulo as it turns out I'll need to touch the MMA FA kernel in the near future regardless of additional hardware support, so I'd suggest we do it like this: first I make some changes to the MMA FA kernel, during which I'll also add Volta support. Afterwards you can add AMD WMMA support, with the Volta implementation serving as a checklist for where in the code changes are necessary due to the different data layout.
I agree; please move forward first and I will build on your changes. In any case, I still need some time to find a good way to support C = B * A for RDNA4 and CDNA3; maybe adding a new tile class is enough.
JohannesGaessler
left a comment
After these two nitpicks are fixed and the CI passes I will approve.
Hello @JohannesGaessler, may I ask if there is anything more I need to do? It looks like the Vulkan test failed, but I didn't modify any Vulkan code.
JohannesGaessler
left a comment
Sorry, I forgot to press the submit button for my review. It's really just a minor nitpick. I'll test the performance on my RX 9060 XT and then merge.
I will soon make a PR that refactors the MMA FlashAttention kernel to allow for more flexible tile sizes and adds Volta support. The MMA FA kernel on master has some bugs where the wrong variable is being used (but this randomly doesn't matter because both variables have the same value). So I would recommend you wait for that PR until you start working on MMA.
Hello @JohannesGaessler Please move forward with your refactor first; I need some time to find the root cause of why mmf is so slow on RDNA4. Your code extends the k-dim across warps, and I don't think that has any issue. I suspect the ROCm compiler cannot generate high-performance code for RDNA4: when I add debug code in mul_mat_f at line 156, mmf becomes way faster than before. I'm trying to find the difference in the generated assembly, and then I shall file a bug against the ROCm compiler.
Before and after performance results on 9070XT with deepseek r1 f16:
Best Regards
I'm very sorry, but I'm currently traveling and I can't get my machine with the RDNA 4 GPU to start remotely using wake-on-LAN, so I currently don't have a way to test performance. Merging this PR will either have to wait until Saturday when I'm back home, or you'll have to run the test yourself. What I'd ask you to do is run `-r 1 -fa 1 -n 0 -ub "1-512*2" --progress -o sql | sqlite3 llama-bench.sqlite` both for a small MoE model (I suggest Granite MoE) and for any small dense model, using FP16, BF16, and FP32 precision for each model. After that, create a table with `python3 scripts/compare-llama-bench.py -s gpu_info,model_type,n_ubatch -i llama-bench.sqlite`.
Don't worry about this; I will spare some time to do the test on my 9070XT (gfx1201). As the ROCm compiler generates different code for different archs in the same series (e.g., gfx1100 code is different from gfx1101), I suggest you also test on your 9060XT (gfx1200). Also, I'm still adjusting the mmf optimization based on printf to make the compiler use more registers; the performance increases by about 50% but the code is ugly. This also needs your 9060 to evaluate the performance, so please wait 2~3 days for me.
Hello @JohannesGaessler I just wrote an ugly macro to force the ROCm compiler to use more registers for mul_mat_f; could you please check on your 9060 whether the result is similar to my 9070XT? I think it's a bug of the ROCm compiler for gfx1201. mul_mat_f_ids seems to be fine, so I only modified mul_mat_f. If you think this change is acceptable, I suggest the following steps:
I've just attached the changed file on the mmf_wmma_rdna4 branch for review first. Compile command on Ubuntu 24.04.3 with ROCm 7.1.0:
mul_mat performance before and after MMF_REGISTER_UNROLL_FOR_RDNA on the mmf_wmma_rdna4 branch:
deepseek r1 8b fp16 performance before and after MMF_REGISTER_UNROLL_FOR_RDNA on the mmf_wmma_rdna4 branch:
Best Regards
You can use tools like Claude, Gemini, or ChatGPT to help summarize it for you. See the RDNA4 ISA at https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna4-instruction-set-architecture.pdf. There are also tools like https://github.com/ROCm/amd_matrix_instruction_calculator which can help you understand the low-level details better. Speaking of RDNA4, I also recommend taking a look at
Please feel free to create an issue at https://github.com/llvm/llvm-project/issues about it. BTW, in case you haven't already, try playing around with launch_bounds to adjust VGPR usage. Re. benchmarks, it would be good to share benchmarks of master vs. mmf_wmma_rdna4 with and without
Thank you for the tip. About the transposed load: I've read https://gpuopen.com/learn/accelerating_generative_ai_on_amd_radeon_gpus/. Honestly, I would suggest adding a transposed load for shared memory; it's more useful than a global load. Speaking of transposed loads, I'm looking for a movmatrix replacement on RDNA4 or later, as the MMA layout isn't friendly for transposing in registers; the version I wrote is extremely slow due to duplicated data loading. NVIDIA's MMA layout is friendly for in-register transposes, so it's easy to write a software version even without the movmatrix instruction. About the MMF_REGISTER_UNROLL_FOR_RDNA data, master vs. mmf_wmma_rdna4: I think using mmf_wmma_rdna4 is enough. You can check mmf.cu in the commit; mul_mat_f is disabled by default on RDNA4 due to the performance drop, and the code submitted in this PR uses the hipBLAS path.
Hello @JohannesGaessler, any comments on this proposal? If you don't object, I'd prefer to push this commit into this PR; then I can do the final test based on your requirements before the merge, thank you.
Let's keep this PR simple, please. From my end I've already reviewed it as-is; I just currently cannot test the performance. And it's easier for me to rebase my code on top of this PR once it's on master rather than keep stacking changes on divergent branches. So I would prefer that the thing you're describing be done in a follow-up PR.
Got it. I will test the model tomorrow; once this PR is merged, I will create another PR to enable mul_mat_f for RDNA4.
Hello @JohannesGaessler, I've finished the test based on your requirements; here is the data. If it's good, could you merge this PR? Then I will create a new PR to optimize and enable mul_mat_f for RDNA4, thank you. https://huggingface.co/ibm-granite/granite-3.1-1b-a400m-instruct bf16
f16
f32
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B bf16
f16
f32
Best Regards
Thank you, I'll merge as soon as the CI passes. (Also sorry for forgetting that the code is currently not being used for dense models, you wouldn't actually have had to test those.)
Thank you for the support. Don't worry about the dense models; I will submit another PR to enable it once this PR is merged.
I made a simple attempt at porting this great work to my RX 7900 XT (RDNA3). A quick test shows an amazing speed-up; I'll have to do more tests to see how this pans out:
I don't want to hijack this PR, so please let me know if now is a good time to open a PR for RDNA3. In any case I'll wait until this is merged.
Currently I'm in the process of reducing technical debt in the FlashAttention code by reducing the number of kernels that I need to maintain. I want to drop the kernel in
Hi @zhang-hui-yulo I am also in the middle of implementing WMMA instructions on RDNA3, let's connect to prevent duplicated efforts. Please connect at [email protected] |
Hello @jiachengjason, I haven't started it yet as I'm still focusing on RDNA4. The one you should align with is @unverbraucht; it looks like he has finished this work in his private repo.
@jiachengjason I did a naive implementation and in my testing I saw great performance, even exceeding what @zhang-hui-yulo saw here for MoE models. I'll open a PR and I'd be glad for help; I'm a novice when it comes to HIP.
Add RDNA4 tensor core support for MMF; honestly, the performance is lower than expected. The model is at https://huggingface.co/Mungert/DeepSeek-R1-0528-Qwen3-8B-GGUF
@JohannesGaessler, it looks like #16988 changes mmf.cu to
so the native mmf won't be exercised on my RDNA4; it always uses the hipBLAS path.