VRAM optimization + matrix multiplication discussion #1935
Conversation
I briefly wrote on this topic here: #1867 (comment) (ignore the Metal-specific parts). It is something that we definitely want to implement in
Regarding the allocation order: yes, it is bad. A while ago, I fixed that in a (hopelessly outdated) branch by calculating the maximum required memory during the initial pass in

The logic is a lot more complicated now, so that won't be so easy, but it should still be possible using a similar approach. To also avoid wasting memory when the batch size increases, you could just free the entire pool and reallocate it when that happens, or make an initial dry run with the maximum batch size.

I agree that eventually the best solution will be to implement our own matmul kernels. The best way to learn how to do that may be to look into how CUTLASS does it. CUTLASS is an open-source library that contains many of the kernels used in cuBLAS. It would also be possible to use CUTLASS directly, but it is a heavy dependency that we are probably not interested in adding to ggml. This article may also be interesting: https://siboehm.com/articles/22/CUDA-MMM
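A minimal sketch of the "free the pool and reallocate when the batch size grows" idea, assuming a single reusable scratch buffer (all names here are made up for illustration, not the actual ggml-cuda pool):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative grow-only scratch buffer: it is rebuilt only when a request exceeds
// the current capacity, e.g. because the batch size went up.
static void * g_scratch          = nullptr;
static size_t g_scratch_capacity = 0;

static void * scratch_require(size_t size) {
    if (size > g_scratch_capacity) {
        // the required size grew: drop the old buffer entirely instead of keeping
        // both the old and the new allocation around at the same time
        if (g_scratch != nullptr) {
            cudaFree(g_scratch);
        }
        cudaMalloc(&g_scratch, size);
        g_scratch_capacity = size;
    }
    return g_scratch;
}
```

Alternatively, a dry run over the graph at the maximum batch size would yield the largest required size up front, so the buffer would only ever need to be allocated once.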
I think I prefer to let @slaren and @ggerganov decide whether to merge. The change reduces VRAM usage for the larger models but increases it for the 7B model (and I guess even more relative to master for e.g. OpenLLaMA 3B), which means that people with low-end devices will be able to fit even fewer layers on the GPU.
I don't mean for this to get merged; I just thought that the effects of a simple attempt at fixing VRAM usage would be relevant to the broader discussion of what to do about it, so I made a PR that combines discussion with code.
Isn't the simplest possible implementation of the product between a quantized matrix and a float32 matrix simply a series of the dot products we already have? Before even trying to optimize it with the tricks used in the various BLAS implementations, doesn't it make sense to first try this simplest possible version and see how its performance compares against what we currently do (dequantize plus cuBLAS)?
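For the sake of discussion, this is roughly what that "just a series of dot products" version would look like (a sketch with made-up names; vec_dot_q_f32 stands in for the existing per-row dot product routines, it is not the actual ggml function):

```c
#include <stddef.h>

// Naive product of a quantized matrix A (n x k) with a float matrix B (k x m),
// writing f32 results to C (n x m): one quantized-vs-float dot product per output
// element, i.e. exactly the code path token generation already uses, just in a loop.
// B is assumed to be stored column-major so that each column is contiguous.
void mul_mat_q_f32_naive(const void * A_q, size_t row_size_q, const float * B, float * C,
                         int n, int m, int k,
                         float (*vec_dot_q_f32)(int k, const void * x, const float * y)) {
    for (int i = 0; i < n; ++i) {
        const void * a_row = (const char *) A_q + (size_t) i * row_size_q;
        for (int j = 0; j < m; ++j) {
            C[i * m + j] = vec_dot_q_f32(k, a_row, B + (size_t) j * k);
        }
    }
}
```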
As a follow-up to my previous comment, here is an interesting data point: on a Ryzen 7950X CPU, perplexity calculation is faster if I disable OpenBLAS. Which means that on that CPU, the simplest possible matrix multiplication logic as implemented in
I'm primarily thinking about the CUDA implementation since I'm trying to optimize VRAM. For that, when I profiled it, dequantization + cuBLAS on average takes 5.5 µs per call and token for 33b q4_0. The runtime of just applying dot products would be comparable to that of a dequantize_mul_mat_vec_kernel, which sits at 84.7 µs per call and token. So doing it like that would literally be 15 times slower (84.7 / 5.5 ≈ 15.4): you'd be processing the prompt at the same speed at which you generate tokens.

The profiling data also suggests that the actual matrix multiplication takes ~4 times as long as the dequantization. And since the dequantization is essentially just I/O that could be optimized away by tensor fusion, a kernel that reaches 80% of the performance of cuBLAS should make the program faster overall (the actual threshold is lower because you also save some I/O for the matrix multiplication).
On my system, too, the performance with OpenBLAS is unexpectedly bad, but it's still faster than the ggml implementation. To me this suggests that something is wrong with OpenBLAS or with how it's used in ggml, rather than that the simple algorithm is good. Either that or the matrices used are too small to be sensitive to the difference. If you just do one dot product per entry of the matrix that you want to calculate, you end up with a lot of potential cache misses, which kills your performance once the matrices get too big to fit into cache.
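To make the cache argument concrete, the standard fix on the CPU side is loop blocking/tiling; a minimal sketch for plain f32 (block size and names are illustrative):

```c
// Blocked f32 matmul: C += A * B with A (n x k), B (k x m), C (n x m), all row-major,
// C assumed zero-initialized. Working on BLK x BLK sub-blocks keeps the touched parts
// of A, B and C in cache instead of streaming a full row/column per output element.
#define BLK 32

void mul_mat_f32_blocked(const float * A, const float * B, float * C, int n, int m, int k) {
    for (int i0 = 0; i0 < n; i0 += BLK)
    for (int l0 = 0; l0 < k; l0 += BLK)
    for (int j0 = 0; j0 < m; j0 += BLK)
        for (int i = i0; i < i0 + BLK && i < n; ++i)
        for (int l = l0; l < l0 + BLK && l < k; ++l) {
            const float a = A[i * k + l];
            for (int j = j0; j < j0 + BLK && j < m; ++j) {
                C[i * m + j] += a * B[l * m + j];
            }
        }
}
```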
I'm not sure I can follow the math. Let's look at token prediction. On my RTX-4080 this takes 9 ms/token for
This is where your logic is going wrong. When you multiply two square matrices of size N, the amount of computation scales with N^3 while the amount of data only scales with N^2, so for large matrices the arithmetic dominates the memory transfers. Caveat: there are algorithms that need less than N^3 operations.
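For concreteness, the usual arithmetic-intensity estimate behind that statement, assuming f32 operands and counting a multiply and an add per inner-product term (the byte count is the idealized minimum, i.e. each matrix moved exactly once):

```math
\text{FLOPs} = 2N^3, \qquad \text{bytes} \approx 3 \cdot 4N^2, \qquad \frac{2N^3}{12N^2} = \frac{N}{6}\ \text{FLOP/byte}
```

For N = 4096 that is roughly 683 FLOP per byte, which is why large matrix multiplications end up compute-bound rather than bandwidth-bound, provided the kernel actually reuses the data (which is what the tiling discussion below is about).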
@JohannesGaessler
That combined frees up about 1 GB of VRAM that can be used for quantized offloading instead.
@cmp-nct There definitely are ways to optimize the current method for prompt processing. I'm just thinking that a custom matrix multiplication kernel would make such optimizations unnecessary, so I wanted to talk about that.
That would only be true if the two matrices fit completely in fast cache, no? If we want to do big-O analysis, let's look at a processor where exactly one row from the left matrix (

When I suggested that the simplest possible matrix multiplication is just a series of dot products, I did not mean literally just individual dot products. Instead, a relatively simple change is to make the dot product function multiply
It is true regardless of how much cache you have. In practice, however, the amount of cache/CUDA shared memory is a limitation: with tiling algorithms, for example, you get better performance with larger tiles, but those tiles have to fit into cache/VRAM to properly reduce memory accesses.

You're free to try but I highly doubt it.
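For reference, this is roughly what textbook shared-memory tiling looks like in CUDA for plain f32; a simplified sketch (not the ggml code), launched with a dim3(TILE, TILE) block per output tile:

```cuda
// Tiled f32 matmul C = A * B with A (n x k), B (k x m), C (n x m), all row-major.
// Each block computes a TILE x TILE patch of C; A and B tiles are staged in shared
// memory so every loaded element is reused TILE times instead of once.
#define TILE 32

__global__ void mul_mat_f32_tiled(const float * A, const float * B, float * C,
                                  int n, int m, int k) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;

    for (int t = 0; t < k; t += TILE) {
        // cooperative load of one tile of A and one tile of B, zero-padded at the edges
        As[threadIdx.y][threadIdx.x] = (row < n && t + threadIdx.x < k) ? A[row * k + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < m && t + threadIdx.y < k) ? B[(t + threadIdx.y) * m + col] : 0.0f;
        __syncthreads();

        for (int i = 0; i < TILE; ++i) {
            acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        }
        __syncthreads();
    }

    if (row < n && col < m) {
        C[row * m + col] = acc;
    }
}
```

The tile size is exactly the trade-off mentioned above: larger tiles mean more reuse per load, but the tiles have to fit into shared memory.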
I have a half-hacky solution in ggllm.cpp until a full mul-mat integer kernel is available.

I integrated the f16 multiplication using a 32-bit wrapper and by exchanging the function pointer right in ggml_cuda_op(). That's the hacky part, but anything else would have changed my ggml-cuda implementation too much to maintain later. I'm not super happy that I duplicated all kernels; in hindsight I think a 32->16 wrapper around them would also have been possible. Overall those two changes saved gigabytes of VRAM (depending on the parameters used and the model size).

I'm not sure, but wouldn't it make sense to use 16 bit for everything in ggml-cuda? 32 bit seems so wasteful to me. Maybe some of it is usable:
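A rough illustration of the 32->16 wrapper idea (purely a sketch, not the ggllm.cpp code; mul_mat_f16 and the scratch buffer are hypothetical stand-ins for whatever kernel is being wrapped):

```cuda
#include <cuda_fp16.h>

// Convert an f32 buffer to f16 so that an existing __half kernel can be reused
// behind an f32-facing entry point.
__global__ void convert_f32_to_f16(const float * src, __half * dst, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __float2half(src[i]);
    }
}

// Hypothetical wrapper: stage the f32 input as f16 in a scratch buffer, then call
// the existing f16 matmul through a function pointer.
void mul_mat_f32_via_f16(const float * src_f32, __half * scratch_f16, int n_elements,
                         cudaStream_t stream,
                         void (*mul_mat_f16)(const __half *, cudaStream_t)) {
    const int block = 256;
    const int grid  = (n_elements + block - 1) / block;
    convert_f32_to_f16<<<grid, block, 0, stream>>>(src_f32, scratch_f16, n_elements);
    mul_mat_f16(scratch_f16, stream);
}
```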
@JohannesGaessler Is this related to what I asked? #2118 |
I don't understand Chinese so I have no idea what that repository is doing. Even if I did, that project is very likely not directly comparable to ggml-based projects and I don't want to dig through the source code to find out the differences. |
Superseded by #2160.
Currently the CUDA code allocates temporary buffers during prompt processing via ggml_cuda_pool_malloc to hold the weight matrices dequantized to f32 for cuBLAS. As it turns out, the amount of VRAM used for these temporary buffers is quite substantial: more than 1 GiB for 33b.

One of the problems is that the buffers are allocated in a bad order: there are three relevant sizes for the dequantized matrices and they are allocated from smallest to largest. As a consequence the VRAM allocated for the first two matrices is essentially wasted. In this PR I made a hacky patch that just allocates and frees a single 813 MiB buffer during initialization that can then be reused for all three sizes. For 33b this reduces VRAM usage by ~600 MiB, but 813 MiB is still a lot for temporary buffers that are only used during prompt processing. So I think that a proper solution would be to implement a matrix multiplication that does dequantization on the fly, the same way the dequantize mul mat vec kernel does. The problem is that efficiently parallelizing general matrix multiplication is very hard.
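A sketch of that hack's core idea: warm the pool once at initialization with a maximum-size request so that the three later, smaller requests reuse the same block (the pool functions below are illustrative stand-ins, not the real ggml-cuda signatures):

```cuda
#include <stddef.h>

// Stand-ins for the ggml-cuda buffer pool (illustrative signatures only).
void * cuda_pool_malloc(size_t size, size_t * actual_size);
void   cuda_pool_free(void * ptr, size_t size);

// Request the largest temporary buffer that prompt processing will need and return
// it to the pool immediately. The pool keeps the block, so the three dequantization
// buffers (allocated smallest to largest later on) can all be served from it instead
// of each growing the pool by their own size.
void cuda_warm_pool(size_t max_dequant_bytes) {
    size_t actual_size = 0;
    void * buf = cuda_pool_malloc(max_dequant_bytes, &actual_size);
    cuda_pool_free(buf, actual_size);
}
```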
This brings me to the topic at hand: does anyone have a good idea for how to fuse dequantization and general matrix multiplication in ggml? I think I could at the very least do a basic implementation that greatly reduces VRAM usage, but it may perform significantly worse for prompt processing, especially on GPUs without tensor cores. Ideally I would want to implement a kernel that is efficient both in terms of VRAM and speed.
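To make the question more concrete, here is a minimal illustration of what "dequantize on the fly" could look like when fused into a tiled matmul. The 4-bit block layout below is simplified and not the actual ggml q4_0 struct, and a real kernel would need much more (larger tiles, register blocking, tensor cores) to get anywhere near cuBLAS:

```cuda
#include <stdint.h>

// Simplified 4-bit block layout, *not* the exact ggml q4_0 struct: 32 weights per
// block, one f32 scale, nibbles stored low/high within each byte.
#define QK 32
struct block_q4 {
    float   d;          // scale
    uint8_t qs[QK / 2]; // 32 x 4-bit quants
};

// Dequantize element i (0..QK-1) of one block on the fly.
__device__ float dequant(const block_q4 & b, int i) {
    const uint8_t q = (i % 2 == 0) ? (b.qs[i / 2] & 0x0F) : (b.qs[i / 2] >> 4);
    return ((int) q - 8) * b.d;
}

#define TILE 32  // equal to QK so one tile row of A corresponds to one quant block

// C = A * B with A quantized (n x k), B f32 (k x m), C f32 (n x m), row-major,
// k assumed to be a multiple of QK; launch with a dim3(TILE, TILE) block.
// The dequantized A values only ever exist in shared memory, so no f32 copy of the
// weights is needed in VRAM.
__global__ void mul_mat_q4_f32(const block_q4 * A, const float * B, float * C,
                               int n, int m, int k) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < k; t += TILE) {
        // dequantize one A tile straight into shared memory (this is the "fusion")
        As[threadIdx.y][threadIdx.x] = row < n
            ? dequant(A[(row * k + t) / QK], threadIdx.x)
            : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = col < m ? B[(t + threadIdx.y) * m + col] : 0.0f;
        __syncthreads();

        for (int i = 0; i < TILE; ++i) {
            acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        }
        __syncthreads();
    }

    if (row < n && col < m) {
        C[row * m + col] = acc;
    }
}
```

Even this naive version would keep the dequantized weights out of global memory entirely; the open question is how much extra work it takes for such a kernel not to lose too much prompt-processing speed compared to cuBLAS.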