
cuBLAS memory management update #1149


Closed
wants to merge 2 commits into from

Conversation

cmp-nct
Contributor

@cmp-nct cmp-nct commented Apr 24, 2023

I changed the memory management.
The current variant only supports 16 allocated free buffers, and it hands out the first free one even if a better-sized buffer is available.

The new method comes with a couple of changes:

  1. It raises the buffer limit to 512 (previously the limit was kept low because every lookup went through a linear loop).
  2. It introduces a size check, which the previous buffer-count-only scheme did not enforce.
  3. It maintains a size-ordered list of the allocated, available memory buffers.
  4. It uses binary search to insert and to take memory blocks (sketched at the end of this comment).

It currently hardcodes a 2 GB VRAM cap so it fits all cards; when that cap or the buffer count is exceeded, it reports the error on stderr like the previous variant did.

I tested it on a generic test case with thousands of buffers and with llama, and it appears to work fine.
It's quite a bit of fresh code, so no guarantees.
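To make the approach concrete, here is a minimal sketch of the size-ordered, best-fit pool described above. This is illustrative only, not the code in this PR; the names, the stderr messages and the bookkeeping are placeholders, and only the 512-buffer limit and 2 GB cap follow the description above.

```cpp
// Sketch of a size-ordered CUDA buffer pool with binary-search insert/take.
// Illustrative only; not the code from this PR.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define POOL_MAX_BUFFERS 512          // raised buffer limit described above
#define POOL_MAX_VRAM    (2ull << 30) // hardcoded 2 GB cap described above

struct pool_entry { void * ptr; size_t size; };

static pool_entry g_pool[POOL_MAX_BUFFERS]; // free buffers, sorted by size ascending
static int        g_pool_count = 0;
static size_t     g_pool_vram  = 0;         // total bytes currently parked in the pool

// index of the smallest free buffer whose size is >= size (binary search, lower bound)
static int pool_lower_bound(size_t size) {
    int lo = 0, hi = g_pool_count;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (g_pool[mid].size < size) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// take the best-fitting free buffer, or cudaMalloc a new one
static void * pool_malloc(size_t size, size_t * actual_size) {
    int i = pool_lower_bound(size);
    if (i < g_pool_count) {
        void * ptr   = g_pool[i].ptr;
        *actual_size = g_pool[i].size;
        g_pool_vram -= g_pool[i].size;
        memmove(&g_pool[i], &g_pool[i + 1], (g_pool_count - i - 1) * sizeof(pool_entry));
        g_pool_count--;
        return ptr;
    }
    void * ptr = nullptr;
    if (cudaMalloc(&ptr, size) != cudaSuccess) {
        fprintf(stderr, "pool_malloc: cudaMalloc of %zu bytes failed\n", size);
        return nullptr;
    }
    *actual_size = size;
    return ptr;
}

// return a buffer to the pool, keeping it sorted; spill to cudaFree when full
static void pool_free(void * ptr, size_t size) {
    if (g_pool_count >= POOL_MAX_BUFFERS || g_pool_vram + size > POOL_MAX_VRAM) {
        fprintf(stderr, "pool_free: pool limit reached, releasing buffer\n");
        cudaFree(ptr);
        return;
    }
    int i = pool_lower_bound(size);
    memmove(&g_pool[i + 1], &g_pool[i], (g_pool_count - i) * sizeof(pool_entry));
    g_pool[i].ptr  = ptr;
    g_pool[i].size = size;
    g_pool_count++;
    g_pool_vram += size;
}
```

Returning the actual size through an out parameter keeps the bookkeeping consistent when a buffer slightly larger than requested is handed out.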

@slaren
Member

slaren commented Apr 24, 2023

This is my plan to improve the cuBLAS performance and memory usage:
Knowing that we only perform one mat mul at a time, we don't need more than 4 buffers, one for each of the d_X, d_Y, d_D and d_Q matrices. If we are able to predict the maximum sizes correctly, we would never need to allocate more than these four buffers, and this can be done fairly easily by adding some checks to ggml_graph_compute. Additionally, to avoid having to re-upload the model weights constantly and use the available VRAM more efficiently, the d_Q buffer could be replaced by a cache. That's what I have been working towards in my cuda-cache branch.

So for what I am doing, there is no need at all to have more buffers; on the contrary, the memory pool code could be simplified further once we know that we will never need to reallocate these buffers. Do you have any specific plans that would benefit from these changes?
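For illustration, this is roughly what the fixed four-buffer scheme could look like, assuming the maximum sizes are computed up front by a pre-pass over the graph. The names and the size-measuring pass are placeholder assumptions, not code from the cuda-cache branch.

```cpp
// Sketch of the fixed four-buffer scheme (one buffer per d_X, d_Y, d_D, d_Q).
// Illustrative only; the size-measurement pass is a hypothetical placeholder.
#include <cstddef>
#include <cuda_runtime.h>

enum cuda_buf { BUF_X, BUF_Y, BUF_D, BUF_Q, BUF_COUNT };

static void * g_bufs [BUF_COUNT] = { nullptr, nullptr, nullptr, nullptr };
static size_t g_sizes[BUF_COUNT] = { 0, 0, 0, 0 };

// Allocate each buffer once, sized to the largest mat mul in the graph.
// max_sizes[] would come from a pre-pass over the graph (e.g. checks added
// around ggml_graph_compute).
static cudaError_t cuda_buffers_init(const size_t max_sizes[BUF_COUNT]) {
    for (int i = 0; i < BUF_COUNT; ++i) {
        cudaError_t err = cudaMalloc(&g_bufs[i], max_sizes[i]);
        if (err != cudaSuccess) return err;
        g_sizes[i] = max_sizes[i];
    }
    return cudaSuccess;
}

// Each mat mul reuses the same buffers, so there is no per-op allocation,
// no free, and no pool logic at all.
static void * cuda_buffer(cuda_buf which) { return g_bufs[which]; }

static void cuda_buffers_free(void) {
    for (int i = 0; i < BUF_COUNT; ++i) {
        cudaFree(g_bufs[i]);
        g_bufs[i]  = nullptr;
        g_sizes[i] = 0;
    }
}
```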

@SlyEcho
Collaborator

SlyEcho commented Apr 24, 2023

We could also investigate some more advanced memory management, like cudaMallocManaged, to have a single pointer valid in both device and host memory space; then we don't need to copy the data manually from one place to the other, but can just use the prefetch command to make it available on the device or the host as required.

Performance-wise I don't know if it would make a difference.
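As a rough, untested outline of the idea (assuming a device that supports managed memory and prefetching; this is only a sketch, not a patch):

```cpp
// Sketch of using managed (unified) memory for a tensor buffer.
// Illustrative only; assumes managed-memory and prefetch support on the device.
#include <cuda_runtime.h>

static float * alloc_shared(size_t n_floats) {
    float * ptr = nullptr;
    // One pointer valid on both host and device; pages migrate on demand.
    cudaMallocManaged(&ptr, n_floats * sizeof(float));
    return ptr;
}

static void move_to_device(float * ptr, size_t n_floats, int device, cudaStream_t stream) {
    // Hint the driver to migrate the pages to the GPU before the mat mul runs,
    // so the kernel does not stall on page faults.
    cudaMemPrefetchAsync(ptr, n_floats * sizeof(float), device, stream);
}

static void move_to_host(float * ptr, size_t n_floats, cudaStream_t stream) {
    // cudaCpuDeviceId prefetches the pages back to host memory.
    cudaMemPrefetchAsync(ptr, n_floats * sizeof(float), cudaCpuDeviceId, stream);
}
```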

@cmp-nct
Contributor Author

cmp-nct commented Apr 24, 2023

I think a prefetch/cache is certainly the way to go; there is a ton of room for improvement over the current implementation.

Regarding my variant: it was not tailored to llama; I'm just posting it here because this is currently the only ggml project that actually uses cuBLAS.
If you use the same four buffers each time it won't be necessary, but sooner or later another idea will come up and more buffers will be used, or another project will use cuBLAS and need more than a few buffers.

The current implementation would hand a 4 GB allocation to a 1 MB malloc request if that buffer happens to be the first one available.

So it's more a ggml improvement than a llama improvement.
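As a toy illustration of that first-fit problem (made-up sizes and host-only code, not the actual ggml pool):

```cpp
// Toy host-side illustration of first-fit vs best-fit buffer selection.
// Not the actual ggml pool code; sizes are made up to show the waste.
#include <cstddef>
#include <cstdio>
#include <vector>
#include <algorithm>

struct buf { size_t size; };

int main() {
    // Free pool: a 4 GB buffer happens to sit in front of a 1 MB buffer.
    std::vector<buf> pool = { { 4ull << 30 }, { 1ull << 20 } };
    const size_t request = 1ull << 20; // 1 MB request

    // First-fit (old behaviour): take the first buffer that is big enough.
    auto first = std::find_if(pool.begin(), pool.end(),
                              [&](const buf & b) { return b.size >= request; });

    // Best-fit (this PR): take the smallest buffer that is big enough.
    auto best = std::min_element(pool.begin(), pool.end(),
                                 [&](const buf & a, const buf & b) {
                                     if (a.size < request) return false; // too small sorts last
                                     if (b.size < request) return true;
                                     return a.size < b.size;
                                 });

    printf("first-fit hands out %zu bytes, best-fit hands out %zu bytes\n",
           first->size, best->size);
    return 0;
}
```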

@slaren
Member

slaren commented Apr 24, 2023

@SlyEcho I gave that a try, but cudaMemPrefetchAsync doesn't work on my machine, and without it the performance is very bad, so at least for me this doesn't seem to be a workable solution.

@SlyEcho
Collaborator

SlyEcho commented Apr 24, 2023

@slaren that sucks.

What about cudaHostAlloc() and cudaHostGetDevicePointer()?
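For reference, a minimal sketch of that zero-copy approach: pinned host memory mapped into the device address space with cudaHostAlloc(cudaHostAllocMapped) plus cudaHostGetDevicePointer(). Whether it beats explicit copies depends entirely on the hardware; this is only an illustration.

```cpp
// Sketch of zero-copy (mapped pinned) host memory; illustrative only.
#include <cuda_runtime.h>

static int map_host_buffer(size_t size, void ** host_ptr, void ** device_ptr) {
    // On older drivers, cudaSetDeviceFlags(cudaDeviceMapHost) must be called
    // before any other CUDA work for mapped allocations to be usable.
    // Pinned host allocation that is also mapped into the device address space.
    if (cudaHostAlloc(host_ptr, size, cudaHostAllocMapped) != cudaSuccess) {
        return -1;
    }
    // Device-side alias of the same memory; kernels access it over the bus,
    // so no explicit cudaMemcpy is needed (at the cost of higher access latency).
    if (cudaHostGetDevicePointer(device_ptr, *host_ptr, 0) != cudaSuccess) {
        cudaFreeHost(*host_ptr);
        return -1;
    }
    return 0;
}
```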
