
cuBLAS memory management update #1149


Closed
wants to merge 2 commits into from

Conversation

cmp-nct
Contributor

@cmp-nct cmp-nct commented Apr 24, 2023

I changed the memory management.
The current variant only supports 16 allocated free buffers, and it hands out the first free one even if a better-sized buffer is available.

The new method comes with a couple of changes:

  1. It raises the buffer limit to 512 (previously the limit was kept low because every lookup went through a linear loop).
  2. It introduces a size check, which the previous buffer-count-only scheme did not enforce.
  3. It maintains a size-ordered list of the allocated, available memory buffers.
  4. It uses binary search to insert and to take memory blocks (sketched at the end of this comment).

It currently hardcodes a 2 GB VRAM cap so it fits all cards; when that cap or the buffer count is exceeded, it reports the error on stderr like the previous variant did.

I tested it on a generic test case with thousands of buffers and with llama, and it appears to work fine.
It's quite a bit of fresh code, so no guarantees.
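To make the approach concrete, here is a minimal sketch of the size-ordered, best-fit pool described above. This is illustrative only, not the code in this PR; the names, the stderr messages and the bookkeeping are placeholders, and only the 512-buffer limit and 2 GB cap follow the description above.

```cpp
// Sketch of a size-ordered CUDA buffer pool with binary-search insert/take.
// Illustrative only; not the code from this PR.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define POOL_MAX_BUFFERS 512          // raised buffer limit described above
#define POOL_MAX_VRAM    (2ull << 30) // hardcoded 2 GB cap described above

struct pool_entry { void * ptr; size_t size; };

static pool_entry g_pool[POOL_MAX_BUFFERS]; // free buffers, sorted by size ascending
static int        g_pool_count = 0;
static size_t     g_pool_vram  = 0;         // total bytes currently parked in the pool

// index of the smallest free buffer whose size is >= size (binary search, lower bound)
static int pool_lower_bound(size_t size) {
    int lo = 0, hi = g_pool_count;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (g_pool[mid].size < size) lo = mid + 1; else hi = mid;
    }
    return lo;
}

// take the best-fitting free buffer, or cudaMalloc a new one
static void * pool_malloc(size_t size, size_t * actual_size) {
    int i = pool_lower_bound(size);
    if (i < g_pool_count) {
        void * ptr   = g_pool[i].ptr;
        *actual_size = g_pool[i].size;
        g_pool_vram -= g_pool[i].size;
        memmove(&g_pool[i], &g_pool[i + 1], (g_pool_count - i - 1) * sizeof(pool_entry));
        g_pool_count--;
        return ptr;
    }
    void * ptr = nullptr;
    if (cudaMalloc(&ptr, size) != cudaSuccess) {
        fprintf(stderr, "pool_malloc: cudaMalloc of %zu bytes failed\n", size);
        return nullptr;
    }
    *actual_size = size;
    return ptr;
}

// return a buffer to the pool, keeping it sorted; spill to cudaFree when full
static void pool_free(void * ptr, size_t size) {
    if (g_pool_count >= POOL_MAX_BUFFERS || g_pool_vram + size > POOL_MAX_VRAM) {
        fprintf(stderr, "pool_free: pool limit reached, releasing buffer\n");
        cudaFree(ptr);
        return;
    }
    int i = pool_lower_bound(size);
    memmove(&g_pool[i + 1], &g_pool[i], (g_pool_count - i) * sizeof(pool_entry));
    g_pool[i].ptr  = ptr;
    g_pool[i].size = size;
    g_pool_count++;
    g_pool_vram += size;
}
```

Returning the actual size through an out parameter keeps the bookkeeping consistent when a buffer slightly larger than requested is handed out.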

@slaren
Member

slaren commented Apr 24, 2023

This is my plan to improve the cuBLAS performance and memory usage:
Knowing that we only perform one mat mul at a time, we don't need more than 4 buffers, one for each of the d_X, d_Y, d_D and d_Q matrices. If we are able to predict the maximum sizes correctly, we would never need to allocate more than these four buffers, and this can be done fairly easily by adding some checks to ggml_graph_compute. Additionally, to avoid having to re-upload the model weights constantly and use the available VRAM more efficiently, the d_Q buffer could be replaced by a cache. That's what I have been working towards in my cuda-cache branch.

So for what I am doing, there is no need at all to have more buffers; on the contrary, the memory pool code could be simplified further once we know that we will never need to reallocate these buffers. Do you have any specific plans that would benefit from these changes?
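For illustration, this is roughly what the fixed four-buffer scheme could look like, assuming the maximum sizes are computed up front by a pre-pass over the graph. The names and the size-measuring pass are placeholder assumptions, not code from the cuda-cache branch.

```cpp
// Sketch of the fixed four-buffer scheme (one buffer per d_X, d_Y, d_D, d_Q).
// Illustrative only; the size-measurement pass is a hypothetical placeholder.
#include <cstddef>
#include <cuda_runtime.h>

enum cuda_buf { BUF_X, BUF_Y, BUF_D, BUF_Q, BUF_COUNT };

static void * g_bufs [BUF_COUNT] = { nullptr, nullptr, nullptr, nullptr };
static size_t g_sizes[BUF_COUNT] = { 0, 0, 0, 0 };

// Allocate each buffer once, sized to the largest mat mul in the graph.
// max_sizes[] would come from a pre-pass over the graph (e.g. checks added
// around ggml_graph_compute).
static cudaError_t cuda_buffers_init(const size_t max_sizes[BUF_COUNT]) {
    for (int i = 0; i < BUF_COUNT; ++i) {
        cudaError_t err = cudaMalloc(&g_bufs[i], max_sizes[i]);
        if (err != cudaSuccess) return err;
        g_sizes[i] = max_sizes[i];
    }
    return cudaSuccess;
}

// Each mat mul reuses the same buffers, so there is no per-op allocation,
// no free, and no pool logic at all.
static void * cuda_buffer(cuda_buf which) { return g_bufs[which]; }

static void cuda_buffers_free(void) {
    for (int i = 0; i < BUF_COUNT; ++i) {
        cudaFree(g_bufs[i]);
        g_bufs[i]  = nullptr;
        g_sizes[i] = 0;
    }
}
```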

@SlyEcho
Collaborator

SlyEcho commented Apr 24, 2023

We could also investigate some more advanced memory management, like cudaMallocManaged, to have a single pointer valid in both device and host memory space; then we don't need to copy the data manually from one place to the other, but can just use the prefetch command to make it available on the device or the host as required.

Performance-wise I don't know if it would make a difference.
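As a rough, untested outline of the idea (assuming a device that supports managed memory and prefetching; this is only a sketch, not a patch):

```cpp
// Sketch of using managed (unified) memory for a tensor buffer.
// Illustrative only; assumes managed-memory and prefetch support on the device.
#include <cuda_runtime.h>

static float * alloc_shared(size_t n_floats) {
    float * ptr = nullptr;
    // One pointer valid on both host and device; pages migrate on demand.
    cudaMallocManaged(&ptr, n_floats * sizeof(float));
    return ptr;
}

static void move_to_device(float * ptr, size_t n_floats, int device, cudaStream_t stream) {
    // Hint the driver to migrate the pages to the GPU before the mat mul runs,
    // so the kernel does not stall on page faults.
    cudaMemPrefetchAsync(ptr, n_floats * sizeof(float), device, stream);
}

static void move_to_host(float * ptr, size_t n_floats, cudaStream_t stream) {
    // cudaCpuDeviceId prefetches the pages back to host memory.
    cudaMemPrefetchAsync(ptr, n_floats * sizeof(float), cudaCpuDeviceId, stream);
}
```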

@cmp-nct
Contributor Author

cmp-nct commented Apr 24, 2023

I think a prefetch/cache is certainly the way to go; there is a ton of room for improvement over the current implementation.

Regarding my variant: it was not tailored to llama; I'm just posting it here because this is currently the only ggml project that actually uses cuBLAS.
If you use the same four buffers each time it won't be necessary, but sooner or later another idea will come up and more buffers will be used, or another project will use cuBLAS and need more than a few buffers.

The current implementation would hand a 4 GB allocation to a 1 MB malloc request if that buffer happens to be the first one available.

So it's more a ggml improvement than a llama improvement.
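As a toy illustration of that first-fit problem (made-up sizes and host-only code, not the actual ggml pool):

```cpp
// Toy host-side illustration of first-fit vs best-fit buffer selection.
// Not the actual ggml pool code; sizes are made up to show the waste.
#include <cstddef>
#include <cstdio>
#include <vector>
#include <algorithm>

struct buf { size_t size; };

int main() {
    // Free pool: a 4 GB buffer happens to sit in front of a 1 MB buffer.
    std::vector<buf> pool = { { 4ull << 30 }, { 1ull << 20 } };
    const size_t request = 1ull << 20; // 1 MB request

    // First-fit (old behaviour): take the first buffer that is big enough.
    auto first = std::find_if(pool.begin(), pool.end(),
                              [&](const buf & b) { return b.size >= request; });

    // Best-fit (this PR): take the smallest buffer that is big enough.
    auto best = std::min_element(pool.begin(), pool.end(),
                                 [&](const buf & a, const buf & b) {
                                     if (a.size < request) return false; // too small sorts last
                                     if (b.size < request) return true;
                                     return a.size < b.size;
                                 });

    printf("first-fit hands out %zu bytes, best-fit hands out %zu bytes\n",
           first->size, best->size);
    return 0;
}
```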

@slaren
Member

slaren commented Apr 24, 2023

@SlyEcho I gave that a try, but cudaMemPrefetchAsync doesn't work on my machine, and without it the performance is very bad, so at least for me this doesn't seem to be a workable solution.

@SlyEcho
Collaborator

SlyEcho commented Apr 24, 2023

@slaren that sucks.

What about cudaHostAlloc() and cudaHostGetDevicePointer()?
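For reference, a minimal sketch of that zero-copy approach: pinned host memory mapped into the device address space with cudaHostAlloc(cudaHostAllocMapped) plus cudaHostGetDevicePointer(). Whether it beats explicit copies depends entirely on the hardware; this is only an illustration.

```cpp
// Sketch of zero-copy (mapped pinned) host memory; illustrative only.
#include <cuda_runtime.h>

static int map_host_buffer(size_t size, void ** host_ptr, void ** device_ptr) {
    // On older drivers, cudaSetDeviceFlags(cudaDeviceMapHost) must be called
    // before any other CUDA work for mapped allocations to be usable.
    // Pinned host allocation that is also mapped into the device address space.
    if (cudaHostAlloc(host_ptr, size, cudaHostAllocMapped) != cudaSuccess) {
        return -1;
    }
    // Device-side alias of the same memory; kernels access it over the bus,
    // so no explicit cudaMemcpy is needed (at the cost of higher access latency).
    if (cudaHostGetDevicePointer(device_ptr, *host_ptr, 0) != cudaSuccess) {
        cudaFreeHost(*host_ptr);
        return -1;
    }
    return 0;
}
```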
