4-bit 65B model overflows 64GB of RAM #702
Comments
There are two issues you're reporting here:
Let me know if any of the above suggestions work for you! We should ideally be able to stretch our RAM budgets as far as possible. Being able to operate in tight constraints is a hallmark of good engineering. So I'd like to see us be able to do as much as possible for you. I just don't know how much we can do.
I observe similar behavior with 78ca983 running the 7B model on an 8 GB RAM macOS machine (Haswell, 2 cores). Details
and here is when it is fully loaded. Details
Thank you for sharing such rich technical details @diimdeep. Have you evaluated our
Yeah
And I doubt adding --mlock would help in that regard, since adding it doesn't change the fact that most RAM is taken up by the buffer rather than the cache. Maybe there is a bug in the webui's code or llama.cpp's code causing the buffer usage?
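For context, here is a minimal sketch of the mechanism behind an mmap + --mlock style load, assuming a plain read-only mapping; this is not llama.cpp's actual loader. The point it illustrates: mlock() keeps already-mapped pages resident so they are not evicted and re-read from disk, but it does not shrink how much physical memory the mapping needs, so it cannot help when the weights plus other buffers simply exceed available RAM.

```c
/* Minimal sketch (not llama.cpp's real loading code): map a model file
 * read-only and optionally pin it with mlock(). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model-file> [--mlock]\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Read-only, demand-paged mapping: weight pages come in from disk as
     * they are touched and can be evicted again under memory pressure.   */
    void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    if (argc > 2) {
        /* Pin every page of the mapping in RAM. This prevents eviction
         * (and hence re-reads from disk), but the pages still count
         * against physical memory, so it cannot help if the model plus
         * other buffers exceed what the machine actually has.            */
        if (mlock(addr, st.st_size) != 0) {
            perror("mlock"); /* often a low RLIMIT_MEMLOCK, not a real lack of RAM */
        }
    }

    printf("mapped %lld bytes\n", (long long)st.st_size);
    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```

In other words, whether pinning helps depends entirely on whether the working set fits: if it does not, pinning just moves the failure from slow re-reads to an mlock error or the OOM killer.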
Seems to be fixed at this point.
Prerequisites
I am running the latest code. Development is very rapid so there are no tagged versions as of now.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
During inference, there should be little or no disk activity, and the disk should not be a bottleneck once past the model loading stage.
Current Behavior
My disk sustains a continuous read speed of over 100 MB/s; however, while loading the model it only reads at around 40 MB/s. After this very slow load of the LLaMA 65B model (converted from GPTQ with a group size of 128), llama.cpp starts inference, but during inference the program continues to occupy the disk, reading at 40 MB/s. Generation is also extremely slow, at around 10 minutes per token.
However, with a 30B model or smaller, llama.cpp works as expected.
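To put rough numbers on why a 4-bit 65B model can crowd out 64 GB once everything else is counted, here is a back-of-the-envelope estimate in C. The 20- and 24-byte block sizes correspond to the classic ggml q4_0/q4_1 layouts (blocks of 32 weights with an fp32 scale, plus an fp32 minimum for q4_1); the fixed "overhead" for the KV cache, scratch buffers, the webui's own Python process, and the rest of the system is an assumed ballpark, not a measurement.

```c
/* Rough RAM estimate for a 4-bit 65B ggml model.
 * Block sizes follow the classic q4_0/q4_1 layouts; the overhead
 * figure is an assumed ballpark, not a measured value. */
#include <stdio.h>

int main(void) {
    const double n_params = 65e9;                      /* 65B weights */
    const double GiB      = 1024.0 * 1024.0 * 1024.0;

    /* q4_0: per 32 weights -> 16 bytes of 4-bit quants + 4-byte scale */
    const double q4_0 = n_params * 20.0 / 32.0;
    /* q4_1: per 32 weights -> 16 bytes + 4-byte scale + 4-byte min    */
    const double q4_1 = n_params * 24.0 / 32.0;

    /* Assumed extra for KV cache, eval/scratch buffers, the webui's
     * Python process and the rest of the system.                      */
    const double overhead = 12.0 * GiB;

    printf("q4_0 weights %.1f GiB -> ~%.1f GiB working set\n",
           q4_0 / GiB, (q4_0 + overhead) / GiB);
    printf("q4_1 weights %.1f GiB -> ~%.1f GiB working set\n",
           q4_1 / GiB, (q4_1 + overhead) / GiB);
    return 0;
}
```

Even where the paper total stays under 64 GB, once the resident working set exceeds what the kernel can keep in memory, weight pages are pushed back to the RAID-1 HDDs (via swap, or via page-cache eviction with an mmap-based loader) and re-read during generation, which would line up with the sustained ~40 MB/s reads and multi-minute token times described above.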
Environment and Context
Note: my inferencing was done through oobabooga's text-generation-webui and its llama.cpp integration, as I have no idea how to use llama.cpp by itself...
CPU: Ryzen 5500
Flags:
RAM: 64GB of DDR4 running at 3000MHz
Disk where I stored my model file: 2x Barracuda 1TB HDDs in a RAID 1 configuration
System SSD: NV2 500GB
Linux fgdfgfthgr-MS-7C95 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Failure Information (for bugs)
Not sure what other information there is to provide.
Steps to Reproduce
Run iostat -y -d 5 to monitor disk activity during loading and inference.
Failure Logs
Llama.cpp version:
Pip environment:
md5sum ggml-model-q4_0.bin
3073a8eedd1252063ad9b440af7c90cc ggml-model-q4_1.bin