faster performance on older machines #18
Comments
@sibeliu what does -t do?
Not sure why, but on my Mac M1 Pro / 16 GB, using 4 threads works far better than 8 threads.
Here's my benchmark on Apple M1 16 GB (ms per token). I think it's good that the default value for -t is 4.
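A minimal sketch of how a thread-count sweep like this could be scripted, assuming ./main prints a "per token" timing line (as in the numbers quoted above); the exact label may vary between builds, and the short prompt and small -n are just to keep each run quick:
for t in 1 2 4 8; do
  echo "== -t $t =="
  # capture both stdout and stderr, then keep only the timing line
  ./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello" -t "$t" -n 64 2>&1 | grep -i "per token"
done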
@prusnak that's because there are 4 performance cores on the M1.
but on my M1 Pro I have 8 cores, which would be nice to use. With the current setup I'm only using 4.
It should, but it seems that there is some bottleneck on the M1 Pro that prevents getting better performance with 8 threads, resulting in slower inference with 8 threads than with 4; not sure why. I will do a better test.
I'm getting the same results on a 4c/8t i7 Skylake on Linux (7B model, 4-bit). -t 4 is several times faster than -t 8.
I guess this is because hyperthreading does not help with running the model? So the number of virtual cores is not important, only the number of physical cores?
Upon further testing, it seems that if I have anything else using the CPU (e.g. having Firefox open and watching a video), -t 8 slows to a crawl while -t 4 is relatively unaffected; after closing all CPU-consuming programs, -t 8 becomes faster than -t 4.
That looks like the cause. Even though getconf _NPROCESSORS_ONLN says 8, there are only 4 physical cores on my processor. But it is still odd that both -t 4 and -t 8 utilize only 50% of my available processor; if I launch other apps it goes over 50%. BTW, can you think of any way to make the GPU help out? It isn't doing anything at the moment.
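For what it's worth, a few standard commands for checking the physical vs. logical core split; this is only a diagnostic sketch (hw.physicalcpu/hw.logicalcpu are macOS sysctl keys, lscpu is the Linux counterpart), not something from the project itself:
sysctl -n hw.physicalcpu hw.logicalcpu                          # macOS: physical cores, then logical cores
lscpu | grep -E 'Core\(s\) per socket|Socket\(s\)|^CPU\(s\)'    # Linux: physical = cores/socket x sockets; CPU(s) is logical
getconf _NPROCESSORS_ONLN                                       # counts logical CPUs, which is why it reports 8 here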
This project is CPU-only; however, there's a different one that runs on the GPU. Keep in mind the weights are not compatible between the two projects. https://github.com/oobabooga/text-generation-webui
Thank you! I'll take a look. I just have the GPU in my MacBook; wish I had an A100 or something...
Interestingly, I suspect this is because the inference is memory I/O bottlenecked and not CPU bottlenecked. On my 16-core (32-hyperthread) system I'm not getting linear scaling by doubling the number of cores, and instructions per cycle look to be around 0.6, which also confirms this.
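As a rough sanity check on the bandwidth theory, a back-of-envelope sketch; the ~4 GB weight size and ~50 GB/s sustained bandwidth are assumed round numbers, not measurements, and it assumes each generated token streams the full weight file from RAM once:
echo "scale=1; 50 / 4" | bc          # ~12.5 tokens/s ceiling if RAM bandwidth is the limit
echo "scale=0; 1000 * 4 / 50" | bc   # ~80 ms per token at that ceiling, regardless of thread count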
Hi @sibeliu, I cannot load the model on my Intel i7 machine. I get:
main: build = 635 (5c64a09)
main: seed = 1686146865
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
error loading model: unexpectedly reached end of file
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/7B/ggml-model-q4_0.bin'
main: error: unable to load model
Any ideas?
Hi Iraklis,
How much memory do you have? Have you tried one of the smaller quantized versions that have been released recently? Also, what exact shell command are you using to run it?
I have 16 GB of memory. Here is my command:
./main -m ./models/7B/ggml-model-q4_0.bin -p "[PROMPT]" -t 4 -n 512
It looks to me like a corrupt binary. Maybe try re-downloading the model and starting from scratch? Sorry I can't be more helpful.
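"Unexpectedly reached end of file" usually points to a truncated file, so a quick size check is a reasonable first step; the ~4 GB figure below is approximate, and the checksum line only helps if a known-good value is available to compare against:
ls -lh ./models/7B/ggml-model-q4_0.bin          # should be on the order of 4 GB for the 7B q4_0 file
shasum -a 256 ./models/7B/ggml-model-q4_0.bin   # compare against a known-good checksum if you have one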
@jyviko: Where did you download the model file from? If you downloaded it from somewhere, try converting it from the original PyTorch weights to see if it works:
python convert-pth-to-ggml.py models/7B/ 1
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
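If the two steps above complete, both files should exist; a rough sanity check (sizes are approximate, assumed for the 7B model):
ls -lh models/7B/ggml-model-f16.bin models/7B/ggml-model-q4_0.bin
# expect roughly 13 GB for the f16 file and roughly 4 GB for the q4_0 file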
On machines with smaller memory and slower processors, it can be useful to reduce the overall number of threads running. For instance, on my MacBook Pro (Intel i5, 16 GB), 4 threads is much faster than 8. Try:
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT " -t 4 -n 512
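A variation on the command above that picks -t from the physical core count instead of hard-coding 4; this is only a sketch with the prompt shortened (hw.physicalcpu is macOS-specific, and the lscpu fallback counts unique core ids on Linux):
if sysctl -n hw.physicalcpu >/dev/null 2>&1; then
  THREADS=$(sysctl -n hw.physicalcpu)                         # macOS
else
  THREADS=$(lscpu -p=CORE | grep -v '^#' | sort -u | wc -l)   # Linux
fi
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello" -t "$THREADS" -n 512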