-
Those numbers seem normal for CPU. But I can't tell whether you're using OpenBLAS/Accelerate, which would speed up prompt evaluation. A smaller context will also be faster. I also think you may be a little too close to running out of memory.
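A minimal sketch of what that might look like; the exact build flags depend on your llama.cpp version, so treat them as assumptions:

```sh
# On macOS, Accelerate is typically picked up by default; on Linux, OpenBLAS
# has to be requested explicitly (flag name from llama.cpp's Makefile of that era):
make clean
make LLAMA_OPENBLAS=1

# A smaller context window (-c) reduces prompt-eval work; 512 is just an example.
./main -m ./models/7B/ggml-model-q4_0.bin -c 512 -p "Hello"
```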
-
@SlyEcho, I read that the problem is in the 4-bit quantization. Memory-wise it's working fine, only about 10 GB used. Do you use the GPU for llama.cpp?
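On the GPU question: on Apple Silicon, llama.cpp can offload work to the GPU via Metal. A rough sketch, assuming a build recent enough to have Metal support (the `LLAMA_METAL` flag and `-ngl` value are assumptions about that era's options):

```sh
# Build with Metal enabled (newer versions enable it by default on macOS).
make clean
make LLAMA_METAL=1

# -ngl offloads layers to the GPU; -ngl 1 was the conventional way to turn Metal on early on.
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 1 -p "Hello"
```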
-
Here's the prompt:
Here's the command:
Here's the result:
Question: why is llama.cpp so slow at parsing a (not so) large prompt? I have tried using --mlock but it makes no difference whatsoever; is there anything I did wrong?
My system: Mac mini M2 Pro, 16 GB
TIA
Edit: I think the slowness is in prompt eval time. I just found out that even with a simple prompt, maybe ~100 tokens, there is some delay, e.g. it sometimes pauses before User: appears.
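A quick way to confirm where the time goes is llama.cpp's timing summary, printed after each run; the field layout shown in the comments is approximate and may differ by version:

```sh
./main -m ./models/7B/ggml-model-q4_0.bin --mlock -f prompt.txt -n 128

# The summary separates the two phases, roughly:
#   llama_print_timings: prompt eval time = ... ms / 100 tokens (... ms per token)
#   llama_print_timings:        eval time = ... ms / 128 tokens (... ms per token)
# A large "prompt eval time" alongside a normal "eval time" confirms the delay
# is in prompt processing, not in token generation.
```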