Add AVX2 implementation of dequantize_row_q4_0 #467
A quick performance test shows significant improvement in the function itself (with k=4096):
The first chunks of the perplexity computation show the same values. I didn't run the full test, but I have no reason to believe that it would produce different values.
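For readers unfamiliar with the function under discussion, here is a minimal scalar sketch of the kind of loop an AVX2 version would vectorize. It is not the code from this PR. The block layout (one float scale followed by 16 bytes that each pack two 4-bit values, stored with an offset of 8) is an assumption made for illustration, and every name ending in `_sketch` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32 /* values per quantized block (assumed) */

/* Hypothetical block layout, assumed for illustration only:
 * one float scale followed by QK/2 bytes, each packing two 4-bit values. */
typedef struct {
    float   d;          /* per-block scale */
    uint8_t qs[QK / 2]; /* packed 4-bit quants */
} block_q4_0_sketch;

/* Scalar dequantization of k values; k must be a multiple of QK. */
static void dequantize_row_q4_0_sketch(const block_q4_0_sketch *x, float *y, int k) {
    assert(k % QK == 0);
    const int nb = k / QK; /* number of blocks in the row */

    for (int i = 0; i < nb; i++) {
        const float d = x[i].d;

        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = x[i].qs[l / 2];

            /* Each nibble is unsigned in [0, 15]; subtracting 8 maps it to [-8, 7]. */
            const int v0 = (vi & 0x0F) - 8; /* low nibble  -> even index */
            const int v1 = (vi >> 4) - 8;   /* high nibble -> odd index  */

            y[i * QK + l + 0] = (float)v0 * d;
            y[i * QK + l + 1] = (float)v1 * d;
        }
    }
}
```

Under these assumptions, a block with scale 0.5 whose bytes are all `0x98` dequantizes to alternating 0 and 0.5. A scalar version like this is also a convenient reference for checking that a vectorized implementation produces the same values.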
@ggerganov we need some sort of benchmarking suite for ggml. @slaren how complex is the
It's a standalone test using the Google Benchmark library. Here is the code: https://gist.github.com/slaren/ba732ed08abd0ba148129eab3335dfb7
The
Otherwise, they will never be called.
@ggerganov that's not what I am seeing, here is a stack trace for example:
Ah yes - there is one exception -- the
Ah, I see. I am running some tests with BLAS now and will report back when I have some results. Unfortunately, it seems to be much slower; I probably need to find a better BLAS library than the libopenblas-dev package from Ubuntu.
@ggerganov When building with BLAS and running with -b 32 and a long enough prompt, I only get garbage generation (not just bad, but random tokens). This happens on master too. Is it possible that BLAS support is broken at the moment?
Yes, it is broken. Weird...
OK, BLAS has been fixed, and for large prompts and batch sizes (> 256) there is a significant benefit to enabling BLAS.
I am seeing a very significant improvement on x86 as well; for instance, the perplexity computation went from ~8 hours to ~5 hours.
I didn't notice a big performance improvement; more testing is necessary.