Multi-thread the Q8_0 quantization in ggml_compute_forward_mul_mat_q_f32() #1081

Closed
ggerganov opened this issue Apr 20, 2023 · 1 comment
Labels
enhancement New feature or request good first issue Good for newcomers performance Speed related topics

Comments

@ggerganov
Member

This part takes about 10% of the total inference time for 7B and it is currently single-threaded:

https://github.com/ggerganov/llama.cpp/blob/6a9661ea5ad72166b700ae5e87976e4452499dda/ggml.c#L7877-L7884

Try to multi-thread this by splitting the work across rows.
Since GGML_TASK_INIT currently runs on only 1 thread, either:

  • update ggml to support multi-threaded GGML_TASK_INIT
  • move the quantization into GGML_TASK_COMPUTE (might be difficult since there is no barrier mechanism)
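
For illustration, splitting the quantization across rows could look roughly like the sketch below. It is not the ggml code: `block_q8`, `quantize_row` and `quantize_rows_slice` are hypothetical names, the block layout is a simplified stand-in for Q8_0 (one float scale plus 32 int8 values), and it assumes each thread knows its index `ith` and the thread count `nth`. Each thread would call `quantize_rows_slice` with its own `ith`. The slices do not overlap, so the quantization itself needs no locking, but the dot products that read the quantized rows can only start once every slice is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK 32  /* values per block (assumed, mirroring Q8_0) */

/* Hypothetical, simplified block: one scale and QK 8-bit quants. */
typedef struct {
    float  d;
    int8_t qs[QK];
} block_q8;

/* Quantize one row of n floats; n must be a multiple of QK. */
static void quantize_row(const float *x, block_q8 *y, int n) {
    assert(n % QK == 0);
    for (int b = 0; b < n / QK; b++) {
        /* largest absolute value in the block */
        float amax = 0.0f;
        for (int j = 0; j < QK; j++) {
            const float v = x[b*QK + j];
            const float a = v < 0.0f ? -v : v;
            if (a > amax) amax = a;
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[b].d = d;
        for (int j = 0; j < QK; j++) {
            const float s = x[b*QK + j] * id;
            /* round half away from zero, then narrow to int8 */
            y[b].qs[j] = (int8_t)(s >= 0.0f ? s + 0.5f : s - 0.5f);
        }
    }
}

/* Thread `ith` of `nth` quantizes only its contiguous slice of rows. */
static void quantize_rows_slice(const float *src, block_q8 *dst,
                                int nrows, int ncols, int ith, int nth) {
    const int dr  = (nrows + nth - 1) / nth;              /* rows per thread, rounded up */
    const int ir0 = dr * ith;                             /* first row of this slice */
    const int ir1 = ir0 + dr < nrows ? ir0 + dr : nrows;  /* one past the last row */
    for (int ir = ir0; ir < ir1; ir++) {
        quantize_row(src + (size_t) ir * ncols,
                     dst + (size_t) ir * (ncols / QK), ncols);
    }
}
```

When `nrows` is not a multiple of `nth`, the last threads get a shorter slice or none at all, so the loop simply does not run for them.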
@ggerganov ggerganov added enhancement New feature or request performance Speed related topics labels Apr 20, 2023
@ggerganov ggerganov added the good first issue Good for newcomers label Apr 20, 2023
@ggerganov ggerganov self-assigned this Apr 23, 2023
@ggerganov
Member Author

Testing with the latest code base, the Q8_0 quantization part is negligible. I'm not sure how I measured 10% back when I created the issue, but I have now done 2 separate runs, with and without calling quantize_row_q_dot(), and the time per token is pretty much the same.

Also, multi-threading it via the second approach only degrades the performance.
The first approach would need more changes and I don't think it is really worth it.
