Closed
Labels
good first issue (Good for newcomers), performance (Speed related topics)
Description
The following two matrix multiplication calls still remain in FP16 precision:
- https://github.com/ggerganov/llama.cpp/blob/d40fded93e1a533e969768e1e335c15c61c296ce/llama.cpp#L1135-L1137
- https://github.com/ggerganov/llama.cpp/blob/d40fded93e1a533e969768e1e335c15c61c296ce/llama.cpp#L1158-L1160
Was wondering if there would be any benefit from quantizing those on-the-fly.
The quantization can be done with an extra ggml_cpy() call before the ggml_mul_mat() call.
See if this speeds up the computation and how it affects perplexity.