Closed
Labels
good first issue (Good for newcomers), performance (Speed related topics)
Description
The following two matrix multiplication calls still remain in FP16 precision:
- https://github.com/ggerganov/llama.cpp/blob/d40fded93e1a533e969768e1e335c15c61c296ce/llama.cpp#L1135-L1137
- https://github.com/ggerganov/llama.cpp/blob/d40fded93e1a533e969768e1e335c15c61c296ce/llama.cpp#L1158-L1160
Was wondering if there would be any benefit from quantizing those on-the-fly.
The quantization can be done with an extra ggml_cpy() call before the ggml_mul_mat() call.
See if this speeds up the computation and how it affects perplexity.