Try to use quantized ggml_mul_mat in attention layer #1098

@ggerganov

Description

The following 2 matrix multiplication calls still remain in FP16 precision:

I was wondering: if we quantize those on the fly, would there be any benefit?
The quantization can be done with an extra ggml_cpy() call before the ggml_mul_mat() call.

See whether this speeds up the computation and how it affects perplexity.
