Add Q4_3 quantization (ARM NEON) #1082

Merged: 1 commit into master from q4_3 on Apr 20, 2023
Conversation

ggerganov
Member

@ggerganov ggerganov commented Apr 20, 2023

The initial Q4_3 implementation runs at ~82 ms / token on M1.
We need to see if we can optimize that somehow.

For example, Q4_1 runs at ~55 ms / token, so there is probably a lot of room for improvement.

#define QK4_3 16
typedef struct {
    ggml_fp16_t d;         // delta
    ggml_fp16_t m;         // min
    uint8_t qs[QK4_3 / 2]; // nibbles / quants
} block_q4_3;

Merging this, although the speed is not satisfactory. We have to try to get it as fast as Q4_1.
We might have to change the block_q4_3 layout to achieve this.

@ggerganov ggerganov force-pushed the q4_3 branch 2 times, most recently from eed22ae to dff03c0 Compare April 20, 2023 16:51
@ggerganov ggerganov marked this pull request as ready for review April 20, 2023 17:18
@ggerganov ggerganov merged commit e0305ea into master Apr 20, 2023
@ggerganov ggerganov deleted the q4_3 branch April 20, 2023 17:35
Collaborator

@prusnak prusnak left a comment

M1 16 GB benchmark:

7B q4_3 4 threads: 180 ms/token
7B q4_3 8 threads: 280 ms/token

2 participants