Fails to load 30B model after quantization #27
Labels: build (compilation issues)

Trying the 30B model on an M1 MBP with 32 GB RAM. I ran quantization on all 4 outputs of the conversion to ggml, but can't load the model for evaluation:

This issue does not happen when I run the 7B model.

Comments
Make sure to recompile quantize and main after pulling an update and quantizing your weights; that one got me too!
Yep, that was the issue. I am too used to Python: I pulled and re-quantized, but I didn't recompile!
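For anyone hitting this later, a minimal sketch of the full sequence, assuming the Makefile-based build and the binary names from this era of the repo (model paths and the q4_0 type argument are illustrative):

```sh
# Pull the update, then rebuild BOTH binaries; a stale ./main built against an
# older ggml file format will fail to load freshly quantized weights.
git pull
make clean && make

# Re-quantize the converted f16 weights (2 = q4_0), repeating for each shard
# of the multi-part 30B model, then try loading the result.
./quantize ./models/30B/ggml-model-f16.bin ./models/30B/ggml-model-q4_0.bin 2
./main -m ./models/30B/ggml-model-q4_0.bin -p "Hello"
```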
flowgrad pushed a commit to flowgrad/llama.cpp that referenced this issue on Jun 27, 2023:
* Performance optimizations on KV handling (for large context generations)

Timings for 300-token generation:

Falcon 40B q6_k before patch: 4.5 tokens/second (220 ms/token)
Falcon 40B q6_k after patch: 8.2 tokens/second (121 ms/token)
Falcon 7B 5_1 before patch: 16 tokens/second (60 ms/token)
Falcon 7B 5_1 after patch: 23.8 tokens/second (42 ms/token)

So 148% generation speed on 5-bit Falcon 7B and 182% on 6-bit Falcon 40B.

We still have a significant slowdown:

Falcon 7B 5-bit on 128-token generation is at 31 tokens/second
Falcon 7B 5-bit on 128-token generation is at 12 tokens/second

Next step is getting rid of repeat2 altogether, which should give another doubling in generation (and more for very large contexts).
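As a quick sanity check on those figures (simple arithmetic, not from the commit itself): tokens/second and ms/token are reciprocals, and the after/before ratios match the quoted speedups.

```sh
# 1000 / (tokens/s) = ms/token; after/before ratios give the speedup factors.
python3 -c "print(1000/8.2, 8.2/4.5, 23.8/16)"   # ~121.9 ms, ~1.82x, ~1.49x
```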
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue on Dec 19, 2023.
AAbushady pushed a commit to AAbushady/llama.cpp that referenced this issue on Jan 27, 2024:

…ing - Allow use of hip SDK (if installed) dlls on windows (ggml-org#470)

* If the rocm/hip SDK is installed on Windows, then include the SDK as a potential location to load the hipBLAS/rocBLAS .dlls from. This allows running koboldcpp.py directly with Python on Windows after building, without having to build the .exe and run that, or copy .dlls around.

Co-authored-by: one-lithe-rune <[email protected]>
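A rough sketch of what this changes for a Windows user (not the commit's actual code; the DLL names come from the commit message, and %HIP_PATH% is the variable the HIP SDK installer sets):

```sh
# Before: the hipBLAS/rocBLAS DLLs had to be copied next to the script,
# or you had to build and run the packaged .exe:
#   copy "%HIP_PATH%\bin\hipblas.dll" .
#   copy "%HIP_PATH%\bin\rocblas.dll" .
# After this commit the installed SDK location is searched as well, so a
# plain source checkout runs directly after building:
python koboldcpp.py
```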