
Fails to load 30B model after quantization #27


Closed
MLTQ opened this issue Mar 11, 2023 · 2 comments
Labels
build (Compilation issues)

Comments


MLTQ commented Mar 11, 2023

Trying the 30B model on an M1 MBP with 32 GB RAM. I ran quantization on all 4 outputs of the conversion to ggml, but I can't load the model for evaluation:

llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 6656
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 17920
llama_model_load: ggml ctx size = 20951.50 MB
llama_model_load: memory_size =  1560.00 MB, n_mem = 30720
llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
main: failed to load model from './models/30B/ggml-model-q4_0.bin'
llama_model_load: %

This issue does not happen when I run the 7B model.


djkz commented Mar 12, 2023

Make sure to recompile quantize and main after pulling an update and quantizing your weights; that one got me too!
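
For reference, a minimal sketch of an order of operations that avoids this mismatch; the quantize arguments and the multi-part 30B file names follow the project README of that era and are illustrative, so they may differ for newer revisions:

git pull
make                 # rebuild main and quantize so they match the current ggml file format
./quantize ./models/30B/ggml-model-f16.bin   ./models/30B/ggml-model-q4_0.bin   2
./quantize ./models/30B/ggml-model-f16.bin.1 ./models/30B/ggml-model-q4_0.bin.1 2
./quantize ./models/30B/ggml-model-f16.bin.2 ./models/30B/ggml-model-q4_0.bin.2 2
./quantize ./models/30B/ggml-model-f16.bin.3 ./models/30B/ggml-model-q4_0.bin.3 2
./main -m ./models/30B/ggml-model-q4_0.bin -n 128 -p "Hello"

If the tensor-size error still appears with freshly built binaries, the f16 files may also need to be regenerated with the current conversion script.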


MLTQ commented Mar 12, 2023

Yep, that was the issue. I'm too used to Python: I pulled and re-quantized, but I didn't recompile!

MLTQ closed this as completed Mar 12, 2023
gjmulder added the build (Compilation issues) label, and added and removed the model (Model specific) label, Mar 15, 2023
flowgrad pushed a commit to flowgrad/llama.cpp that referenced this issue Jun 27, 2023
* Performance optimizations on KV handling
(for large context generations)
Timings for 300 token generation below:
Falcon 40B q6_k before patch: 4.5 tokens/second (220 ms/token)
Falcon 40B q6_k after patch: 8.2 tokens/second (121 ms/token)
Falcon 7B 5_1 before patch: 16 tokens/second (60 ms/token)
Falcon 7B 5_1 after patch: 23.8 tokens/second (42 ms/token)

So 148% generation speed on 5-bit Falcon 7B and 182% on 6-bit Falcon 40B.
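
As a quick arithmetic check of those percentages against the timings above: 8.2 / 4.5 ≈ 1.82 (~182% of the original 40B throughput) and 23.8 / 16 ≈ 1.49 (~148–149% for 7B), so the summary figures are consistent with the per-model numbers.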

We still have a significant slowdown:
Falcon 7B 5bit on 128 token generation is at 31 tokens/second
Falcon 7B 5bit on 128 token generation is at 12 tokens/second

Next step is getting rid of repeat2 altogether, which should give another doubling in generation speed (and more for very large contexts)

---------
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023
AAbushady pushed a commit to AAbushady/llama.cpp that referenced this issue Jan 27, 2024
…ing - Allow use of hip SDK (if installed) dlls on windows (ggml-org#470)

* If the rocm/hip SDK is installed on windows, then include the SDK
as a potential location to load the hipBlas/rocBlas .dlls from. This
allows running koboldcpp.py directly with python after building to
work on windows, without having to build the .exe and run that or
copy .dlls around.

Co-authored-by: one-lithe-rune <[email protected]>
zkh2016 pushed a commit to zkh2016/llama.cpp that referenced this issue Oct 18, 2024