
Fails to load 30B model after quantization #27


Closed
MLTQ opened this issue Mar 11, 2023 · 2 comments
Labels
build (Compilation issues)

Comments


MLTQ commented Mar 11, 2023

Trying the 30B model on an M1 MBP with 32 GB RAM. I ran quantization on all 4 outputs of the conversion to ggml, but I can't load the model for evaluation:

llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 6656
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 52
llama_model_load: n_layer = 60
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 17920
llama_model_load: ggml ctx size = 20951.50 MB
llama_model_load: memory_size =  1560.00 MB, n_mem = 30720
llama_model_load: tensor 'tok_embeddings.weight' has wrong size in model file
main: failed to load model from './models/30B/ggml-model-q4_0.bin'
llama_model_load: %

This issue does not happen when I run the 7B model.


djkz commented Mar 12, 2023

Make sure to recompile quantize and main after pulling an update and quantizing your weights; that one got me too!
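
For reference, a minimal sketch of an order of operations that avoids this mismatch; the quantize arguments and the multi-part 30B file names follow the project README of that era and are illustrative, so they may differ for newer revisions:

git pull
make                 # rebuild main and quantize so they match the current ggml file format
./quantize ./models/30B/ggml-model-f16.bin   ./models/30B/ggml-model-q4_0.bin   2
./quantize ./models/30B/ggml-model-f16.bin.1 ./models/30B/ggml-model-q4_0.bin.1 2
./quantize ./models/30B/ggml-model-f16.bin.2 ./models/30B/ggml-model-q4_0.bin.2 2
./quantize ./models/30B/ggml-model-f16.bin.3 ./models/30B/ggml-model-q4_0.bin.3 2
./main -m ./models/30B/ggml-model-q4_0.bin -n 128 -p "Hello"

If the tensor-size error still appears with freshly built binaries, the f16 files may also need to be regenerated with the current conversion script.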


MLTQ commented Mar 12, 2023

Yep, that was the issue. I'm too used to Python: I pulled and re-quantized, but I didn't recompile!

MLTQ closed this as completed Mar 12, 2023
gjmulder added the build (Compilation issues) label, and added and removed the model (Model specific) label, Mar 15, 2023
flowgrad pushed a commit to flowgrad/llama.cpp that referenced this issue Jun 27, 2023
* Performance optimizations on KV handling
(for large context generations)
Timings for 300 token generation below:
Falcon 40B q6_k before patch: 4.5 tokens/second (220 ms/token)
Falcon 40B q6_k after patch: 8.2 tokens/second (121 ms/token)
Falcon 7B 5_1 before patch: 16 tokens/second (60 ms/token)
Falcon 7B 5_1 after patch: 23.8 tokens/second (42 ms/token)

So 148% generation speed on 5-bit Falcon 7B and 182% on 6-bit Falcon 40B.
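
As a quick arithmetic check of those percentages against the timings above: 8.2 / 4.5 ≈ 1.82 (~182% of the original 40B throughput) and 23.8 / 16 ≈ 1.49 (~148–149% for 7B), so the summary figures are consistent with the per-model numbers.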

We still have a significant slowdown:
Falcon 7B 5bit on 128 token generation is at 31 tokens/second
Falcon 7B 5bit on 128 token generation is at 12 tokens/second

Next step is getting rid of repeat2 altogether, which should give another doubling in generation speed (and more for very large contexts)

---------
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023
AAbushady pushed a commit to AAbushady/llama.cpp that referenced this issue Jan 27, 2024
…ing - Allow use of hip SDK (if installed) dlls on windows (ggml-org#470)

* If the rocm/hip SDK is installed on windows, then include the SDK
as a potential location to load the hipBlas/rocBlas .dlls from. This
allows running koboldcpp.py directly with python after building to
work on windows, without having to build the .exe and run that or
copy .dlls around.

Co-authored-by: one-lithe-rune <[email protected]>
zkh2016 pushed a commit to zkh2016/llama.cpp that referenced this issue Oct 18, 2024