Closed
Description
Both the ggml-model-q4_0
and ggml-model-f16
produce a garbage output on my M1 Air 8GB, using the 7B LLaMA model. I've seen the quantized model having problems but I doubt the quantization is the issue as the non-quantized model produces the same output.
➜ llama.cpp git:(master) ./main -m ./models/7B/ggml-model-f16.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
main: seed = 1678812348
llama_model_load: loading model from './models/7B/ggml-model-f16.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 1
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 13365.09 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-f16.bin'
llama_model_load: ........... done
llama_model_load: model size = 4274.30 MB / num tensors = 90
system_info: n_threads = 8 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
1 -> ''
8893 -> 'Build'
292 -> 'ing'
263 -> ' a'
4700 -> ' website'
508 -> ' can'
367 -> ' be'
2309 -> ' done'
297 -> ' in'
29871 -> ' '
29896 -> '1'
29900 -> '0'
2560 -> ' simple'
6576 -> ' steps'
29901 -> ':'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Building a website can be done in 10 simple steps:Administrationistrunkoveryabasepair tou cross deprecatedinition holes prvindor^C