Is it possible to run 65B with 32Gb of Ram ? #503
I ran it and it took about 45 GB of RAM and was pretty slow.
I tried it recently and it's a terrible experience. My computer kept freezing, and the disk swapping made it hard even to move the mouse (I have a very fast SSD), not to mention performance of roughly one word per minute. Stick with 30B: it's still good and very capable, and you can actually use your computer while running the model. Otherwise, upgrade to 64 GB.
Yes, with 32 GB of RAM the largest 4-bit quantized model you can run is 30B; I couldn't use 65B at all. With 3-bit or 2-bit quantization the 65B could fit, but from the published data it seems that for both RTN and GPTQ, a 4-bit quantized 30B outperforms a 3-bit or 2-bit 65B.
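For a rough sense of why 30B fits in 32 GB and 65B doesn't, here is a back-of-envelope sketch of the q4_0 weight footprint. It assumes ggml's q4_0 block layout (32 weights per block: 16 bytes of 4-bit nibbles plus a 4-byte fp32 scale, i.e. 5 bits per weight) and round parameter counts:

```python
# Back-of-envelope q4_0 footprint, assuming ggml's q4_0 block layout:
# 32 weights per block = 16 bytes of 4-bit nibbles + a 4-byte fp32 scale.
BYTES_PER_BLOCK = 20
WEIGHTS_PER_BLOCK = 32

def q4_0_gib(n_params: float) -> float:
    """Approximate q4_0 weight size in GiB (weights only; ignores the
    KV cache and other runtime buffers)."""
    return n_params / WEIGHTS_PER_BLOCK * BYTES_PER_BLOCK / 2**30

for n in (30e9, 65e9):
    print(f"{n / 1e9:.0f}B -> ~{q4_0_gib(n):.1f} GiB")
# 30B -> ~17.5 GiB, 65B -> ~37.8 GiB; adding the ~2.5 GB KV cache puts
# 65B close to the ~41.5 GB ggml ctx size reported in the log below.
```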
The memory access patterns are very random, so trying to use swap just ends up benchmarking the OS paging system and your swap drive.
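To put a number on that: even in the hypothetical best case where swapping were purely bandwidth-bound (it isn't, precisely because the access pattern is random), every generated token has to touch all of the model's weights, which caps throughput far below usable speeds:

```python
# Hypothetical best case: every token reads all 65B q4_0 weights once,
# streamed sequentially from disk (random access, as in practice, is worse).
model_gb = 40.6   # approx. q4_0 65B weight bytes, from the estimate above
ssd_gbps = 3.0    # assumed NVMe sequential read bandwidth, GB/s
print(f"lower bound: ~{model_gb / ssd_gbps:.0f} s per token")
```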
I think this has been answered; let's close.
FWIW, I was unable to use quantized 65B even with 64 GB of RAM. Still way too slow on Ubuntu with an i9-10850K and an RTX 3060, with the binary compiled with BLAS. Giving up and switching to 30B.
I already quantized my files with this command: ./quantize ./ggml-model-f16.bin.X E:\GPThome\LLaMA\llama.cpp-master-31572d9\models\65B\ggml-model-q4_0.bin.X 2. The first time it reduced each file's size from 15.9 to 4.9 GB, and when I tried to run it again nothing changed. After that I executed "./main -m ./models/65B/ggml-model-q4_0.bin -n 128 --interactive-first", and once everything is loaded I enter my prompt, my memory usage goes to 98% (25 GB used by main.exe), and I just wait dozens of minutes with nothing appearing. Here's an example:
PS E:\GPThome\LLaMA\llama.cpp-master-31572d9> ./main -m ./models/65B/ggml-model-q4_0.bin -n 128 --interactive-first
main: seed = 1679761762
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 8192
llama_model_load: n_mult = 256
llama_model_load: n_head = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size = 2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: prompt: ' '
main: number of tokens in prompt = 2
1 -> ''
29871 -> ' '
main: interactive mode on.
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
== Running in interactive mode. ==
how to become rich
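The log itself shows the problem: llama_model_load reports a ~41.5 GB ggml ctx size, which cannot fit in 32 GB of RAM, so the run is paging on every token. A pre-flight check along these lines (a hypothetical helper, not part of llama.cpp; psutil is a third-party package) would catch this before launching:

```python
import os
import psutil  # third-party: pip install psutil

def fits_in_ram(model_path: str, headroom_gib: float = 4.0) -> bool:
    """Hypothetical helper: sum the sizes of all model parts (.1, .2, ...
    suffixes, as in the log above) and compare against available RAM,
    keeping some headroom for the KV cache and the rest of the system."""
    total = os.path.getsize(model_path)
    i = 1
    while os.path.exists(f"{model_path}.{i}"):
        total += os.path.getsize(f"{model_path}.{i}")
        i += 1
    return total + headroom_gib * 2**30 <= psutil.virtual_memory().available

print(fits_in_ram("./models/65B/ggml-model-q4_0.bin"))
```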