Is it possible to run 65B with 32GB of RAM? #503


Closed
TerraTR opened this issue Mar 25, 2023 · 6 comments
Labels
hardware (Hardware related), model (Model specific), question (Further information is requested)

Comments

@TerraTR

TerraTR commented Mar 25, 2023

I already quantized my files with this command: "./quantize ./ggml-model-f16.bin.X E:\GPThome\LLaMA\llama.cpp-master-31572d9\models\65B\ggml-model-q4_0.bin.X 2". The first time it reduced each file from 15.9GB to 4.9GB, and when I tried to run it again nothing changed. Then I executed "./main -m ./models/65B/ggml-model-q4_0.bin -n 128 --interactive-first"; once everything was loaded I entered my prompt, my memory usage went to 98% (25GB used by main.exe), and I just waited dozens of minutes with nothing appearing. Here's an example:

PS E:\GPThome\LLaMA\llama.cpp-master-31572d9> ./main -m ./models/65B/ggml-model-q4_0.bin -n 128 --interactive-first
main: seed = 1679761762
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 8192
llama_model_load: n_mult = 256
llama_model_load: n_head = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size = 2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |

main: prompt: ' '
main: number of tokens in prompt = 2
1 -> ''
29871 -> ' '

main: interactive mode on.
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000

== Running in interactive mode. ==

  • Press Ctrl+C to interject at any time.
  • Press Return to return control to LLaMa.
  • If you want to submit another line, end your input in '\'.

how to become rich
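
A quick back-of-the-envelope check against the figures in this log shows why 32GB falls short. The sketch below just re-adds the numbers that llama_model_load prints above; it is an approximation, not an exact accounting:

```python
# Rough memory estimate for the q4_0 65B model, using the figures
# printed by llama_model_load in the log above.

n_parts = 8
part_size_mb = 4869.09   # "model size = 4869.09 MB" per part
kv_cache_mb = 2560.00    # "memory_size = 2560.00 MB" (KV cache at n_ctx = 512)

total_mb = n_parts * part_size_mb + kv_cache_mb
print(f"approx. resident memory: {total_mb / 1024:.1f} GiB")
# ~40 GiB, in line with "ggml ctx size = 41477.73 MB" and well above
# 32 GiB of RAM, so the OS has to page to disk and generation crawls.
```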

@sussyboiiii

I ran it and it took about 45GB of RAM and was pretty slow.

@SpeedyCraftah

I tried it recently and it's a terrible experience. My computer kept freezing, and the disk swapping made it hard even to move the mouse (and I have a very fast SSD), not to mention the roughly one minute per word of generation speed.

Stick with 30B; it's still very good and capable, and you can actually use your computer while running the model. Otherwise, upgrade to 64GB.

@anzz1
Contributor

anzz1 commented Mar 25, 2023

Yes, with 32GB of RAM the largest 4-bit quantized model you can run is 30B; I couldn't use 65B at all. With 3-bit or 2-bit quantization the 65B model could fit, but from the published data it seems that for both RTN and GPTQ, the 4-bit quantized 30B outperforms the 3-bit/2-bit 65B.
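
For a rough sense of the bit-width arithmetic behind this, the sketch below just multiplies parameter count by bits per weight; it ignores per-block quantization overhead and the KV cache, so the real footprints are somewhat higher:

```python
# Approximate weight storage at different bit widths, ignoring per-block
# quantization overhead and the KV cache (treat these as lower bounds).

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

for params in (30, 65):
    for bits in (4, 3, 2):
        print(f"{params}B @ {bits}-bit: {weight_gib(params, bits):5.1f} GiB")
# 65B at 4-bit is ~30 GiB of weights alone, leaving no headroom on a
# 32 GiB machine; 3-bit (~23 GiB) or 2-bit (~15 GiB) would physically fit.
```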

@gjmulder
Collaborator

gjmulder commented Mar 26, 2023

The memory access patterns are very random, so trying to use swap just ends up benchmarking the OS paging system and your swap drive.
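
One practical takeaway is to check whether the model fits in physical RAM before loading it at all. A minimal sketch, assuming the multi-part file layout from the log above and using psutil (a third-party library, not part of llama.cpp) to read free memory:

```python
# Pre-flight check: compare the total model size against available physical
# RAM before loading, so the run doesn't silently fall back to swap.
import os
import psutil

model_path = "./models/65B/ggml-model-q4_0.bin"  # path used in the log above

# Multi-part ggml models are split as .bin, .bin.1, ..., .bin.7.
parts = [model_path] + [f"{model_path}.{i}" for i in range(1, 8)]
model_bytes = sum(os.path.getsize(p) for p in parts if os.path.exists(p))

available = psutil.virtual_memory().available
print(f"model: {model_bytes / 2**30:.1f} GiB, free RAM: {available / 2**30:.1f} GiB")
if model_bytes > available:
    print("Model does not fit in RAM; expect heavy swapping and very slow output.")
```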

gjmulder added the question (Further information is requested), hardware (Hardware related), and model (Model specific) labels on Mar 26, 2023
prusnak closed this as completed on Mar 26, 2023
@prusnak
Collaborator

prusnak commented Mar 26, 2023

I think this has been answered, let's close.

@slipperybeluga

FWIW, I was unable to use the quantized 65B even with 64GB of RAM. It was still way too slow on Ubuntu with an i9-10850K and an RTX 3060, with the binary compiled with BLAS. Giving up and switching to 30B.
