
4bit 65B model overflow 64GB of RAM #702


Closed
fgdfgfthgr-fox opened this issue Apr 2, 2023 · 7 comments
Assignees: jart
Labels: linux, need more info, performance

Prerequisites

I am running the latest code. Development is very rapid so there are no tagged versions as of now.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

During inference there should be little or no disk activity, and the disk should not be a bottleneck once past the model-loading stage.

Current Behavior

My disks should sustain continuous reads of over 100 MB/s; however, during the loading of the model they only read at around 40 MB/s. After this very slow load of the LLaMA 65B model (converted from GPTQ with a group size of 128), llama.cpp starts inference, but during inference the programme continues to occupy the disk, reading at 40 MB/s. Generation is also extremely slow, at around 10 minutes per token.
However, with a 30B model or smaller, llama.cpp works as expected.

Environment and Context

Note: my inference was done using oobabooga's text-generation-webui implementation of llama.cpp, as I have no idea how to use llama.cpp by itself...

  • Physical (or virtual) hardware you are using, e.g. for Linux:

CPU: Ryzen 5500
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

RAM: 64GB of DDR4 running at 3000MHz
Disk where I stored my model file: 2× Barracuda 1TB HDDs in RAID 1 configuration
System SSD: NV2 500GB

  • Operating System, e.g. for Linux:

Linux fgdfgfthgr-MS-7C95 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
Python 3.9.13
GNU Make 4.3
g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

Failure Information (for bugs)

Not sure what other information there is to provide.

Steps to Reproduce

  1. Load a 65B model using oobabooga's text-generation-webui implementation of llama.cpp.
  2. Use iostat -y -d 5 to monitor disk activity during loading and inference (see the sketch below).
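
For reference, a minimal monitoring sketch (assuming Ubuntu's sysstat package provides iostat; device names will differ per machine):

# Report per-device throughput every 5 seconds, skipping the since-boot summary (-y).
sudo apt install sysstat
iostat -y -d 5
# Sustained kB_read/s on the disk holding the model during inference means the
# weights are being re-read from disk rather than served from the page cache.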

Failure Logs

Llama.cpp version:

https://pypi.org/project/llamacpp/
0.1.11

Pip environment:

accelerate               0.18.0
aiofiles                 23.1.0
aiohttp                  3.8.4
aiosignal                1.3.1
altair                   4.2.2
anyio                    3.6.2
async-timeout            4.0.2
attrs                    22.2.0
bitsandbytes             0.37.2
certifi                  2022.12.7
charset-normalizer       3.1.0
click                    8.1.3
cmake                    3.26.1
contourpy                1.0.7
cycler                   0.11.0
datasets                 2.11.0
dill                     0.3.6
entrypoints              0.4
fastapi                  0.95.0
ffmpy                    0.3.0
filelock                 3.10.7
flexgen                  0.1.7
fonttools                4.39.3
frozenlist               1.3.3
fsspec                   2023.3.0
gradio                   3.24.0
gradio_client            0.0.5
h11                      0.14.0
httpcore                 0.16.3
httpx                    0.23.3
huggingface-hub          0.13.3
idna                     3.4
Jinja2                   3.1.2
jsonschema               4.17.3
kiwisolver               1.4.4
linkify-it-py            2.0.0
lit                      16.0.0
llamacpp                 0.1.11
Markdown                 3.4.3
markdown-it-py           2.2.0
MarkupSafe               2.1.2
matplotlib               3.7.1
mdit-py-plugins          0.3.3
mdurl                    0.1.2
mpmath                   1.3.0
multidict                6.0.4
multiprocess             0.70.14
networkx                 3.0
numpy                    1.24.2
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
orjson                   3.8.9
packaging                23.0
pandas                   1.5.3
peft                     0.2.0
Pillow                   9.4.0
pip                      23.0.1
psutil                   5.9.4
PuLP                     2.7.0
pyarrow                  11.0.0
pydantic                 1.10.7
pydub                    0.25.1
pyparsing                3.0.9
pyrsistent               0.19.3
python-dateutil          2.8.2
python-multipart         0.0.6
pytz                     2023.3
PyYAML                   6.0
quant-cuda               0.0.0
regex                    2023.3.23
requests                 2.28.2
responses                0.18.0
rfc3986                  1.5.0
rwkv                     0.7.1
safetensors              0.3.0
semantic-version         2.10.0
sentencepiece            0.1.97
setuptools               65.6.3
six                      1.16.0
sniffio                  1.3.0
starlette                0.26.1
sympy                    1.11.1
tokenizers               0.13.2
toolz                    0.12.0
torch                    2.0.0
torchaudio               2.0.1
torchvision              0.15.1
tqdm                     4.65.0
transformers             4.28.0.dev0
triton                   2.0.0
typing_extensions        4.5.0
uc-micro-py              1.0.1
urllib3                  1.26.15
uvicorn                  0.21.1
websockets               10.4
wheel                    0.38.4
xxhash                   3.2.0
yarl                     1.8.2

md5sum ggml-model-q4_0.bin
3073a8eedd1252063ad9b440af7c90cc ggml-model-q4_1.bin

jart commented Apr 2, 2023

There are two issues being reported here:

  1. Suboptimal disk throughput during first-run loading. In my experience, the Linux kernel sometimes isn't very good at making major page faults go fast on commodity machines under memory pressure, for instance when loading a 65B model on a system with 64GB of RAM, especially if you have things like X and Chrome open at the same time. You're cutting it pretty tight. It's sort of like being asked to dance inside a cage that's just big enough for your body: it wouldn't be a stunning performance.

  2. An inference-time performance regression, possibly due to swapping and memory pressure. In this case you may want to consider the --mlock flag, which forces the model's memory to stay resident rather than be swapped out; you need root privileges (or a raised memlock limit) to use it (see the example invocation below). Please note that rather than making LLaMA load slower, this might just nuke your system instead if the memory pressure really is that high.

Let me know if any of the above suggestions work for you! We should ideally be able to stretch our RAM budgets as far as possible. Being able to operate in tight constraints is a hallmark of good engineering. So I'd like to see us be able to do as much as possible for you. I just don't know how much we can do.
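
For reference, a minimal sketch of such a run (the model path, thread count, and prompt here are illustrative, not taken from this issue):

# Raise the locked-memory limit for this shell, then run with --mlock so the
# mapped model pages cannot be swapped out. Needs root or a raised RLIMIT_MEMLOCK.
sudo bash -c 'ulimit -l unlimited && ./main -m ./models/65B/ggml-model-q4_0.bin --mlock -t 6 -p "Hello"'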

jart added the linux, need more info, and performance labels on Apr 2, 2023
jart self-assigned this on Apr 2, 2023
diimdeep commented Apr 2, 2023

I observe similar behavior with 78ca983 with a 7B model on an 8 GB RAM macOS machine (Haswell, 2 cores).
If there is not quite enough free memory for the model to fully load, performance drops from 600 ms/token down to 3 minutes/token. Instead of CPU-bound performance with zero disk activity, I observe disk-bound performance (30-90% CPU instead of ~190%, and a constant 160 MB/s read from disk).

Details
vmmap <pid>

ReadOnly portion of Libraries: Total=414.3M resident=8932K(2%) swapped_out_or_unallocated=405.6M(98%)
Writable regions: Total=2.1G written=2.0G(97%) resident=26.2M(1%) swapped_out=2.0G(96%) unallocated=52.2M(2%)

                                VIRTUAL RESIDENT    DIRTY  SWAPPED VOLATILE   NONVOL    EMPTY   REGION
REGION TYPE                        SIZE     SIZE     SIZE     SIZE     SIZE     SIZE     SIZE    COUNT (non-coalesced)
===========                     ======= ========    =====  ======= ========   ======    =====  =======
Dispatch continuations            8192K       0K       0K     136K       0K       0K       0K        1
Kernel Alloc Once                    8K       0K       0K       4K       0K       0K       0K        1
MALLOC guard page                   16K       0K       0K       0K       0K       0K       0K        4
MALLOC metadata                     44K      44K      44K       0K       0K       0K       0K        5
MALLOC_LARGE                       2.0G    25.4M    25.4M     2.0G       0K       0K       0K       21         see MALLOC ZONE table below
MALLOC_LARGE (empty)              1820K     252K     252K    1568K       0K       0K       0K        6         see MALLOC ZONE table below
MALLOC_LARGE metadata                4K       4K       4K       0K       0K       0K       0K        1         see MALLOC ZONE table below
MALLOC_SMALL                      32.0M      44K      44K      80K       0K       0K       0K        4         see MALLOC ZONE table below
MALLOC_TINY                       4096K      72K      72K    1524K       0K       0K       0K        4         see MALLOC ZONE table below
STACK GUARD                       56.0M       0K       0K       0K       0K       0K       0K        5
Stack                             10.0M     136K     136K      72K       0K       0K       0K        6
__DATA                            1383K     515K     350K     478K       0K       0K       0K       53
__DATA_CONST                        36K       8K       8K      16K       0K       0K       0K        2
__LINKEDIT                       388.8M    3704K       0K       0K       0K       0K       0K        4
__OBJC_RO                         32.3M    17.4M       0K       0K       0K       0K       0K        1
__OBJC_RW                         1908K      96K       0K       8K       0K       0K       0K        2
__TEXT                            25.6M    5228K       0K       0K       0K       0K       0K       54
mapped file                        3.9G     2.2G       0K       0K       0K       0K       0K        1
shared memory                        8K       8K       8K       0K       0K       0K       0K        2
unused but dirty shlib __DATA        4K     1935     1935     2575       0K       0K       0K       14
===========                     ======= ========    =====  ======= ========   ======    =====  =======
TOTAL                              6.5G     2.3G    26.3M     2.0G       0K       0K       0K      191

                                 VIRTUAL   RESIDENT      DIRTY    SWAPPED ALLOCATION      BYTES DIRTY+SWAP          REGION
MALLOC ZONE                         SIZE       SIZE       SIZE       SIZE      COUNT  ALLOCATED  FRAG SIZE  % FRAG   COUNT
===========                      =======  =========  =========  =========  =========  =========  =========  ======  ======
DefaultMallocZone_0x1013c5000       2.0G      25.5M      25.5M       2.0G      32701       2.0G       168K      1%      30



bash -c "~/Downloads/rusage ./main -m ./models/ggml-model-q4_0.bin -n 4 -t 2 -p \"I 42 you\""
main: seed = 1680443963
llama_model_load: loading model from './models/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 2 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 4, n_keep = 0


 I'm planning on
llama_print_timings:        load time = 33297.68 ms
llama_print_timings:      sample time =     7.26 ms /     4 runs   (    1.81 ms per run)
llama_print_timings: prompt eval time = 31264.39 ms /     2 tokens (15632.20 ms per token)
llama_print_timings:        eval time = 92385.20 ms /     3 runs   (30795.07 ms per run)
llama_print_timings:       total time = 125690.51 ms
RL: took 125,564,060µs wall time
RL: ballooned to 3,429,640kb in size
RL: needed 56,867,096µs cpu (71% kernel)
RL: caused 4,561,725 page faults (12% memcpy)
RL: 1,506,610 context switches (0% consensual)

and here is when it is fully loaded

Details
vmmap <pid>

ReadOnly portion of Libraries: Total=414.3M resident=7576K(2%) swapped_out_or_unallocated=406.9M(98%)
Writable regions: Total=2.0G written=2.0G(98%) resident=289.4M(14%) swapped_out=1.7G(84%) unallocated=34.9M(2%)

                                VIRTUAL RESIDENT    DIRTY  SWAPPED VOLATILE   NONVOL    EMPTY   REGION
REGION TYPE                        SIZE     SIZE     SIZE     SIZE     SIZE     SIZE     SIZE    COUNT (non-coalesced)
===========                     ======= ========    =====  ======= ========   ======    =====  =======
Kernel Alloc Once                    8K       4K       4K       0K       0K       0K       0K        1
MALLOC guard page                   16K       0K       0K       0K       0K       0K       0K        4
MALLOC metadata                     44K      44K      44K       0K       0K       0K       0K        5
MALLOC_LARGE                       2.0G   287.3M   287.3M     1.7G       0K       0K       0K       21         see MALLOC ZONE table below
MALLOC_LARGE (empty)              1328K     728K     728K     600K       0K       0K       0K        3         see MALLOC ZONE table below
MALLOC_LARGE metadata                4K       4K       4K       0K       0K       0K       0K        1         see MALLOC ZONE table below
MALLOC_SMALL                      16.0M      28K      28K       8K       0K       0K       0K        2         see MALLOC ZONE table below
MALLOC_SMALL (empty)              8192K       4K       4K      48K       0K       0K       0K        1         see MALLOC ZONE table below
MALLOC_TINY                       4096K     888K     888K     708K       0K       0K       0K        4         see MALLOC ZONE table below
STACK GUARD                       56.0M       0K       0K       0K       0K       0K       0K        2
Stack                             8712K     120K     120K       0K       0K       0K       0K        3
__DATA                            1383K     578K     382K     438K       0K       0K       0K       53
__DATA_CONST                        36K      16K       8K      16K       0K       0K       0K        2
__LINKEDIT                       388.8M    2256K       0K       0K       0K       0K       0K        4
__OBJC_RO                         32.3M    20.3M       0K       0K       0K       0K       0K        1
__OBJC_RW                         1908K     376K       0K       8K       0K       0K       0K        2
__TEXT                            25.6M    5320K       0K       0K       0K       0K       0K       54
mapped file                        3.9G     3.8G       0K       0K       0K       0K       0K        1
shared memory                        8K       8K       8K       0K       0K       0K       0K        2
unused but dirty shlib __DATA        4K     2150     2150     2320       0K       0K       0K       13
===========                     ======= ========    =====  ======= ========   ======    =====  =======
TOTAL                              6.5G     4.2G   289.5M     1.7G       0K       0K       0K      179

                                 VIRTUAL   RESIDENT      DIRTY    SWAPPED ALLOCATION      BYTES DIRTY+SWAP          REGION
MALLOC ZONE                         SIZE       SIZE       SIZE       SIZE      COUNT  ALLOCATED  FRAG SIZE  % FRAG   COUNT
===========                      =======  =========  =========  =========  =========  =========  =========  ======  ======
DefaultMallocZone_0x10e173000       2.0G     288.2M     288.2M       1.7G      32698       2.0G       139K      1%      29



bash -c "~/Downloads/rusage ./main -m ./models/ggml-model-q4_0.bin -s 42 --ignore-eos --keep -1 -n 16 -t 2 -p \"I love you so much that I would rather die than live without you.\""
main: seed = 42
llama_model_load: loading model from './models/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/ggml-model-q4_0.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 2 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 16, n_keep = 16


 I love you so much that I would rather die than live without you.
I have to admit, this is one of my favourite quotes – and
llama_print_timings:        load time = 10063.17 ms
llama_print_timings:      sample time =    22.49 ms /    16 runs   (    1.41 ms per run)
llama_print_timings: prompt eval time = 12303.84 ms /    16 tokens (  768.99 ms per token)
llama_print_timings:        eval time =  9237.18 ms /    15 runs   (  615.81 ms per run)
llama_print_timings:       total time = 23651.55 ms
RL: took 23,724,737µs wall time
RL: ballooned to 4,296,232kb in size
RL: needed 40,670,457µs cpu (7% kernel)
RL: caused 1,541,050 page faults (94% memcpy)
RL: 58,213 context switches (0% consensual)

jart commented Apr 2, 2023

Thank you for sharing such rich technical details @diimdeep. Have you evaluated our --mlock flag under these conditions? Under memory pressure, obviously something on the system is going to have to pay, and using --mlock can help ensure it isn't LLaMA. But it's an unusual situation to both test and find oneself in. So it'd be great to hear feedback on whether or not that flag is helping.

diimdeep commented Apr 2, 2023

Yeah, --mlock helps in that regard when you're using other apps while the model is running and memory is scarce. But I had to tweak the system to make it work:

sudo sysctl vm.global_no_user_wire_amount=1319282340 vm.user_wire_limit=7270652252 vm.global_user_wire_limit=7270652252

vm.global_no_user_wire_amount: 2319282340 -> 1319282340
vm.user_wire_limit: 6270652252 -> 7270652252
vm.global_user_wire_limit: 7270652252 -> 7270652252
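
(To confirm the new limits took effect, the same keys can simply be read back; a small sketch:)

# Read back the wire-limit knobs after changing them.
sysctl vm.global_no_user_wire_amount vm.user_wire_limit vm.global_user_wire_limit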

fgdfgfthgr-fox commented Apr 3, 2023

> Let me know if any of the above suggestions work for you! We should ideally be able to stretch our RAM budgets as far as possible. Being able to operate in tight constraints is a hallmark of good engineering. So I'd like to see us be able to do as much as possible for you. I just don't know how much we can do.

@jart

[screenshot 2023-04-03 13-11-30: system monitor showing RAM fully used while the 65B model loads]
Just checked: the system did run out of RAM when loading the 65B model, hence the bad performance. However, the 4-bit 65B model should only require around 38.5 GB of RAM (a rough check below), and I have 64GB here; 25.5GB of spare RAM should be plenty for any reasonable overhead plus Firefox plus the Linux kernel.
Note: there were no background programmes running in the screenshot above.
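
(As a back-of-envelope check of that figure, assuming roughly 4.5 effective bits per weight for 4-bit quantization with group scales:)

# 65e9 weights * 4.5 bits / 8 bits-per-byte ≈ 36.6 GB of weights, which is
# consistent with ~38.5 GB once the KV cache and scratch buffers are added.
echo "65 * 4.5 / 8" | bc -l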

fgdfgfthgr-fox commented:
And I doubt adding --mlock would help in that regard, as it doesn't change the fact that most RAM is used up by buffers rather than cache (the split can be watched as sketched below). Maybe there is a bug in the webui's code or llama.cpp's code causing the buffer usage?
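
(A hedged sketch of how to watch the buffers/cache split on Linux while the model loads:)

# "free" lumps buffers and cache into one buff/cache column; /proc/meminfo separates them.
free -h
grep -E '^(Buffers|Cached)' /proc/meminfo
# A growing Buffers line with little Cached growth would point at block-device
# buffering rather than file-backed page cache.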

fgdfgfthgr-fox changed the title from "Disk bottleneck in 65B model" to "65B model overflow 64GB of RAM" on Apr 3, 2023
fgdfgfthgr-fox changed the title from "65B model overflow 64GB of RAM" to "4bit 65B model overflow 64GB of RAM" on Apr 4, 2023
fgdfgfthgr-fox commented:
Seems to be fixed at this point.
