Description
There appears to be a regression between releases b2586 and b2589. When loading Mixtral 8x7B models with any version later than b2586, memory usage is abnormally high compared to previous versions. Manually disabling mmap resolves the issue.
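As a workaround, mmap can be turned off with the standard `--no-mmap` flag (same model and prompt as the example command below):

```shell
# Same invocation, but with mmap disabled; memory usage returns to normal.
.\main.exe -m 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' --no-mmap -p "<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\n"
```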
Platform:
- Windows 11 Pro
- 64 GB RAM
- Nvidia 3080
Example command:

```shell
.\main.exe -m 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' -p "<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\n"
```
Versions I tested:
- b2586 (CUDA cu12.2.0 and OpenBLAS builds)
- b2589 (CUDA cu12.2.0 and OpenBLAS builds)
Diffing the log output of the b2586 and b2589 CUDA cu12.2.0 builds shows the following change in buffer type:

```
b2586: llm_load_tensors: CPU       buffer size = 30735.50 MiB
b2589: llm_load_tensors: CUDA_Host buffer size = 30735.50 MiB
```