Description
There appears to be a regression between releases b2586 and b2589. When loading Mixtral 8x7B models with any version later than b2586, memory usage is abnormally high compared to previous versions. Manually disabling mmap resolves the issue.
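As a workaround, mmap can be turned off with the standard `--no-mmap` flag (same model and prompt as the example command below):

```shell
# Same invocation, but with mmap disabled; memory usage returns to normal.
.\main.exe -m 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' --no-mmap -p "<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\n"
```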
Platform:
- Windows 11 Pro
- 64 GB RAM
- Nvidia 3080
Example command:

```shell
.\main.exe -m 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' -p "<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\n"
```
Versions I tested:
- b2586 (CUDA cu12.2.0 and OpenBLAS builds)
- b2589 (CUDA cu12.2.0 and OpenBLAS builds)
Diffing the log output of the b2586 and b2589 CUDA cu12.2.0 builds shows the following change in buffer type:

```
b2586: llm_load_tensors: CPU       buffer size = 30735.50 MiB
b2589: llm_load_tensors: CUDA_Host buffer size = 30735.50 MiB
```