
Mixtral 8x7b models using more memory while loading #6652

Closed as not planned

Description

@RyenNelsen

There appears to be a regression between releases b2586 and b2589. When loading Mixtral 8x7b models with any version newer than b2586, the process uses an abnormally large amount of memory compared to earlier builds. Manually disabling mmap resolves the issue.

Platform:
Windows 11 Pro
64GB RAM
Nvidia 3080

Example command:
.\main.exe -m 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' -p "<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\n"
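Since disabling mmap works around the problem, the same invocation can be run with llama.cpp's --no-mmap flag, which forces the weights to be read into ordinary buffers instead of being memory-mapped. A sketch of the workaround command, assuming the same model path as above:

```shell
# Workaround: load the model with mmap disabled (--no-mmap), which
# avoids the abnormal memory usage seen on b2589 and later.
.\main.exe -m 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' `
    --no-mmap `
    -p "<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\n"
```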

Versions I tested:

b2586, CUDA cu12.2.0 & OpenBLAS
(screenshot: b2586 memory usage graph while loading)

b2589, CUDA cu12.2.0 & OpenBLAS
(screenshot: b2589 memory usage graph while loading, CUDA & OpenBLAS)

b2589, AVX512
(screenshot: b2589 memory usage graph while loading, AVX512)

Diffing log output from b2586 cuda cu12.2.0 and b2589 cuda cu12.2.0 shows the following:
b2586: llm_load_tensors: CPU buffer size = 30735.50 MiB
b2589: llm_load_tensors: CUDA_Host buffer size = 30735.50 MiB

The CUDA_Host buffer type appears to indicate pinned host memory, which suggests the host-side allocation path for the model weights changed between these two builds.
