### Name and Version

version: 8790 (be76dd0)
### Operating systems

Linux
### Which llama.cpp modules do you know to be affected?

llama-server, llama-bench
### Command line

```shell
llama-bench --model $MODEL -p 0 -n 128 -ngl 0
```
### Problem description & steps to reproduce
The nix flake builds a binary without certain CPU optimizations, resulting in significantly lower performance. In my benchmark, a 13B model running fully on the CPU dropped from roughly 6 t/s to 1.6 t/s.
To reproduce:

- Clone the repository
- Build with `nix build .` (a CPU-only build)
- Test inference speed with `./result/bin/llama-bench --model $MODEL -p 0 -n 128 -ngl 0`
To demonstrate the potential speedup:

- Enter the nix development shell with `nix develop .`
- Query the nix build flags and append the CPU optimizations:

  ```shell
  export FLAGS=$(echo $(nix log . | grep "cmake flags" | cut -c 14-) \
    -DGGML_SSE42=ON \
    -DGGML_AVX=ON \
    -DGGML_AVX2=ON \
    -DGGML_FMA=ON \
    -DGGML_F16C=ON \
    -DGGML_BMI2=ON)
  ```

- Build with `mkdir build; cmake -S . -B build $FLAGS; cmake --build build --config Release -- -j 16` (the thread count is CPU dependent)
- Rerun the test with `LD_LIBRARY_PATH="$PWD/build/bin" ./build/bin/llama-bench --model $MODEL -p 0 -n 128 -ngl 0` and observe a 3-4x speedup.
I tested this on a system with a Ryzen 7 5800X. I identified these specific extra compile flags by comparing the `CMakeCache.txt` of a manual build following the build page (which is fast) against the one generated by the build procedure inside the nix develop shell.
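For reference, that comparison boils down to grepping the `GGML_*` options out of each build's `CMakeCache.txt` and diffing the results. The sketch below is self-contained, using fabricated cache excerpts so it runs anywhere; in practice, point the greps at the `CMakeCache.txt` files of the two real build trees (the paths and flag values here are illustrative only):

```shell
# Fabricated stand-ins for the two build trees' CMake caches.
mkdir -p build-manual build-nix
cat > build-manual/CMakeCache.txt <<'EOF'
GGML_AVX:BOOL=ON
GGML_AVX2:BOOL=ON
GGML_FMA:BOOL=ON
EOF
cat > build-nix/CMakeCache.txt <<'EOF'
GGML_AVX:BOOL=OFF
GGML_AVX2:BOOL=OFF
GGML_FMA:BOOL=OFF
EOF

# Pull out the GGML instruction-set switches from each cache and diff them;
# lines that differ show optimizations one build enabled and the other did not.
grep -E '^GGML_' build-manual/CMakeCache.txt | sort > manual-flags.txt
grep -E '^GGML_' build-nix/CMakeCache.txt | sort > nix-flags.txt
diff manual-flags.txt nix-flags.txt || true
```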
I don't think simply adding these cmake flags to the flake build in `.devops/nix/package.nix` is a good way to fix this: as I understand it, that could cause the build or the resulting binary to fail on CPUs that lack these extensions, though I don't have enough experience with multi-platform development for a confident evaluation. At the same time, the current state means significantly lower performance on modern systems for really no good reason, and users don't even realize it unless they compare the nix flake build against a normal build, as I did. Perhaps CPU optimizations could be offered as optional variants of the existing packages, with the architecture-agnostic version remaining the default to ensure the normal build always works.
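One possible shape for such an opt-in variant, sketched against `.devops/nix/package.nix` (the `enableCpuOptimizations` parameter and the surrounding structure are hypothetical, not existing options in the flake; only the `GGML_*` cmake switches come from the comparison above):

```nix
# Hypothetical sketch, not the actual package.nix structure:
# a parameter that stays false by default, so the architecture-agnostic
# build remains the one produced by plain `nix build .`.
{ lib, enableCpuOptimizations ? false, ... }:
{
  cmakeFlags = lib.optionals enableCpuOptimizations [
    (lib.cmakeBool "GGML_SSE42" true)
    (lib.cmakeBool "GGML_AVX" true)
    (lib.cmakeBool "GGML_AVX2" true)
    (lib.cmakeBool "GGML_FMA" true)
    (lib.cmakeBool "GGML_F16C" true)
    (lib.cmakeBool "GGML_BMI2" true)
  ];
}
```

An optimized package could then be exposed as an override of the default one, leaving the portable binary as the default output.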
Disclaimer: I used generative AI to help narrow down the specific issue and find the relevant difference in my builds. This report was written and tested by me.
### First Bad Commit
I am reasonably certain that the change in #11317 is responsible for CPU optimizations being disabled, but did not test this further.
### Relevant log output
Benchmark with CPU optimization
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 13B IQ4_NL - 4.5 bpw | 6.60 GiB | 12.25 B | BLAS | 8 | tg128 | 5.98 ± 0.01 |
Benchmark without CPU optimization
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| llama 13B IQ4_NL - 4.5 bpw | 6.60 GiB | 12.25 B | BLAS | 8 | tg128 | 1.60 ± 0.00 |