DRAFT: AVX2: Potential additional performance increases #749


Closed
wants to merge 5 commits

Conversation

SebastianApel
Contributor

@SebastianApel SebastianApel commented Apr 3, 2023

Hi @sw, @rabidcopy, @ggerganov, @Ameobea, @howard0su and everybody else interested,

Would you be willing to give me some feedback on this PR?

PLEASE NOTE: The PR is still in EARLY DRAFT / experimental and NOT ready to be merged yet. It needs significant cleanup.

However, I personally think it is promising and worth a discussion.

The modifications in this PR change the matrix multiplication from the "dot vector" approach to something closer to a "tiled matrix multiplication" approach.

"Tiled matrix multiplication" is supposed to be more cache efficient than the "dot vector" approach.
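Schematically (an editor's sketch with hypothetical names, not the actual ggml kernel), tiling means accumulating a small TX x TY block of outputs at once, so each loaded input element is reused across the tile instead of being fetched from memory again:

```c
#include <assert.h>

/* Illustrative only -- not the actual ggml code. Computes C = A * B
 * (A is M x K, B is K x N, all row-major) using TX x TY output tiles,
 * so each loaded element of A is reused TY times and each element of B
 * is reused TX times while they are still hot in registers/cache. */
#define TX 2
#define TY 2

static void matmul_tiled(int M, int N, int K,
                         const float *A, const float *B, float *C) {
    assert(M % TX == 0 && N % TY == 0);  /* same restriction the PR runs into */
    for (int i = 0; i < M; i += TX) {
        for (int j = 0; j < N; j += TY) {
            float acc[TX][TY] = {{0}};   /* the tile of accumulators */
            for (int k = 0; k < K; k++) {
                for (int tx = 0; tx < TX; tx++)
                    for (int ty = 0; ty < TY; ty++)
                        acc[tx][ty] += A[(i+tx)*K + k] * B[k*N + (j+ty)];
            }
            for (int tx = 0; tx < TX; tx++)
                for (int ty = 0; ty < TY; ty++)
                    C[(i+tx)*N + (j+ty)] = acc[tx][ty];
        }
    }
}
```

The "dot vector" approach is the special case TX = TY = 1; larger tiles trade register pressure for data reuse, which is why the benchmark script sweeps several tile sizes.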

The good news:

My questions are:

  • Would you be willing to execute this branch on your machines and share your results? (see HOWTO below)
  • Do you think it's worth exploring this direction further?

As I said, the code in its current form is NOT ready to be merged.

But before I spend more time on it, I would appreciate some feedback/thoughts on your side.

HOWTO run the benchmarks / test cases (Linux only):

  • Review the script "run_benchmarks.sh"
  • Execute the script "run_benchmarks.sh" (it executes benchmarks with several tile sizes)
  • If you want: upload the created tar.gz archive with the benchmark results to this thread.

@SebastianApel SebastianApel marked this pull request as draft April 3, 2023 20:51
@rabidcopy
Contributor

rabidcopy commented Apr 3, 2023

Running the benchmarks now. Will share results when it finishes. Edit: Here it is. Several combinations seemed to crash. Here are the top ones, to make it easier to see which combinations were fastest on my machine. Going to go with 8x1 and compare speed with current master. Edit: Ah, it doesn't really seem to work with thread counts other than 2/4/8 for me without crashing, 6 being the sweet spot on my 6-core/12-thread CPU.

./benchmark-main-threads-2-tilesize-2x2.txt:llama_print_timings:        eval time = 41285.27 ms /    99 runs   (  417.02 ms per run)
./benchmark-main-threads-2-tilesize-4x8.txt:llama_print_timings:        eval time = 41553.63 ms /    99 runs   (  419.73 ms per run)
./benchmark-main-threads-2-tilesize-8x1.txt:llama_print_timings:        eval time = 40661.64 ms /    99 runs   (  410.72 ms per run)
./benchmark-main-threads-2-tilesize-1x8.txt:llama_print_timings:        eval time = 40860.37 ms /    99 runs   (  412.73 ms per run)
./benchmark-main-threads-2-tilesize-4x1.txt:llama_print_timings:        eval time = 40891.76 ms /    99 runs   (  413.05 ms per run)

benchmark-results-023ced9dd49b4aabacdad4eb281af83a-20230403-165819.zip

@Ameobea
Contributor

Ameobea commented Apr 3, 2023

I get tons of compiler errors when trying to build this branch. Compiling on Debian Linux. Fails with both GCC 12.2 and Clang 13.

Compiler Output
llama.cpp (tiled_mat_mult) » make                                                                                                                                           /opt/llama.cpp
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  unknown
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC  -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -D EXPERIMENT_TILESIZE_X= -D EXPERIMENT_TILESIZE_Y= -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -D EXPERIMENT_TILESIZE_X= -D EXPERIMENT_TILESIZE_Y= -pthread
I LDFLAGS:
I CC:       cc (Debian 12.2.0-14) 12.2.0
I CXX:      g++ (Debian 12.2.0-14) 12.2.0

cc  -I.              -O3 -std=c11   -fPIC  -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -D EXPERIMENT_TILESIZE_X= -D EXPERIMENT_TILESIZE_Y= -pthread -march=native -mtune=native   -c ggml.c -o ggml.o
ggml.c: In function ‘seap_ggml_vec_dot_q4_0’:
ggml.c:2228:12: error: array type has incomplete element type ‘__m256[]’
 2228 |     __m256 acc[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y]; // = 0; // _mm256_setzero_ps();
      |            ^~~
ggml.c:2228:12: note: declaration of ‘acc’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2230:43: error: expected expression before ‘;’ token
 2230 |     for (int tx=0;tx<EXPERIMENT_TILESIZE_X; tx++) {
      |                                           ^
ggml.c:2231:23: error: ‘ty’ undeclared (first use in this function); did you mean ‘tx’?
 2231 |         for (int ty=0;ty<EXPERIMENT_TILESIZE_Y; ty++) {
      |                       ^~
      |                       tx
ggml.c:2231:23: note: each undeclared identifier is reported only once for each function it appears in
ggml.c:2231:47: error: expected expression before ‘;’ token
 2231 |         for (int ty=0;ty<EXPERIMENT_TILESIZE_Y; ty++) {
      |                                               ^
ggml.c:2272:51: error: expected expression before ‘;’ token
 2272 |             for (int tx=0;tx<EXPERIMENT_TILESIZE_X;tx++) {
      |                                                   ^
ggml.c:2277:25: error: array size missing in ‘x_low_q’
 2277 |                 __m256i x_low_q[EXPERIMENT_TILESIZE_X];
      |                         ^~~~~~~
ggml.c:2278:86: error: ‘x_high_q’ undeclared (first use in this function)
 2278 |                 EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(x[i+u+tx*rowlength_x].qs, x_high_q, x_low_q, tx)
      |                                                                                      ^~~~~~~~
ggml.c:2258:9: note: in definition of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2258 |         OUT_HIGH[INDEX_Y] = _mm256_srli_epi16( pre_shift, 4 );        \
      |         ^~~~~~~~
ggml.c:2288:24: error: array type has incomplete element type ‘__m256[]’
 2288 |                 __m256 scale[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                        ^~~~~
ggml.c:2288:24: note: declaration of ‘scale’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2290:26: warning: declaration of ‘ty’ shadows previous non-variable [-Wshadow]
 2290 |                 for (int ty=0;ty<EXPERIMENT_TILESIZE_Y;ty++) {
      |                          ^~
ggml.c:2290:55: error: expected expression before ‘;’ token
 2290 |                 for (int ty=0;ty<EXPERIMENT_TILESIZE_Y;ty++) {
      |                                                       ^
ggml.c:2307:29: error: array type has incomplete element type ‘__m256i[]’
 2307 |                     __m256i y_high_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~
ggml.c:2307:29: note: declaration of ‘y_high_q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2308:29: error: array type has incomplete element type ‘__m256i[]’
 2308 |                     __m256i y_low_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~
ggml.c:2308:29: note: declaration of ‘y_low_q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2249:23: warning: declaration of ‘tmp’ shadows a previous local [-Wshadow]
 2249 |         const __m128i tmp =                                          \
      |                       ^~~
ggml.c:2310:21: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2310 |                     EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(y[i+u+ty*rowlength_y].qs, y_high_q[tx], y_low_q[tx], ty)
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2249:23: note: shadowed declaration is here
 2249 |         const __m128i tmp =                                          \
      |                       ^~~
ggml.c:2278:17: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2278 |                 EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(x[i+u+tx*rowlength_x].qs, x_high_q, x_low_q, tx)
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2253:23: warning: declaration of ‘bytes’ shadows a previous local [-Wshadow]
 2253 |         const __m256i bytes = _mm256_cvtepu8_epi16(tmp);  \
      |                       ^~~~~
ggml.c:2310:21: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2310 |                     EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(y[i+u+ty*rowlength_y].qs, y_high_q[tx], y_low_q[tx], ty)
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2253:23: note: shadowed declaration is here
 2253 |         const __m256i bytes = _mm256_cvtepu8_epi16(tmp);  \
      |                       ^~~~~
ggml.c:2278:17: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2278 |                 EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(x[i+u+tx*rowlength_x].qs, x_high_q, x_low_q, tx)
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2256:23: warning: declaration of ‘pre_shift’ shadows a previous local [-Wshadow]
 2256 |         const __m256i pre_shift =                                    \
      |                       ^~~~~~~~~
ggml.c:2310:21: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2310 |                     EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(y[i+u+ty*rowlength_y].qs, y_high_q[tx], y_low_q[tx], ty)
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2256:23: note: shadowed declaration is here
 2256 |         const __m256i pre_shift =                                    \
      |                       ^~~~~~~~~
ggml.c:2278:17: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2278 |                 EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(x[i+u+tx*rowlength_x].qs, x_high_q, x_low_q, tx)
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2313:29: error: array type has incomplete element type ‘__m256i[]’
 2313 |                     __m256i xy_high_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~~
ggml.c:2313:29: note: declaration of ‘xy_high_q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2316:29: error: array type has incomplete element type ‘__m256i[]’
 2316 |                     __m256i xy_low_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~
ggml.c:2316:29: note: declaration of ‘xy_low_q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2320:29: error: array type has incomplete element type ‘__m256i[]’
 2320 |                     __m256i xy_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~
ggml.c:2320:29: note: declaration of ‘xy_q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2324:28: error: array type has incomplete element type ‘__m256[]’
 2324 |                     __m256 q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                            ^
ggml.c:2324:28: note: declaration of ‘q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2324:28: warning: unused variable ‘q’ [-Wunused-variable]
ggml.c:2320:29: warning: unused variable ‘xy_q’ [-Wunused-variable]
 2320 |                     __m256i xy_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~
ggml.c:2316:29: warning: unused variable ‘xy_low_q’ [-Wunused-variable]
 2316 |                     __m256i xy_low_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~
ggml.c:2313:29: warning: unused variable ‘xy_high_q’ [-Wunused-variable]
 2313 |                     __m256i xy_high_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~~
ggml.c:2308:29: warning: unused variable ‘y_low_q’ [-Wunused-variable]
 2308 |                     __m256i y_low_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~
ggml.c:2307:29: warning: unused variable ‘y_high_q’ [-Wunused-variable]
 2307 |                     __m256i y_high_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~
ggml.c:2288:24: warning: unused variable ‘scale’ [-Wunused-variable]
 2288 |                 __m256 scale[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                        ^~~~~
ggml.c:2336:43: error: expected expression before ‘;’ token
 2336 |     for (int tx=0;tx<EXPERIMENT_TILESIZE_X;tx++) {
      |                                           ^
ggml.c:2337:47: error: expected expression before ‘;’ token
 2337 |         for (int ty=0;ty<EXPERIMENT_TILESIZE_Y;ty++) {
      |                                               ^
ggml.c:2340:13: error: ‘res’ undeclared (first use in this function)
 2340 |             res = _mm_add_ps( res, _mm256_castps256_ps128( acc[tx][ty] ) );
      |             ^~~
ggml.c:2228:12: warning: unused variable ‘acc’ [-Wunused-variable]
 2228 |     __m256 acc[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y]; // = 0; // _mm256_setzero_ps();
      |            ^~~
ggml.c:2201:11: warning: unused variable ‘sumf’ [-Wunused-variable]
 2201 |     float sumf = 0.0;
      |           ^~~~
ggml.c:2191:66: warning: parameter ‘s’ set but not used [-Wunused-but-set-parameter]
 2191 | static void seap_ggml_vec_dot_q4_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy,
      |                                                 ~~~~~~~~~~~~~~~~~^
ggml.c:2192:65: warning: unused parameter ‘dst_stridelength_x’ [-Wunused-parameter]
 2192 |         const int rowlength_x, const int rowlength_y, const int dst_stridelength_x, const int dst_stridelength_y) {
      |                                                       ~~~~~~~~~~^~~~~~~~~~~~~~~~~~
ggml.c: In function ‘tensor_sum_elements’:
ggml.c:6745:23: warning: unused variable ‘p’ [-Wunused-variable]
 6745 |                 void *p = &((float *) tensor->data)[j*tensor->ne[0]+k];
      |                       ^
In file included from ggml.c:12:
ggml.c: In function ‘ggml_compute_forward_mul_mat_q_f32’:
ggml.c:6944:46: error: expected expression before ‘==’ token
 6944 |     assert((ir1-ir0) % EXPERIMENT_TILESIZE_X == 0);
      |                                              ^~
ggml.c:6946:41: error: expected expression before ‘;’ token
 6946 |     int x_stride = EXPERIMENT_TILESIZE_X;
      |                                         ^
ggml.c:6947:37: error: expected expression before ‘)’ token
 6947 |     if (ne11 < EXPERIMENT_TILESIZE_Y) {
      |                                     ^
ggml.c:6991:41: error: expected expression before ‘)’ token
 6991 |         if (ne11 < EXPERIMENT_TILESIZE_Y) {
      |                                         ^
ggml.c:6994:34: error: ‘ic’ undeclared (first use in this function); did you mean ‘i3’?
 6994 |             for (int64_t ic = 0; ic < ne11; ++ic) {
      |                                  ^~
      |                                  i3
ggml.c:7000:46: error: expected expression before ‘)’ token
 7000 |             if ((ne11 % EXPERIMENT_TILESIZE_Y) !=  0) {
      |                                              ^
ggml.c:7001:31: warning: format ‘%i’ expects argument of type ‘int’, but argument 2 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
 7001 |                 printf("ne11=%i\n",ne11);
      |                              ~^    ~~~~
      |                               |    |
      |                               int  int64_t {aka long int}
      |                              %li
ggml.c:7003:49: error: expected expression before ‘)’ token
 7003 |             assert((ne11 % EXPERIMENT_TILESIZE_Y) ==  0); // make sure we have a multiple of the tilesize
      |                                                 ^
ggml.c:7005:26: warning: declaration of ‘ic’ shadows previous non-variable [-Wshadow]
 7005 |             for (int64_t ic = 0; ic < ne11; ic+=EXPERIMENT_TILESIZE_Y) {
      |                          ^~
ggml.c:7005:70: error: expected expression before ‘)’ token
 7005 |             for (int64_t ic = 0; ic < ne11; ic+=EXPERIMENT_TILESIZE_Y) {
      |                                                                      ^
make: *** [Makefile:145: ggml.o] Error 1

@howard0su
Collaborator

I like your idea. Actually, it is a common optimization direction in BLAS libraries. I can help you run it, but my main dev box doesn't have AVX2, only AVX (a very old E5). If you can port it to AVX, I will help you collect data for sure.

@x02Sylvie

When compiling on Windows, MSVC says that m128i_u does not exist. I swapped it to m128i and it seems to compile fine (not sure if that's going to break something or not).

Other than that, I get crashes with thread counts other than 8. I usually went with 14, since 16 is somehow slower than 14 threads on my machine.

@diimdeep

diimdeep commented Apr 4, 2023

for mac

sysctl -a machdep.cpu > benchmark-results/cpuinfo.txt
MACHINE_ID=$(ioreg -rd1 -c IOPlatformExpertDevice | awk '/IOPlatformUUID/ { split($0, line, "\""); printf("%s\n", line[4]); }')
benchmark-main-threads-2-tilesize-1x1.txt:llama_print_timings:        eval time = 85992.81 ms /    99 runs   (  868.61 ms per run)
benchmark-main-threads-2-tilesize-1x2.txt:llama_print_timings:        eval time = 41981.91 ms /    99 runs   (  424.06 ms per run)
benchmark-main-threads-2-tilesize-1x8.txt:llama_print_timings:        eval time = 38906.24 ms /    99 runs   (  392.99 ms per run)
benchmark-main-threads-2-tilesize-2x1.txt:llama_print_timings:        eval time = 42767.35 ms /    99 runs   (  431.99 ms per run)
benchmark-main-threads-2-tilesize-2x2.txt:llama_print_timings:        eval time = 37795.00 ms /    99 runs   (  381.77 ms per run)
benchmark-main-threads-2-tilesize-2x8.txt:llama_print_timings:        eval time = 38267.93 ms /    99 runs   (  386.54 ms per run)
benchmark-main-threads-2-tilesize-4x1.txt:llama_print_timings:        eval time = 40852.97 ms /    99 runs   (  412.66 ms per run)
benchmark-main-threads-2-tilesize-4x2.txt:llama_print_timings:        eval time = 38172.83 ms /    99 runs   (  385.58 ms per run)
benchmark-main-threads-2-tilesize-4x8.txt:llama_print_timings:        eval time = 39465.25 ms /    99 runs   (  398.64 ms per run)
benchmark-main-threads-2-tilesize-8x1.txt:llama_print_timings:        eval time = 38391.29 ms /    99 runs   (  387.79 ms per run)
benchmark-main-threads-2-tilesize-8x2.txt:llama_print_timings:        eval time = 37773.38 ms /    99 runs   (  381.55 ms per run)
benchmark-main-threads-2-tilesize-8x8.txt:llama_print_timings:        eval time = 38036.29 ms /    99 runs   (  384.20 ms per run)

benchmark-main-threads-2-tilesize-1x4.txt: Building a website can be done in 10 simple steps:Assertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_qAssertion failed: ((ne11 % EXPER_f32, file ggml.c, line 7003.
benchmark-main-threads-2-tilesize-2x4.txt: Building a website can be done in 10 simple steps:Assertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_qAssertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_q_f32, file ggml.c, line 7003.
benchmark-main-threads-2-tilesize-4x4.txt: Building a website can be done in 10 simple steps:Assertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_qAssertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_q_f32, file ggml.c, line 7003.
benchmark-main-threads-2-tilesize-8x4.txt: Building a website can be done in 10 simple steps:Assertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), functioAssertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_q_f32, file ggml.c, line 7003.

benchmark-results-7CB91888-2F90-5973-B733-F2A92C5F3C3C-20230404-081815.tgz

@SebastianApel
Contributor Author

SebastianApel commented Apr 4, 2023

@rabidcopy Thank you for running & sharing.
@x02Sylvie Thank you for trying!

Re: Crashes with thread counts 6 and 14

UPDATE: Fixed with 42ad59f

PREVIOUS TEXT:
The current implementation of the tiles requires the matrix size to be a multiple of the tile size. If you divide the matrix size by a thread count like 6, the resulting slice for that thread does not always have that property. Hence the abort on the assertion.

I assume this can be fixed.
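One way such a restriction is typically lifted (an editor's illustrative sketch with hypothetical names; the actual fix in 42ad59f may work differently) is to process a thread's row slice in full tiles and fall back to a one-row kernel for the remainder, so the slice no longer has to be a multiple of the tile size:

```c
/* Sketch: split the row range [ir0, ir1) into full tiles of TILE rows
 * plus a per-row remainder. kernel_tile / kernel_1row stand in for the
 * tiled and scalar dot-product kernels; here they just count calls. */
#define TILE 4

static int tile_calls   = 0;
static int scalar_calls = 0;

static void kernel_tile(int row0) { (void)row0; tile_calls++;   }
static void kernel_1row(int row)  { (void)row;  scalar_calls++; }

static void mul_mat_rows(int ir0, int ir1) {
    int ir = ir0;
    for (; ir + TILE <= ir1; ir += TILE)   /* full tiles */
        kernel_tile(ir);
    for (; ir < ir1; ir++)                 /* remainder: any slice size works */
        kernel_1row(ir);
}
```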

Would you be willing to share your eval times on the current master, ideally
a) with 2 threads (so it's comparable to this benchmark) and
b) with your "best" threads count (so we know what you can achieve)?

@SebastianApel
Contributor Author

SebastianApel commented Apr 4, 2023

I get tons of compiler errors when trying to build this branch. Compiling on Debian Linux. Fails with both GCC 12.2 and Clang 13.

@Ameobea Thanks for trying! The Makefile expected environment variables that are set in run_benchmarks.sh and failed without them.
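That explains the error pattern above: with the variables unset, the build passed `-D EXPERIMENT_TILESIZE_X=` with nothing after the `=`, which defines the macro as empty, so every use of it expands to nothing (e.g. an array bound disappears, giving the "incomplete element type" errors). A minimal illustration of the empty-definition behavior (editor's sketch, hypothetical macro names):

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification shows what a macro actually expands to. */
#define STR2(x) #x
#define STR(x) STR2(x)

#define EMPTY_TILESIZE      /* equivalent to passing -D EMPTY_TILESIZE=  */
#define GOOD_TILESIZE 8     /* equivalent to passing -D GOOD_TILESIZE=8  */
```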

I added defaults in a33cbbe to the Makefile so you can build from command line without the environment variables.
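A common pattern for such defaults (an editor's sketch; the actual commit a33cbbe may differ) is GNU make's `?=` conditional assignment, which keeps any value exported by the benchmark script while supplying a fallback for plain `make`:

```make
# Defaults so a bare `make` works; run_benchmarks.sh can still override.
EXPERIMENT_TILESIZE_X ?= 8
EXPERIMENT_TILESIZE_Y ?= 1

CFLAGS += -D EXPERIMENT_TILESIZE_X=$(EXPERIMENT_TILESIZE_X) \
          -D EXPERIMENT_TILESIZE_Y=$(EXPERIMENT_TILESIZE_Y)
```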

Would you be willing to re-try?

@SebastianApel
Contributor Author

for mac

@diimdeep Thank you for running & sharing. I'll look into the problem you've discovered.

@SebastianApel
Contributor Author

I like your idea. Actually, it is a common optimization direction in BLAS libraries. I can help you run it, but my main dev box doesn't have AVX2, only AVX (a very old E5). If you can port it to AVX, I will help you collect data for sure.

@howard0su Thanks for your feedback. I'll look into it, but I'm not sure how easy/hard an AVX port is, so no promises :-)

@rabidcopy
Contributor

rabidcopy commented Apr 4, 2023

Here's the benchmark command on master with 2 threads and then 6 threads.
./main -m ../alpaca-7b-native.bin -p "Building a website can be done in 10 simple steps:" -n 100 -t 2 --seed 1

llama_print_timings:        load time =  3407.37 ms
llama_print_timings:      sample time =    64.28 ms /   100 runs   (    0.64 ms per run)
llama_print_timings: prompt eval time =  5022.31 ms /    14 tokens (  358.74 ms per token)
llama_print_timings:        eval time = 41312.51 ms /    99 runs   (  417.30 ms per run)
llama_print_timings:       total time = 46921.83 ms

./main -m ../alpaca-7b-native.bin -p "Building a website can be done in 10 simple steps:" -n 100 -t 6 --seed 1

llama_print_timings:        load time =  2198.91 ms
llama_print_timings:      sample time =    64.25 ms /   100 runs   (    0.64 ms per run)
llama_print_timings: prompt eval time =  2438.03 ms /    14 tokens (  174.14 ms per token)
llama_print_timings:        eval time = 18675.86 ms /    99 runs   (  188.65 ms per run)
llama_print_timings:       total time = 21695.58 ms

Edit: Doing an extensive run through of tile size combinations to see which ones work with 6 threads.

@SebastianApel
Contributor Author

Here's the benchmark command on master with 2 threads and then 6 threads.

@rabidcopy Awesome, thank you.

I think I have fixed the thread=6 problem with 42ad59f.

@rabidcopy
Contributor

rabidcopy commented Apr 4, 2023

Well, here are my findings: running x = 1-12 and y = 1-12 with 6 threads, only these didn't abort. (Done before your fix.)

./benchmark-main-threads-6-tilesize-1x9.txt:llama_print_timings:        eval time = 18480.42 ms /    99 runs   (  186.67 ms per run)
./benchmark-main-threads-6-tilesize-1x1.txt:llama_print_timings:        eval time = 18696.10 ms /    99 runs   (  188.85 ms per run)
./benchmark-main-threads-6-tilesize-1x11.txt:llama_print_timings:        eval time = 19354.17 ms /    99 runs   (  195.50 ms per run)
./benchmark-main-threads-6-tilesize-1x8.txt:llama_print_timings:        eval time = 18542.98 ms /    99 runs   (  187.30 ms per run)
./benchmark-main-threads-6-tilesize-1x12.txt:llama_print_timings:        eval time = 18783.52 ms /    99 runs   (  189.73 ms per run)
./benchmark-main-threads-6-tilesize-1x2.txt:llama_print_timings:        eval time = 18442.82 ms /    99 runs   (  186.29 ms per run)
./benchmark-main-threads-6-tilesize-1x10.txt:llama_print_timings:        eval time = 18607.76 ms /    99 runs   (  187.96 ms per run)

@ggerganov
Member

Yes, improving the dot-product based matrix multiplication with a block based approach would be great.
Please open a new PR after rebasing on latest master and provide latest numbers that you obtain before and after the change.
