DRAFT: AVX2: Potential additional performance increases #749


Closed
wants to merge 5 commits

Conversation

SebastianApel
Contributor

@SebastianApel SebastianApel commented Apr 3, 2023

Hi @sw, @rabidcopy, @ggerganov, @Ameobea, @howard0su and everybody else interested,

Would you be willing to give me some feedback on this PR?

PLEASE NOTE: The PR is still in EARLY DRAFT / experimental and NOT ready to be merged yet. It needs significant cleanup.

However, I personally think it is promising and worth a discussion.

The modifications in this PR change the matrix multiplication from the "dot vector" approach to something closer to a "tiled matrix multiplication" approach.

"Tiled matrix multiplication" is supposed to be more cache efficient than the "dot vector" approach.
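Schematically (an editor's sketch with hypothetical names, not the actual ggml kernel), tiling means accumulating a small TX x TY block of outputs at once, so each loaded input element is reused across the tile instead of being fetched from memory again:

```c
#include <assert.h>

/* Illustrative only -- not the actual ggml code. Computes C = A * B
 * (A is M x K, B is K x N, all row-major) using TX x TY output tiles,
 * so each loaded element of A is reused TY times and each element of B
 * is reused TX times while they are still hot in registers/cache. */
#define TX 2
#define TY 2

static void matmul_tiled(int M, int N, int K,
                         const float *A, const float *B, float *C) {
    assert(M % TX == 0 && N % TY == 0);  /* same restriction the PR runs into */
    for (int i = 0; i < M; i += TX) {
        for (int j = 0; j < N; j += TY) {
            float acc[TX][TY] = {{0}};   /* the tile of accumulators */
            for (int k = 0; k < K; k++) {
                for (int tx = 0; tx < TX; tx++)
                    for (int ty = 0; ty < TY; ty++)
                        acc[tx][ty] += A[(i+tx)*K + k] * B[k*N + (j+ty)];
            }
            for (int tx = 0; tx < TX; tx++)
                for (int ty = 0; ty < TY; ty++)
                    C[(i+tx)*N + (j+ty)] = acc[tx][ty];
        }
    }
}
```

The "dot vector" approach is the special case TX = TY = 1; larger tiles trade register pressure for data reuse, which is why the benchmark script sweeps several tile sizes.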

The good news:

My questions are:

  • Would you be willing to execute this branch on your machines and share your results? (see HOWTO below)
  • Do you think it's worth exploring this direction further?

As I said, the code in its current form is NOT ready to be merged.

But before I spend more time on it, I would appreciate some feedback/thoughts on your side.

HOWTO run the benchmarks / test cases (Linux only):

  • Review the script "run_benchmarks.sh"
  • Execute the script "run_benchmarks.sh" (it executes benchmarks with several tile sizes)
  • If you want: upload the created tar.gz archive with the benchmark results to this thread.

@SebastianApel SebastianApel marked this pull request as draft April 3, 2023 20:51
@rabidcopy
Contributor

rabidcopy commented Apr 3, 2023

Running the benchmarks now. Will share results when it finishes. Edit: Here it is. Several combinations seemed to crash. Here are the top ones, to make it easier to see which combinations were fastest on my machine. Going to go with 8x1 and compare speed with current master. Edit: Ah, it doesn't really seem to work with thread counts other than 2/4/8 for me without crashing, 6 being the sweet spot on my 6-core/12-thread CPU.

./benchmark-main-threads-2-tilesize-2x2.txt:llama_print_timings:        eval time = 41285.27 ms /    99 runs   (  417.02 ms per run)
./benchmark-main-threads-2-tilesize-4x8.txt:llama_print_timings:        eval time = 41553.63 ms /    99 runs   (  419.73 ms per run)
./benchmark-main-threads-2-tilesize-8x1.txt:llama_print_timings:        eval time = 40661.64 ms /    99 runs   (  410.72 ms per run)
./benchmark-main-threads-2-tilesize-1x8.txt:llama_print_timings:        eval time = 40860.37 ms /    99 runs   (  412.73 ms per run)
./benchmark-main-threads-2-tilesize-4x1.txt:llama_print_timings:        eval time = 40891.76 ms /    99 runs   (  413.05 ms per run)

benchmark-results-023ced9dd49b4aabacdad4eb281af83a-20230403-165819.zip

@Ameobea
Contributor

Ameobea commented Apr 3, 2023

I get tons of compiler errors when trying to build this branch. Compiling on Debian Linux. Fails with both GCC 12.2 and Clang 13.

Compiler Output
llama.cpp (tiled_mat_mult) » make                                                                                                                                           /opt/llama.cpp
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  unknown
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC  -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -D EXPERIMENT_TILESIZE_X= -D EXPERIMENT_TILESIZE_Y= -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -D EXPERIMENT_TILESIZE_X= -D EXPERIMENT_TILESIZE_Y= -pthread
I LDFLAGS:
I CC:       cc (Debian 12.2.0-14) 12.2.0
I CXX:      g++ (Debian 12.2.0-14) 12.2.0

cc  -I.              -O3 -std=c11   -fPIC  -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -D EXPERIMENT_TILESIZE_X= -D EXPERIMENT_TILESIZE_Y= -pthread -march=native -mtune=native   -c ggml.c -o ggml.o
ggml.c: In function ‘seap_ggml_vec_dot_q4_0’:
ggml.c:2228:12: error: array type has incomplete element type ‘__m256[]’
 2228 |     __m256 acc[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y]; // = 0; // _mm256_setzero_ps();
      |            ^~~
ggml.c:2228:12: note: declaration of ‘acc’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2230:43: error: expected expression before ‘;’ token
 2230 |     for (int tx=0;tx<EXPERIMENT_TILESIZE_X; tx++) {
      |                                           ^
ggml.c:2231:23: error: ‘ty’ undeclared (first use in this function); did you mean ‘tx’?
 2231 |         for (int ty=0;ty<EXPERIMENT_TILESIZE_Y; ty++) {
      |                       ^~
      |                       tx
ggml.c:2231:23: note: each undeclared identifier is reported only once for each function it appears in
ggml.c:2231:47: error: expected expression before ‘;’ token
 2231 |         for (int ty=0;ty<EXPERIMENT_TILESIZE_Y; ty++) {
      |                                               ^
ggml.c:2272:51: error: expected expression before ‘;’ token
 2272 |             for (int tx=0;tx<EXPERIMENT_TILESIZE_X;tx++) {
      |                                                   ^
ggml.c:2277:25: error: array size missing in ‘x_low_q’
 2277 |                 __m256i x_low_q[EXPERIMENT_TILESIZE_X];
      |                         ^~~~~~~
ggml.c:2278:86: error: ‘x_high_q’ undeclared (first use in this function)
 2278 |                 EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(x[i+u+tx*rowlength_x].qs, x_high_q, x_low_q, tx)
      |                                                                                      ^~~~~~~~
ggml.c:2258:9: note: in definition of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2258 |         OUT_HIGH[INDEX_Y] = _mm256_srli_epi16( pre_shift, 4 );        \
      |         ^~~~~~~~
ggml.c:2288:24: error: array type has incomplete element type ‘__m256[]’
 2288 |                 __m256 scale[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                        ^~~~~
ggml.c:2288:24: note: declaration of ‘scale’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2290:26: warning: declaration of ‘ty’ shadows previous non-variable [-Wshadow]
 2290 |                 for (int ty=0;ty<EXPERIMENT_TILESIZE_Y;ty++) {
      |                          ^~
ggml.c:2290:55: error: expected expression before ‘;’ token
 2290 |                 for (int ty=0;ty<EXPERIMENT_TILESIZE_Y;ty++) {
      |                                                       ^
ggml.c:2307:29: error: array type has incomplete element type ‘__m256i[]’
 2307 |                     __m256i y_high_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~
ggml.c:2307:29: note: declaration of ‘y_high_q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2308:29: error: array type has incomplete element type ‘__m256i[]’
 2308 |                     __m256i y_low_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~
ggml.c:2308:29: note: declaration of ‘y_low_q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2249:23: warning: declaration of ‘tmp’ shadows a previous local [-Wshadow]
 2249 |         const __m128i tmp =                                          \
      |                       ^~~
ggml.c:2310:21: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2310 |                     EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(y[i+u+ty*rowlength_y].qs, y_high_q[tx], y_low_q[tx], ty)
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2249:23: note: shadowed declaration is here
 2249 |         const __m128i tmp =                                          \
      |                       ^~~
ggml.c:2278:17: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2278 |                 EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(x[i+u+tx*rowlength_x].qs, x_high_q, x_low_q, tx)
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2253:23: warning: declaration of ‘bytes’ shadows a previous local [-Wshadow]
 2253 |         const __m256i bytes = _mm256_cvtepu8_epi16(tmp);  \
      |                       ^~~~~
ggml.c:2310:21: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2310 |                     EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(y[i+u+ty*rowlength_y].qs, y_high_q[tx], y_low_q[tx], ty)
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2253:23: note: shadowed declaration is here
 2253 |         const __m256i bytes = _mm256_cvtepu8_epi16(tmp);  \
      |                       ^~~~~
ggml.c:2278:17: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2278 |                 EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(x[i+u+tx*rowlength_x].qs, x_high_q, x_low_q, tx)
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2256:23: warning: declaration of ‘pre_shift’ shadows a previous local [-Wshadow]
 2256 |         const __m256i pre_shift =                                    \
      |                       ^~~~~~~~~
ggml.c:2310:21: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2310 |                     EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(y[i+u+ty*rowlength_y].qs, y_high_q[tx], y_low_q[tx], ty)
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2256:23: note: shadowed declaration is here
 2256 |         const __m256i pre_shift =                                    \
      |                       ^~~~~~~~~
ggml.c:2278:17: note: in expansion of macro ‘EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS’
 2278 |                 EXPAND_32_Q4_NIBBLES_INTO_TWO_M256_VECTORS(x[i+u+tx*rowlength_x].qs, x_high_q, x_low_q, tx)
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2313:29: error: array type has incomplete element type ‘__m256i[]’
 2313 |                     __m256i xy_high_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~~
ggml.c:2313:29: note: declaration of ‘xy_high_q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2316:29: error: array type has incomplete element type ‘__m256i[]’
 2316 |                     __m256i xy_low_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~
ggml.c:2316:29: note: declaration of ‘xy_low_q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2320:29: error: array type has incomplete element type ‘__m256i[]’
 2320 |                     __m256i xy_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~
ggml.c:2320:29: note: declaration of ‘xy_q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2324:28: error: array type has incomplete element type ‘__m256[]’
 2324 |                     __m256 q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                            ^
ggml.c:2324:28: note: declaration of ‘q’ as multidimensional array must have bounds for all dimensions except the first
ggml.c:2324:28: warning: unused variable ‘q’ [-Wunused-variable]
ggml.c:2320:29: warning: unused variable ‘xy_q’ [-Wunused-variable]
 2320 |                     __m256i xy_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~
ggml.c:2316:29: warning: unused variable ‘xy_low_q’ [-Wunused-variable]
 2316 |                     __m256i xy_low_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~
ggml.c:2313:29: warning: unused variable ‘xy_high_q’ [-Wunused-variable]
 2313 |                     __m256i xy_high_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~~
ggml.c:2308:29: warning: unused variable ‘y_low_q’ [-Wunused-variable]
 2308 |                     __m256i y_low_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~
ggml.c:2307:29: warning: unused variable ‘y_high_q’ [-Wunused-variable]
 2307 |                     __m256i y_high_q[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                             ^~~~~~~~
ggml.c:2288:24: warning: unused variable ‘scale’ [-Wunused-variable]
 2288 |                 __m256 scale[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y];
      |                        ^~~~~
ggml.c:2336:43: error: expected expression before ‘;’ token
 2336 |     for (int tx=0;tx<EXPERIMENT_TILESIZE_X;tx++) {
      |                                           ^
ggml.c:2337:47: error: expected expression before ‘;’ token
 2337 |         for (int ty=0;ty<EXPERIMENT_TILESIZE_Y;ty++) {
      |                                               ^
ggml.c:2340:13: error: ‘res’ undeclared (first use in this function)
 2340 |             res = _mm_add_ps( res, _mm256_castps256_ps128( acc[tx][ty] ) );
      |             ^~~
ggml.c:2228:12: warning: unused variable ‘acc’ [-Wunused-variable]
 2228 |     __m256 acc[EXPERIMENT_TILESIZE_X][EXPERIMENT_TILESIZE_Y]; // = 0; // _mm256_setzero_ps();
      |            ^~~
ggml.c:2201:11: warning: unused variable ‘sumf’ [-Wunused-variable]
 2201 |     float sumf = 0.0;
      |           ^~~~
ggml.c:2191:66: warning: parameter ‘s’ set but not used [-Wunused-but-set-parameter]
 2191 | static void seap_ggml_vec_dot_q4_0(const int n, float * restrict s, const void * restrict vx, const void * restrict vy,
      |                                                 ~~~~~~~~~~~~~~~~~^
ggml.c:2192:65: warning: unused parameter ‘dst_stridelength_x’ [-Wunused-parameter]
 2192 |         const int rowlength_x, const int rowlength_y, const int dst_stridelength_x, const int dst_stridelength_y) {
      |                                                       ~~~~~~~~~~^~~~~~~~~~~~~~~~~~
ggml.c: In function ‘tensor_sum_elements’:
ggml.c:6745:23: warning: unused variable ‘p’ [-Wunused-variable]
 6745 |                 void *p = &((float *) tensor->data)[j*tensor->ne[0]+k];
      |                       ^
In file included from ggml.c:12:
ggml.c: In function ‘ggml_compute_forward_mul_mat_q_f32’:
ggml.c:6944:46: error: expected expression before ‘==’ token
 6944 |     assert((ir1-ir0) % EXPERIMENT_TILESIZE_X == 0);
      |                                              ^~
ggml.c:6946:41: error: expected expression before ‘;’ token
 6946 |     int x_stride = EXPERIMENT_TILESIZE_X;
      |                                         ^
ggml.c:6947:37: error: expected expression before ‘)’ token
 6947 |     if (ne11 < EXPERIMENT_TILESIZE_Y) {
      |                                     ^
ggml.c:6991:41: error: expected expression before ‘)’ token
 6991 |         if (ne11 < EXPERIMENT_TILESIZE_Y) {
      |                                         ^
ggml.c:6994:34: error: ‘ic’ undeclared (first use in this function); did you mean ‘i3’?
 6994 |             for (int64_t ic = 0; ic < ne11; ++ic) {
      |                                  ^~
      |                                  i3
ggml.c:7000:46: error: expected expression before ‘)’ token
 7000 |             if ((ne11 % EXPERIMENT_TILESIZE_Y) !=  0) {
      |                                              ^
ggml.c:7001:31: warning: format ‘%i’ expects argument of type ‘int’, but argument 2 has type ‘int64_t’ {aka ‘long int’} [-Wformat=]
 7001 |                 printf("ne11=%i\n",ne11);
      |                              ~^    ~~~~
      |                               |    |
      |                               int  int64_t {aka long int}
      |                              %li
ggml.c:7003:49: error: expected expression before ‘)’ token
 7003 |             assert((ne11 % EXPERIMENT_TILESIZE_Y) ==  0); // make sure we have a multiple of the tilesize
      |                                                 ^
ggml.c:7005:26: warning: declaration of ‘ic’ shadows previous non-variable [-Wshadow]
 7005 |             for (int64_t ic = 0; ic < ne11; ic+=EXPERIMENT_TILESIZE_Y) {
      |                          ^~
ggml.c:7005:70: error: expected expression before ‘)’ token
 7005 |             for (int64_t ic = 0; ic < ne11; ic+=EXPERIMENT_TILESIZE_Y) {
      |                                                                      ^
make: *** [Makefile:145: ggml.o] Error 1

@howard0su
Collaborator

I like your idea. Actually, it is a common optimization direction in BLAS libraries. I can help you run it, but my main dev box doesn't have AVX2, only AVX (a very old E5). If you can port it to AVX, I will help you collect data for sure.

@x02Sylvie

When compiling on Windows, MSVC says that m128i_u does not exist. I swapped it to m128i and it seems to compile fine (not sure if that's going to break something or not).

Other than that, I get crashes with thread counts other than 8. I usually went with 14, since 16 is somehow slower than 14 threads on my machine.

@diimdeep

diimdeep commented Apr 4, 2023

for mac

sysctl -a machdep.cpu > benchmark-results/cpuinfo.txt
MACHINE_ID=$(ioreg -rd1 -c IOPlatformExpertDevice | awk '/IOPlatformUUID/ { split($0, line, "\""); printf("%s\n", line[4]); }')
benchmark-main-threads-2-tilesize-1x1.txt:llama_print_timings:        eval time = 85992.81 ms /    99 runs   (  868.61 ms per run)
benchmark-main-threads-2-tilesize-1x2.txt:llama_print_timings:        eval time = 41981.91 ms /    99 runs   (  424.06 ms per run)
benchmark-main-threads-2-tilesize-1x8.txt:llama_print_timings:        eval time = 38906.24 ms /    99 runs   (  392.99 ms per run)
benchmark-main-threads-2-tilesize-2x1.txt:llama_print_timings:        eval time = 42767.35 ms /    99 runs   (  431.99 ms per run)
benchmark-main-threads-2-tilesize-2x2.txt:llama_print_timings:        eval time = 37795.00 ms /    99 runs   (  381.77 ms per run)
benchmark-main-threads-2-tilesize-2x8.txt:llama_print_timings:        eval time = 38267.93 ms /    99 runs   (  386.54 ms per run)
benchmark-main-threads-2-tilesize-4x1.txt:llama_print_timings:        eval time = 40852.97 ms /    99 runs   (  412.66 ms per run)
benchmark-main-threads-2-tilesize-4x2.txt:llama_print_timings:        eval time = 38172.83 ms /    99 runs   (  385.58 ms per run)
benchmark-main-threads-2-tilesize-4x8.txt:llama_print_timings:        eval time = 39465.25 ms /    99 runs   (  398.64 ms per run)
benchmark-main-threads-2-tilesize-8x1.txt:llama_print_timings:        eval time = 38391.29 ms /    99 runs   (  387.79 ms per run)
benchmark-main-threads-2-tilesize-8x2.txt:llama_print_timings:        eval time = 37773.38 ms /    99 runs   (  381.55 ms per run)
benchmark-main-threads-2-tilesize-8x8.txt:llama_print_timings:        eval time = 38036.29 ms /    99 runs   (  384.20 ms per run)

benchmark-main-threads-2-tilesize-1x4.txt: Building a website can be done in 10 simple steps:Assertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_qAssertion failed: ((ne11 % EXPER_f32, file ggml.c, line 7003.
benchmark-main-threads-2-tilesize-2x4.txt: Building a website can be done in 10 simple steps:Assertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_qAssertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_q_f32, file ggml.c, line 7003.
benchmark-main-threads-2-tilesize-4x4.txt: Building a website can be done in 10 simple steps:Assertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_qAssertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_q_f32, file ggml.c, line 7003.
benchmark-main-threads-2-tilesize-8x4.txt: Building a website can be done in 10 simple steps:Assertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), functioAssertion failed: ((ne11 % EXPERIMENT_TILESIZE_Y) == 0), function ggml_compute_forward_mul_mat_q_f32, file ggml.c, line 7003.

benchmark-results-7CB91888-2F90-5973-B733-F2A92C5F3C3C-20230404-081815.tgz

@SebastianApel
Contributor Author

SebastianApel commented Apr 4, 2023

@rabidcopy Thank you for running & sharing.
@x02Sylvie Thank you for trying!

Re: Crashes with thread counts 6 and 14

UPDATE: Fixed with 42ad59f

PREVIOUS TEXT:
The current implementation of the tiles requires the matrix size to be a multiple of the tile size. If you divide the matrix size by a thread count like 6, the resulting slice for that thread does not always have that property. Hence the abort on the assertion.

I assume this can be fixed.
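One way such a restriction is typically lifted (an editor's illustrative sketch with hypothetical names; the actual fix in 42ad59f may work differently) is to process a thread's row slice in full tiles and fall back to a one-row kernel for the remainder, so the slice no longer has to be a multiple of the tile size:

```c
/* Sketch: split the row range [ir0, ir1) into full tiles of TILE rows
 * plus a per-row remainder. kernel_tile / kernel_1row stand in for the
 * tiled and scalar dot-product kernels; here they just count calls. */
#define TILE 4

static int tile_calls   = 0;
static int scalar_calls = 0;

static void kernel_tile(int row0) { (void)row0; tile_calls++;   }
static void kernel_1row(int row)  { (void)row;  scalar_calls++; }

static void mul_mat_rows(int ir0, int ir1) {
    int ir = ir0;
    for (; ir + TILE <= ir1; ir += TILE)   /* full tiles */
        kernel_tile(ir);
    for (; ir < ir1; ir++)                 /* remainder: any slice size works */
        kernel_1row(ir);
}
```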

Would you be willing to share your eval times on the current master, ideally
a) with 2 threads (so it's comparable to this benchmark) and
b) with your "best" threads count (so we know what you can achieve)?

@SebastianApel
Contributor Author

SebastianApel commented Apr 4, 2023

I get tons of compiler errors when trying to build this branch. Compiling on Debian Linux. Fails with both GCC 12.2 and Clang 13.

@Ameobea Thanks for trying! The Makefile expected environment variables that are set in run_benchmarks.sh and failed without them.
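That explains the error pattern above: with the variables unset, the build passed `-D EXPERIMENT_TILESIZE_X=` with nothing after the `=`, which defines the macro as empty, so every use of it expands to nothing (e.g. an array bound disappears, giving the "incomplete element type" errors). A minimal illustration of the empty-definition behavior (editor's sketch, hypothetical macro names):

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification shows what a macro actually expands to. */
#define STR2(x) #x
#define STR(x) STR2(x)

#define EMPTY_TILESIZE      /* equivalent to passing -D EMPTY_TILESIZE=  */
#define GOOD_TILESIZE 8     /* equivalent to passing -D GOOD_TILESIZE=8  */
```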

I added defaults in a33cbbe to the Makefile so you can build from command line without the environment variables.
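A common pattern for such defaults (an editor's sketch; the actual commit a33cbbe may differ) is GNU make's `?=` conditional assignment, which keeps any value exported by the benchmark script while supplying a fallback for plain `make`:

```make
# Defaults so a bare `make` works; run_benchmarks.sh can still override.
EXPERIMENT_TILESIZE_X ?= 8
EXPERIMENT_TILESIZE_Y ?= 1

CFLAGS += -D EXPERIMENT_TILESIZE_X=$(EXPERIMENT_TILESIZE_X) \
          -D EXPERIMENT_TILESIZE_Y=$(EXPERIMENT_TILESIZE_Y)
```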

Would you be willing to re-try?

@SebastianApel
Contributor Author

for mac

@diimdeep Thank you for running & sharing. I'll look into the problem you've discovered.

@SebastianApel
Contributor Author

I like your idea. Actually, it is a common optimization direction in BLAS libraries. I can help you run it, but my main dev box doesn't have AVX2, only AVX (a very old E5). If you can port it to AVX, I will help you collect data for sure.

@howard0su Thanks for your feedback. I'll look into it, but I'm not sure how easy/hard an AVX port is, so no promises :-)

@rabidcopy
Contributor

rabidcopy commented Apr 4, 2023

Here's the benchmark command on master with 2 threads and then 6 threads.
./main -m ../alpaca-7b-native.bin -p "Building a website can be done in 10 simple steps:" -n 100 -t 2 --seed 1

llama_print_timings:        load time =  3407.37 ms
llama_print_timings:      sample time =    64.28 ms /   100 runs   (    0.64 ms per run)
llama_print_timings: prompt eval time =  5022.31 ms /    14 tokens (  358.74 ms per token)
llama_print_timings:        eval time = 41312.51 ms /    99 runs   (  417.30 ms per run)
llama_print_timings:       total time = 46921.83 ms

./main -m ../alpaca-7b-native.bin -p "Building a website can be done in 10 simple steps:" -n 100 -t 6 --seed 1

llama_print_timings:        load time =  2198.91 ms
llama_print_timings:      sample time =    64.25 ms /   100 runs   (    0.64 ms per run)
llama_print_timings: prompt eval time =  2438.03 ms /    14 tokens (  174.14 ms per token)
llama_print_timings:        eval time = 18675.86 ms /    99 runs   (  188.65 ms per run)
llama_print_timings:       total time = 21695.58 ms

Edit: Doing an extensive run through of tile size combinations to see which ones work with 6 threads.

@SebastianApel
Contributor Author

Here's the benchmark command on master with 2 threads and then 6 threads.

@rabidcopy Awesome, thank you.

I think I have fixed the thread=6 problem with 42ad59f.

@rabidcopy
Contributor

rabidcopy commented Apr 4, 2023

Well, here are my findings: running x = 1-12 and y = 1-12 with 6 threads, only these didn't abort. (Done before your fix.)

./benchmark-main-threads-6-tilesize-1x9.txt:llama_print_timings:        eval time = 18480.42 ms /    99 runs   (  186.67 ms per run)
./benchmark-main-threads-6-tilesize-1x1.txt:llama_print_timings:        eval time = 18696.10 ms /    99 runs   (  188.85 ms per run)
./benchmark-main-threads-6-tilesize-1x11.txt:llama_print_timings:        eval time = 19354.17 ms /    99 runs   (  195.50 ms per run)
./benchmark-main-threads-6-tilesize-1x8.txt:llama_print_timings:        eval time = 18542.98 ms /    99 runs   (  187.30 ms per run)
./benchmark-main-threads-6-tilesize-1x12.txt:llama_print_timings:        eval time = 18783.52 ms /    99 runs   (  189.73 ms per run)
./benchmark-main-threads-6-tilesize-1x2.txt:llama_print_timings:        eval time = 18442.82 ms /    99 runs   (  186.29 ms per run)
./benchmark-main-threads-6-tilesize-1x10.txt:llama_print_timings:        eval time = 18607.76 ms /    99 runs   (  187.96 ms per run)

@ggerganov
Member

Yes, improving the dot-product based matrix multiplication with a block based approach would be great.
Please open a new PR after rebasing on latest master and provide latest numbers that you obtain before and after the change.
