-
Notifications
You must be signed in to change notification settings - Fork 11.5k
DRAFT: AVX2: Potential additional performance increases #749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Running the benchmarks now. Will share results when it finishes. Edit: Here it is. Several combinations seemed to crash. Here's the top ones to make it easier to parse through which combinations were the fastest on my machine. Going to go with 8x1 and compare speed with current master. Edit: Ah, doesn't really seem to work with thread counts other than 2/4/8 for me without crashing. 6 being the sweet spot on my 6 core/12 thread cpu.
benchmark-results-023ced9dd49b4aabacdad4eb281af83a-20230403-165819.zip |
I get tons of compiler errors when trying to build this branch. Compiling on Debian Linux. Fails with both GCC 12.2 and Clang 13. Compiler Output
|
I like your idea. Actually it is more common optimization direction in the bias lib. I can help you run but my main dev box doesn't have AVX2 only AVX. (A very old E5). If you can port to AVX, I will help you collect data for sure. |
when compiling on Windows, MSVC says that m128i_u does not exist, I swapped it to m128i and it seems to compile fine (not sure if thats gonna break something or not). Other than that, I get crashes with thread count other than 8, I usually went with 14 since 16 is somehow slower than 14 threads on my machine |
for mac
benchmark-results-7CB91888-2F90-5973-B733-F2A92C5F3C3C-20230404-081815.tgz |
@rabidcopy Thank you for running & sharing. Re: Creashes with thread count 6 and 14 UPDATE: Fixed with 42ad59f PREVIOUS TEXT:
Would you be willing to share your eval times on the current master, ideally |
@Ameobea Thanks for trying! The Makefile expected environment variables that are set in run_benchmark.sh and failed without them. I added defaults in a33cbbe to the Makefile so you can build from command line without the environment variables. Would you be willing to re-try? |
@diimdeep Thank you for running & sharing. I'll look into the problem you've discovered. |
@howard0su Thanks for your feedback. I'll look into it, but I'm not sure how easy/hard an AVX port is, so no promisses :-) |
Here's the benchmark command on master with 2 threads and then 6 threads.
Edit: Doing an extensive run through of tile size combinations to see which ones work with 6 threads. |
…is not a multiple of TILESIZE_X
@rabidcopy Awesome, thank you. I think I have fixed the thread=6 problem with 42ad59f. |
Well here's my findings, running 1-12 = x and 1-12 = y with 6 threads, only these didn't abort. (done before your fix)
|
Yes, improving the dot-product based matrix multiplication with a block based approach would be great. |
Hi @sw, @rabidcopy, @ggerganov, @Ameobea, @howard0su and everybody else interested,
Would you be willing to give me some feedback on this PR?
PLEASE NOTE: The PR is still in EARLY DRAFT / experimental and NOT ready to be merged yet. It needs significant cleanup.
However, I personally think it is promising and worth a discussion.
The modifications in this PR change the matrix multiplication from the "dot vector" approach to something similar to "tiled matrix multiplication" approach.
The "tiled matrix multiplication" is supposed to be more cache efficient than the "dot vector" approach.
The good news:
My questions are:
As I said, the code in it's current form is NOT ready to be merged.
But before I spend more time on it, I would appreciate some feedback/thoughts on your side.
HOWTO run the benchmarks / test cases (Linux only):