Fix for OpenCL / CLBlast builds on macOS. #1329

Merged 1 commit into ggml-org:master on May 5, 2023

Conversation

IonoclastBrigham
Contributor

On macOS we have to load OpenCL as a framework rather than just a library.

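For context, a minimal sketch of the macOS convention this refers to; these are the standard Apple include/link forms, not lines copied from this PR's diff:

```c
/* On macOS the OpenCL headers ship inside OpenCL.framework, not under CL/. */
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* The link step differs the same way:
 *   macOS:     -framework OpenCL
 *   elsewhere: -lOpenCL
 */
```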
@prusnak prusnak merged commit 2d13786 into ggml-org:master May 5, 2023
@swittk
Contributor

swittk commented May 5, 2023

Just wondering: I did get this to compile by adding CFLAGS and LDFLAGS pointing to the Homebrew installation of CLBlast (e.g. -I/opt/homebrew/Cellar/clblast/1.5.3_1/include), making a similar change in the Makefile, and setting LLAMA_CLBLAST=1 LLAMA_NO_ACCELERATE=1 (LLAMA_NO_ACCELERATE is needed because otherwise the OpenCL code such as ggml_cl_init is ifdef'd away), but the main executable has always crashed with the following error:

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: Apple Device: Apple M1 Max
OpenCL clCreateCommandQueue error -30 at ggml-opencl.c:229

I looked it up and -30 stands for CL_INVALID_VALUE, and I've seen reports of CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE not being supported on macOS.
So I'm not sure whether you encounter this too (and if not, what Mac you are using).
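For reference, a standalone probe (generic OpenCL code, not part of llama.cpp) that checks whether the first device on the first platform advertises out-of-order queue support at all; on macOS it builds with `cc probe.c -framework OpenCL`:

```c
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_command_queue_properties props = 0;

    /* Grab the first platform/device, matching Platform=0, Device=0 above. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_QUEUE_PROPERTIES, sizeof(props), &props, NULL);

    printf("out-of-order queues supported: %s\n",
           (props & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) ? "yes" : "no");
    return 0;
}
```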

@IonoclastBrigham
Contributor Author

IonoclastBrigham commented May 5, 2023 via email

@swittk
Contributor

swittk commented May 5, 2023

I've tried setting the properties argument in clCreateCommandQueue to 0 instead, and it builds and runs.
However, it appears that with this method (Accelerate off, CLBlast on, queue without out-of-order execution), prompt eval does get faster while generation time stays mostly the same. GPU usage spikes only during prompt eval; after that it's mostly CPU during generation.
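The change described here might look roughly like the sketch below; the helper name create_queue_with_fallback is hypothetical, and this is not the exact edit made at ggml-opencl.c:229:

```c
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* Try an out-of-order queue first; fall back to a plain in-order queue, which
 * is what Apple's OpenCL implementation accepts (error -30 seen above). */
static cl_command_queue create_queue_with_fallback(cl_context ctx, cl_device_id dev) {
    cl_int err = CL_SUCCESS;
    cl_command_queue queue = clCreateCommandQueue(
        ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
    if (queue == NULL || err != CL_SUCCESS) {
        queue = clCreateCommandQueue(ctx, dev, 0, &err);
    }
    return queue;
}
```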
I get the following timings when comparing with the same prompt and seed on Vicuna-13B on master-2d13786 (the generated text is the same).

CLBlast (make LLAMA_CLBLAST=1 LLAMA_NO_ACCELERATE=1)

llama_print_timings:        load time =   409.27 ms
llama_print_timings:      sample time =   351.65 ms /   468 runs   (    0.75 ms per run)
llama_print_timings: prompt eval time =  5177.26 ms /    60 tokens (   86.29 ms per token)
llama_print_timings:        eval time = 48286.11 ms /   468 runs   (  103.18 ms per run)
llama_print_timings:       total time = 76624.02 ms

Accelerate (standard make)

llama_print_timings:        load time =   736.65 ms
llama_print_timings:      sample time =   342.20 ms /   468 runs   (    0.73 ms per run)
llama_print_timings: prompt eval time =  5785.94 ms /    60 tokens (   96.43 ms per token)
llama_print_timings:        eval time = 50050.11 ms /   468 runs   (  106.94 ms per run)
llama_print_timings:       total time = 83148.13 ms

Generally it feels like the eval part is still using the CPU like before, but the prompt eval is running on the GPU and is slightly faster than Accelerate.

@Green-Sky
Collaborator

> Generally it feels like the eval part is still using the CPU like before, but the prompt eval is running on the GPU and is slightly faster than Accelerate.

Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). BLAS doesn't affect the normal generation performance.
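In other words, BLAS only kicks in when both dimensions of the matrix product are reasonably large, which is the case while a whole prompt batch is being evaluated but not during token-by-token generation. A rough illustrative sketch of that kind of size gate; the function name and the threshold of 32 come from the description above, not from the ggml source:

```c
#include <stdbool.h>

/* Illustrative only: dispatch to a BLAS backend (Accelerate/CLBlast/cuBLAS)
 * when both output dimensions are large, e.g. during prompt evaluation where
 * the batch holds many tokens. During generation the batch is a single token,
 * so the multiply degenerates to matrix-vector work on the regular CPU path. */
static bool use_blas_for_mul_mat(int out_rows, int batch_tokens) {
    return out_rows >= 32 && batch_tokens >= 32;
}
```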

@swittk
Contributor

swittk commented May 6, 2023

Ah. I see, thanks. Prompt processing speedup is still great I guess.

KerfuffleV2 pushed a commit to KerfuffleV2/llama.cpp that referenced this pull request May 6, 2023
@deep-pipeline

Hi @swittk, just wondering: have you tested the above with longer prompts etc.? It seems that others using GPU-enabled BLAS libraries with llama.cpp (cuBLAS, rocBLAS) are getting significant performance improvements, with more than just prompt eval being offloaded onto the GPU, so I'm wondering if/why a natively compiled CLBlast framework wouldn't be utilising the Apple Silicon GPU for a significant performance improvement with bigger batch sizes/prompts.

Is there some missing CLBlast-related switch which needs to be enabled for GPU/MPS use, e.g. the number of GPU threads to use? Does the Homebrew version of CLBlast properly engage Apple Silicon via MPS? I have a feeling that performance is being left on the table through underutilisation of Apple Silicon, and had hoped BLAS might pick up a bit more of it.

Any light you could shed on this would be appreciated. Thanks in advance.

@IonoclastBrigham
Contributor Author

IonoclastBrigham commented May 17, 2023 via email

You might need to select the device to run on with an env var (I don't recall the name offhand), but on my MacBook I have to set it to use device 2 to see the speedup from running on my Radeon Pro Vega 20.
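Since the variable name isn't recalled here, one way to at least see which platform/device indices exist is a small standalone listing program (generic OpenCL code, not part of llama.cpp):

```c
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main(void) {
    cl_platform_id platforms[8];
    cl_uint n_platforms = 0;
    clGetPlatformIDs(8, platforms, &n_platforms);

    for (cl_uint p = 0; p < n_platforms; p++) {
        char pname[128];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, NULL);

        cl_device_id devices[8];
        cl_uint n_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &n_devices);

        for (cl_uint d = 0; d < n_devices; d++) {
            char dname[128];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, NULL);
            printf("Platform %u (%s) / Device %u: %s\n", p, pname, d, dname);
        }
    }
    return 0;
}
```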

@deep-pipeline

Interesting, thanks @IonoclastBrigham. I guess you must be using a Thunderbolt eGPU enclosure for your AMD Radeon, so maybe you are on an Intel MacBook running an older macOS? I seem to recall reading there were serious headaches around OS updates, hardware updates and connectivity; what's the situation like now? Is it viable to house a GPU (Nvidia/AMD) in a Thunderbolt eGPU enclosure plugged into an Apple Silicon machine running the latest 13.3.x OS, or is it still a nightmare to link everything up?

As for the device env setting, I take it that's a CLBlast-related env variable? In your case, with an AMD device, does CLBlast then link to rocBLAS? Just trying to get my head around the order of the stack and linked libraries, and where any device setting might enable MPS.

Cheers!

