clip : Experimental support for Gemma 3 vision #12344
Merged
Conversation
ggerganov approved these changes on Mar 12, 2025
ngxson added a commit that referenced this pull request on Mar 12, 2025:

* llama : Add Gemma 3 text-only support
* fix python coding style
* fix compile on ubuntu
* python: fix style
* fix ubuntu compile
* fix build on ubuntu (again)
* fix ubuntu build, finally
* clip : Experimental support for Gemma 3 vision (#12344)
* clip : Experimental support for Gemma 3 vision
* fix build
* PRId64
Looking forward to seeing this in the master branch 👯

Any hope you'd be able to support this for the server?

@towel In the short term you may be able to use one of the downstream server projects; for example, KoboldCpp has integrated this implementation in its API server.
What is this?
Follow-up to the text-only PR: #12343
Note: Vision capability is available on these model sizes: 4b, 12b and 27b
This PR adds experimental support for Gemma 3 vision based on the existing `clip.cpp` infrastructure. A new binary, `llama-gemma3-cli`, is added as a playground that supports both chat mode and simple completion mode.

Important: This is not intended to be a production-ready product; it mostly acts as a demo. Please refer to #11292 for the future plan for vision support.
How to try this?
Step 1: Get the text model
See previous PR: #12343
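The exact command from Step 1 isn't reproduced on this page, but since the description below refers to "the `convert_hf_to_gguf.py` script above", a minimal sketch is shown here; the local checkpoint path `./gemma-3-4b-it` and the output name `model.gguf` are assumptions:

```sh
# Minimal sketch (paths and output name are assumptions):
# convert the Gemma 3 text model from HF format to GGUF.
python convert_hf_to_gguf.py ./gemma-3-4b-it --outfile model.gguf --outtype f16
```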
Step 2: Get the mmproj (multi-modal projection) model
Option 1: Download the pre-quantized version from HF: https://huggingface.co/collections/ggml-org/gemma-3-67d126315ac810df1ad9e913 (you must download both the text model and the `mmproj` file).
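As a hedged example, both files can be fetched with `huggingface-cli`; the repo and file names below are assumptions, so check the collection page for the actual ones:

```sh
# Hypothetical repo/file names; verify against the HF collection linked above.
huggingface-cli download ggml-org/gemma-3-4b-it-GGUF gemma-3-4b-it-Q4_K_M.gguf --local-dir .
huggingface-cli download ggml-org/gemma-3-4b-it-GGUF mmproj-model-f16.gguf --local-dir .
```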
Option 2: Convert it yourself. We will need the `model.gguf` generated from the `convert_hf_to_gguf.py` script above, plus the vision tower saved in `mmproj.gguf`.

First, get the `mmproj.gguf` file:
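The command itself isn't preserved on this page; the sketch below assumes the converter script added by this PR lives at `examples/llava/gemma3_convert_encoder_to_gguf.py` and is run from inside the HF model directory (both the script path and the directory name are assumptions):

```sh
# Assumed script path and model directory; adjust to your checkout.
cd gemma-3-4b-it
python /path/to/llama.cpp/examples/llava/gemma3_convert_encoder_to_gguf.py .
```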
Step 3: Compile and run

Clone this repo and compile `llama-gemma3-cli`:

```sh
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-gemma3-cli
```
Run it:
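The original run command isn't preserved here; below is a plausible invocation, with flag names assumed by analogy with the other llava-style CLIs (`-m` for the text model, `--mmproj` for the projector, `--image` for an input picture):

```sh
# Flag names are assumptions; run the binary with --help to see the actual options.
./build/bin/llama-gemma3-cli -m model.gguf --mmproj mmproj.gguf --image ./test-image.png
```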
Example output: