clip : Experimental support for Gemma 3 vision #12344


Merged
ngxson merged 6 commits into xsn/gemma3_text from xsn/gemma3_vision on Mar 12, 2025

Conversation


@ngxson ngxson commented Mar 12, 2025

What is this?

Follow-up to the text-only PR: #12343

Note: Vision capability is available for these model sizes: 4B, 12B, and 27B

This PR adds experimental support for Gemma 3 vision based on the existing clip.cpp infrastructure.

A new binary, llama-gemma3-cli, is added as a playground; it supports both chat mode and simple completion mode.

Important

Please note that this is not intended to be a production-ready product; it mostly acts as a demo. Please refer to #11292 for the future plan of vision support.

How to try this?

Step 1: Get the text model

See previous PR: #12343
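
For reference, a minimal conversion sketch, assuming the HF checkpoint lives in gemma-3-4b-it/ and that you use the standard convert_hf_to_gguf.py entry point with its --outfile option (see #12343 for the authoritative steps):

# sketch only: convert the text model to GGUF (paths are assumptions)
python convert_hf_to_gguf.py gemma-3-4b-it --outfile model.gguf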

Step 2: Get the mmproj (multi-modal projection) model

Option 1: Download the pre-quantized version from HF: https://huggingface.co/collections/ggml-org/gemma-3-67d126315ac810df1ad9e913

(You must download both the text model and the mmproj file.)
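
As an illustration, a hedged sketch using huggingface-cli; the repo and file names below are assumptions, so verify them against the collection page:

# hypothetical repo/file names -- check the HF collection linked above
huggingface-cli download ggml-org/gemma-3-4b-it-GGUF gemma-3-4b-it-Q4_K_M.gguf --local-dir .
huggingface-cli download ggml-org/gemma-3-4b-it-GGUF mmproj-model-f16.gguf --local-dir .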

Option 2: Convert it yourself

We will need the model.gguf generated by the convert_hf_to_gguf.py script above, plus the vision tower saved as mmproj.gguf

First, generate the mmproj.gguf file:

cd gemma-3-4b-it
python ~/work/llama.cpp-gemma/examples/llava/gemma3_convert_encoder_to_gguf.py .
# output file: mmproj.gguf

Step 3: Compile and run

Clone this repo and compile llama-gemma3-cli:

cd llama.cpp
cmake -B build
cmake --build build -j --target llama-gemma3-cli

Run it:

./build/bin/llama-gemma3-cli -m model.gguf --mmproj mmproj.gguf

Example output:

 Running in chat mode, available commands:
   /image <path>    load an image
   /clear           clear the chat history
   /quit or /exit   exit the program

> hi    
Hello! How's it going today? 

Is there something specific on your mind, or were you simply saying hi? 😊 

I’m here to chat, answer questions, help with creative tasks, or just listen – whatever you need!

> /image ../models/bliss.png
Encoding image ../models/bliss.png

> what is that
That's a beautiful image!
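
For non-interactive use, a sketch of simple completion mode, assuming the CLI accepts the usual llama.cpp -p and --image flags (verify with --help):

# hypothetical one-shot invocation; flag names are assumptions
./build/bin/llama-gemma3-cli -m model.gguf --mmproj mmproj.gguf \
  --image ../models/bliss.png -p "Describe this image in one sentence."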

@github-actions github-actions bot added the examples and python labels Mar 12, 2025
@ngxson ngxson requested review from ggerganov and slaren and removed request for ggerganov March 12, 2025 06:47
@ngxson ngxson merged commit afcc335 into xsn/gemma3_text Mar 12, 2025
44 of 47 checks passed
ngxson added a commit that referenced this pull request Mar 12, 2025
* llama : Add Gemma 3 text-only support

* fix python coding style

* fix compile on ubuntu

* python: fix style

* fix ubuntu compile

* fix build on ubuntu (again)

* fix ubuntu build, finally

* clip : Experimental support for Gemma 3 vision (#12344)

* clip : Experimental support for Gemma 3 vision

* fix build

* PRId64
ishaangandhi pushed a commit to ishaangandhi/llama.cpp that referenced this pull request Mar 12, 2025
@jkbe jkbe mentioned this pull request Mar 12, 2025
jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
@liyimeng

Looking forward to seeing this in the master branch 👯


towel commented Mar 19, 2025

Any hope you'd be able to support this for the server?


henk717 commented Mar 19, 2025

@towel Short term, you may be able to use one of the downstream server projects; for example, KoboldCpp has integrated this implementation in its API server.

@ngxson ngxson deleted the xsn/gemma3_vision branch May 1, 2025 20:42