clip : Experimental support for Gemma 3 vision #12344


Merged
ngxson merged 6 commits into xsn/gemma3_text from xsn/gemma3_vision on Mar 12, 2025

Conversation


@ngxson ngxson commented Mar 12, 2025

What is this?

Follow-up to the text-only PR: #12343

Note: Vision capability is available for these model sizes: 4B, 12B, and 27B

This PR adds experimental support for Gemma 3 vision based on the existing clip.cpp infrastructure.

A new binary, llama-gemma3-cli, is added as a playground; it supports both chat mode and simple completion mode.

Important

Please note that this is not intended to be a production-ready product; it mostly acts as a demo. Please refer to #11292 for the future plan of vision support.

How to try this?

Step 1: Get the text model

See previous PR: #12343
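
For reference, a minimal conversion sketch, assuming the HF checkpoint lives in gemma-3-4b-it/ and that you use the standard convert_hf_to_gguf.py entry point with its --outfile option (see #12343 for the authoritative steps):

# sketch only: convert the text model to GGUF (paths are assumptions)
python convert_hf_to_gguf.py gemma-3-4b-it --outfile model.gguf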

Step 2: Get the mmproj (multi-modal projection) model

Option 1: Download the pre-quantized version from HF: https://huggingface.co/collections/ggml-org/gemma-3-67d126315ac810df1ad9e913

(You must download both the text model and the mmproj file.)
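
As an illustration, a hedged sketch using huggingface-cli; the repo and file names below are assumptions, so verify them against the collection page:

# hypothetical repo/file names -- check the HF collection linked above
huggingface-cli download ggml-org/gemma-3-4b-it-GGUF gemma-3-4b-it-Q4_K_M.gguf --local-dir .
huggingface-cli download ggml-org/gemma-3-4b-it-GGUF mmproj-model-f16.gguf --local-dir .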

Option 2: Convert it yourself

We will need the model.gguf generated by the convert_hf_to_gguf.py script above, plus the vision tower saved as mmproj.gguf

First, generate the mmproj.gguf file:

cd gemma-3-4b-it
python ~/work/llama.cpp-gemma/examples/llava/gemma3_convert_encoder_to_gguf.py .
# output file: mmproj.gguf

Step 3: Compile and run

Clone this repo and compile llama-gemma3-cli:

cd llama.cpp
cmake -B build
cmake --build build -j --target llama-gemma3-cli

Run it:

./build/bin/llama-gemma3-cli -m model.gguf --mmproj mmproj.gguf

Example output:

 Running in chat mode, available commands:
   /image <path>    load an image
   /clear           clear the chat history
   /quit or /exit   exit the program

> hi    
Hello! How's it going today? 

Is there something specific on your mind, or were you simply saying hi? 😊 

I’m here to chat, answer questions, help with creative tasks, or just listen – whatever you need!

> /image ../models/bliss.png
Encoding image ../models/bliss.png

> what is that
That's a beautiful image!
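
For non-interactive use, a sketch of simple completion mode, assuming the CLI accepts the usual llama.cpp -p and --image flags (verify with --help):

# hypothetical one-shot invocation; flag names are assumptions
./build/bin/llama-gemma3-cli -m model.gguf --mmproj mmproj.gguf \
  --image ../models/bliss.png -p "Describe this image in one sentence."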

@github-actions github-actions bot added the examples and python labels Mar 12, 2025
@ngxson ngxson requested review from ggerganov and slaren and removed request for ggerganov March 12, 2025 06:47
@ngxson ngxson merged commit afcc335 into xsn/gemma3_text Mar 12, 2025
44 of 47 checks passed
ngxson added a commit that referenced this pull request Mar 12, 2025
* llama : Add Gemma 3 text-only support

* fix python coding style

* fix compile on ubuntu

* python: fix style

* fix ubuntu compile

* fix build on ubuntu (again)

* fix ubuntu build, finally

* clip : Experimental support for Gemma 3 vision (#12344)

* clip : Experimental support for Gemma 3 vision

* fix build

* PRId64
ishaangandhi pushed a commit to ishaangandhi/llama.cpp that referenced this pull request Mar 12, 2025
@jkbe jkbe mentioned this pull request Mar 12, 2025
jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
@liyimeng

Looking forward to seeing this in the master branch 👯


towel commented Mar 19, 2025

Any hope you'd be able to support this for the server?


henk717 commented Mar 19, 2025

@towel Short term, you may be able to use one of the downstream server projects; for example, KoboldCpp has integrated this implementation in its API server.

@ngxson ngxson deleted the xsn/gemma3_vision branch May 1, 2025 20:42