Handle default marker insertion for LFM2
tdakhran committed Nov 28, 2025
commit 1fea2d1b068ceeed3fccebac21221ceaf68b1589
4 changes: 2 additions & 2 deletions tools/mtmd/mtmd-cli.cpp
@@ -313,7 +313,7 @@ int main(int argc, char ** argv) {
g_is_generating = true;
if (params.prompt.find(mtmd_default_marker()) == std::string::npos) {
for (size_t i = 0; i < params.image.size(); i++) {
-params.prompt += mtmd_default_marker();
+params.prompt = mtmd::mtmd_add_default_marker(ctx.ctx_vision.get(), params.prompt);
}
}
common_chat_msg msg;
@@ -378,7 +378,7 @@ int main(int argc, char ** argv) {
std::string media_path = line.substr(7);
if (ctx.load_media(media_path)) {
LOG("%s %s loaded\n", media_path.c_str(), is_image ? "image" : "audio");
-content += mtmd_default_marker();
+content = mtmd::mtmd_add_default_marker(ctx.ctx_vision.get(), content);
}
// else, error is already printed by libmtmd
continue;
9 changes: 9 additions & 0 deletions tools/mtmd/mtmd.cpp
@@ -1103,3 +1103,12 @@ void mtmd_log_set(ggml_log_callback log_callback, void * user_data) {
g_logger_state.log_callback = log_callback ? log_callback : clip_log_callback_default;
g_logger_state.log_callback_user_data = user_data;
}
+
+std::string mtmd::mtmd_add_default_marker(mtmd_context *ctx, const std::string &str) {
+    // for LFM2 image embeddings positioned before the text
+    if (ctx && ctx->ctx_v && clip_get_projector_type(ctx->ctx_v) == PROJECTOR_TYPE_LFM2) {
+        return mtmd_default_marker() + str;
+    }
+
+    return str + mtmd_default_marker();
+}
3 changes: 3 additions & 0 deletions tools/mtmd/mtmd.h
@@ -299,6 +299,9 @@ struct input_chunks {
}
};

+// insert mtmd_default_marker() into given string, position depends on the projector
+std::string mtmd_add_default_marker(mtmd_context *ctx, const std::string &str);
Collaborator

I'd prefer to remove this API because:

  • It's not compatible with a pure-C ABI
  • The ordering is actually controlled by users via the API. This function only changes llama-mtmd-cli and makes no changes to llama-server

Collaborator

A function like mtmd_get_image_placement could be a better solution. It would return one of these three values, which should cover all possible use cases:

IMAGE_PLACEMENT_NONE, // place images freely inside the message
IMAGE_PLACEMENT_BEGIN, // place images at the beginning of the message
IMAGE_PLACEMENT_END, // place images at the end of the message

But IMO this would be better done as a dedicated refactoring.
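
A minimal sketch of how such a placement query might drive marker insertion. The enum values follow the comment above; the `image_placement` type name, the `add_marker` helper, and the hard-coded marker string are assumptions made for illustration and are not part of libmtmd.

```cpp
#include <string>

// Hypothetical enum mirroring the three values proposed above.
enum image_placement {
    IMAGE_PLACEMENT_NONE,  // place images freely inside the message
    IMAGE_PLACEMENT_BEGIN, // place images at the beginning of the message
    IMAGE_PLACEMENT_END,   // place images at the end of the message
};

// Stand-in for mtmd_default_marker(); the value is assumed here.
static const char * default_marker() { return "<__media__>"; }

// Hypothetical helper: add a marker only when the prompt does not
// already contain one, so explicit user placement always wins.
static std::string add_marker(image_placement placement, const std::string & prompt) {
    const std::string marker = default_marker();
    if (prompt.find(marker) != std::string::npos) {
        return prompt; // the user already chose a position
    }
    switch (placement) {
        case IMAGE_PLACEMENT_BEGIN:
            return marker + prompt;
        case IMAGE_PLACEMENT_END:
        case IMAGE_PLACEMENT_NONE:
            // with free placement, this sketch falls back to appending
            return prompt + marker;
    }
    return prompt + marker;
}
```

Because the query would return a plain enum, it could be exposed through a C-compatible interface, and both the CLI and the server could decide for themselves how to act on it.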

Contributor Author

I'll remove it from this PR. For now, passing the placeholder directly, as in -p "<__media__>OCR.", achieves the same result. But I wanted the placement to be correct by default.

Contributor Author

My understanding is that for the server, the order follows the order of the data in the request, while the CLI was using content += mtmd_default_marker();.

Collaborator

For the CLI, it was done this way because you can add multiple images in CLI mode, like this:

> This is the first step:
> /image step1.png
> Then the next step:
> /image step2.png
> What do you see?

In this case, we expect the images and text prompts to appear in exactly the same order as in the input.
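
The ordering requirement can be sketched as follows. This is a simplified illustration, not the actual CLI loop: `chat_turn`, `build_turn`, and the marker value are assumptions, and input lines are joined without separators for brevity.

```cpp
#include <string>
#include <vector>

// Hypothetical container for one accumulated chat turn.
struct chat_turn {
    std::string content;            // text with markers in input order
    std::vector<std::string> media; // media paths, same order as the markers
};

// Walk the input lines in order. An "/image <path>" line records the path
// and appends a marker at the current position; any other line is text.
static chat_turn build_turn(const std::vector<std::string> & lines) {
    const std::string marker = "<__media__>"; // assumed marker value
    const std::string cmd    = "/image ";
    chat_turn turn;
    for (const auto & line : lines) {
        if (line.compare(0, cmd.size(), cmd) == 0) {
            turn.media.push_back(line.substr(cmd.size()));
            turn.content += marker;
        } else {
            turn.content += line;
        }
    }
    return turn;
}
```

Appending the marker at the point where each `/image` command appears keeps the n-th marker aligned with the n-th loaded image, which a prepend-by-default rule would break for interleaved input.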


} // namespace mtmd

#endif