A model for Vision-language Understanding with Advanced Large Language Models #1050
-
This is another model with similar functionality: https://llava-vl.github.io/
-
I would personally love this. I'm blind and rely on accessibility tools to use computers of any kind. The only platform that has any kind of on-device image recognition at all is Apple, with iOS and macOS, and even those are very primitive. Having access to something like this on my own device without needing a massive GPU would be huge. So huge I can't really explain how big this actually is. If it is at all possible, I would very, very much appreciate this working.
-
I would love to help work on this, people! Also having someone with an eye disease in the family, this could be immensely valuable. CLIP does not seem very heavy, so with llama.cpp this could hopefully run on a cellphone. My ML knowledge is unfortunately rudimentary; as a first step I tried rebuilding the MiniGPT-4 demo, forcing it to 'mps' so it runs on an M1 Mac. It seems to come a fair way:

    python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --device mps
    ===================================BUG REPORT===================================
    Loading checkpoint shards: 0%|
    AssertionError: Torch not compiled with CUDA enabled

The full output for reference: https://app.warp.dev/block/iieg5EQ8V4qGTuPkHf44FK

Here it hangs, so the failure is in the checkpoint loading (seemingly using bitsandbytes). I believe bitsandbytes is used to quantize the model from 16-bit down to 8-bit, and this version of it needs CUDA. Or maybe it is just the plain model load; this is unclear to me. Possibly an 8-bit version could be provided from the start, or maybe this step could be replaced with an mps-friendly / GGML approach? Anyone with experience in this regard? I found the Google Colab the quickest way to download all the needed models. Could the model also be brought down to 4-bit? It would be much appreciated if someone with knowledge could shine their light on this. Happy to share my adjusted demo.py that forces 'mps'.
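For reference, here is a minimal sketch of the kind of workaround being described: pick 'mps' when CUDA is unavailable and load the language-model half in fp16 instead of through bitsandbytes' 8-bit path (which requires CUDA). The checkpoint path and the use of plain Hugging Face transformers are assumptions for illustration only; MiniGPT-4's real loader is wired up through its eval_configs YAML, so treat this as a sketch rather than a drop-in patch.

```python
# Sketch (assumptions noted): load a Vicuna/LLaMA-style checkpoint in fp16 and
# move it to 'mps', avoiding the bitsandbytes 8-bit path that requires CUDA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple 'mps', then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
model_path = "path/to/vicuna-7b"  # placeholder path, not an official checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,   # fp16 instead of load_in_8bit=True (bitsandbytes needs CUDA)
    low_cpu_mem_usage=True,
)
model.to(device)
model.eval()
```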
-
#1910 is adding some of the needed code.
-
This model: https://github.com/Vision-CAIR/MiniGPT-4
seems like a good candidate for implementation in ggml, as it mimics the capabilities of GPT-4 in terms of image interpretation.
It relies on BLIP-2 as its visual encoder, though I cannot tell whether BLIP-2 has a structure that can be easily implemented in ggml.
Thank you for all the great work!
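For anyone sizing up the ggml work: the overall wiring in MiniGPT-4 is a frozen visual encoder (BLIP-2's ViT + Q-Former) producing a small set of image embeddings, a single linear layer projecting them into the LLM's embedding space, and those projected tokens being concatenated with the text embeddings before the frozen Vicuna/LLaMA model runs as usual. The sketch below illustrates only that projection-and-concatenation step; the dimensions and the random tensors standing in for the real encoders are assumptions for illustration, not the actual MiniGPT-4 code.

```python
# Sketch of the MiniGPT-4-style glue between the visual encoder and the LLM.
# Dimensions and the random stand-in tensors are assumptions for illustration.
import torch
import torch.nn as nn

BATCH = 1
NUM_QUERY_TOKENS = 32   # Q-Former query tokens (assumed)
QFORMER_DIM = 768       # Q-Former output width (assumed)
LLM_DIM = 4096          # LLaMA-7B hidden size (assumed)

# Stand-ins for the frozen encoders and the LLM's embedded prompt.
image_feats = torch.randn(BATCH, NUM_QUERY_TOKENS, QFORMER_DIM)  # from ViT + Q-Former
text_embeds = torch.randn(BATCH, 16, LLM_DIM)                    # embedded prompt tokens

# The trained piece in this scheme: one linear projection into LLM space.
proj = nn.Linear(QFORMER_DIM, LLM_DIM)

image_embeds = proj(image_feats)                               # (1, 32, 4096)
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # (1, 48, 4096)

# inputs_embeds would then be passed to the frozen LLM, e.g. via
# model(inputs_embeds=inputs_embeds) in Hugging Face transformers.
print(inputs_embeds.shape)
```

If that picture is right, the LLM half maps onto existing llama.cpp code, and the new ggml work would mostly be the ViT/Q-Former encoder plus this projection layer.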