A model for Vision-language Understanding with Advanced Large Language Models #1050
-
This is another model with similar functionality: https://llava-vl.github.io/
-
I would personally love this. I'm blind and rely on accessibility tools to use computers of any kind. The only platform that has any kind of on-device image recognition at all is Apple, with iOS and macOS, and even those are very primitive. Having access to something like this on my own device without needing a massive GPU would be huge. So huge I can't really explain how big this actually is. If it is at all possible, I would very, very much appreciate this working.
-
I would love to help work on this, people! Also having someone with an eye disease in the family, this could be immensely valuable. CLIP does not seem very heavy, so with llama.cpp this could hopefully run on a cellphone. My ML knowledge is unfortunately rudimentary; as a first step I tried rebuilding the MiniGPT-4 demo, forcing it to 'mps' so it runs on an M1 Mac. It seems to come a fair way:

    python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --device mps
    ===================================BUG REPORT===================================
    Loading checkpoint shards: 0%|
    AssertionError: Torch not compiled with CUDA enabled

The full output for reference: https://app.warp.dev/block/iieg5EQ8V4qGTuPkHf44FK

Here it hangs, so the failure is in the checkpoint loading (seemingly using bitsandbytes). I believe bitsandbytes is used to quantize the model from 16-bit down to 8-bit, and this version of it needs CUDA. Or maybe it is just the plain model load; this is unclear to me. Possibly an 8-bit version could be provided from the start, or maybe this step could be replaced with an mps-friendly / GGML approach? Anyone with experience in this regard? I found the Google Colab the quickest way to download all the needed models. Could the model also be brought down to 4-bit? It would be much appreciated if someone with knowledge could shine their light on this. Happy to share my adjusted demo.py that forces 'mps'.
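For reference, here is a minimal sketch of the kind of workaround being described: pick 'mps' when CUDA is unavailable and load the language-model half in fp16 instead of through bitsandbytes' 8-bit path (which requires CUDA). The checkpoint path and the use of plain Hugging Face transformers are assumptions for illustration only; MiniGPT-4's real loader is wired up through its eval_configs YAML, so treat this as a sketch rather than a drop-in patch.

```python
# Sketch (assumptions noted): load a Vicuna/LLaMA-style checkpoint in fp16 and
# move it to 'mps', avoiding the bitsandbytes 8-bit path that requires CUDA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple 'mps', then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
model_path = "path/to/vicuna-7b"  # placeholder path, not an official checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,   # fp16 instead of load_in_8bit=True (bitsandbytes needs CUDA)
    low_cpu_mem_usage=True,
)
model.to(device)
model.eval()
```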
-
#1910 is adding some of the needed code.
-
This model: https://github.com/Vision-CAIR/MiniGPT-4
seems like a good candidate for implementation in ggml, as it mimics the capabilities of GPT-4 in terms of image interpretation.
It relies on BLIP-2 as its visual encoder, though I cannot tell whether BLIP-2 has a structure that can be easily implemented in ggml.
Thank you for all the great work!
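For anyone sizing up the ggml work: the overall wiring in MiniGPT-4 is a frozen visual encoder (BLIP-2's ViT + Q-Former) producing a small set of image embeddings, a single linear layer projecting them into the LLM's embedding space, and those projected tokens being concatenated with the text embeddings before the frozen Vicuna/LLaMA model runs as usual. The sketch below illustrates only that projection-and-concatenation step; the dimensions and the random tensors standing in for the real encoders are assumptions for illustration, not the actual MiniGPT-4 code.

```python
# Sketch of the MiniGPT-4-style glue between the visual encoder and the LLM.
# Dimensions and the random stand-in tensors are assumptions for illustration.
import torch
import torch.nn as nn

BATCH = 1
NUM_QUERY_TOKENS = 32   # Q-Former query tokens (assumed)
QFORMER_DIM = 768       # Q-Former output width (assumed)
LLM_DIM = 4096          # LLaMA-7B hidden size (assumed)

# Stand-ins for the frozen encoders and the LLM's embedded prompt.
image_feats = torch.randn(BATCH, NUM_QUERY_TOKENS, QFORMER_DIM)  # from ViT + Q-Former
text_embeds = torch.randn(BATCH, 16, LLM_DIM)                    # embedded prompt tokens

# The trained piece in this scheme: one linear projection into LLM space.
proj = nn.Linear(QFORMER_DIM, LLM_DIM)

image_embeds = proj(image_feats)                               # (1, 32, 4096)
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # (1, 48, 4096)

# inputs_embeds would then be passed to the frozen LLM, e.g. via
# model(inputs_embeds=inputs_embeds) in Hugging Face transformers.
print(inputs_embeds.shape)
```

If that picture is right, the LLM half maps onto existing llama.cpp code, and the new ggml work would mostly be the ViT/Q-Former encoder plus this projection layer.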