Emu3 is a native multimodal world model released by the Beijing Academy of Artificial Intelligence (BAAI).
What sets Emu3 apart is its ability to process and understand multiple data types, including text, images, and video, with unified input and output across these modalities.
At its core, Emu3 relies on next-token prediction, an autoregressive approach in which the model is trained to predict the next element in a sequence, whether that element belongs to text, an image, or a video (a minimal sketch of this idea follows the repository link below).
To handle large-scale datasets, Emu3 combines tensor parallelism, context parallelism, and data parallelism to make efficient use of compute resources.
Emu3 has broad application potential and can be used in scenarios such as automatic image captioning, text-based image editing, and video synthesis.
GitHub repository: https://github.com/baaivision/Emu3
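To make the next-token-prediction idea concrete, here is a toy sketch, not Emu3 code: TinyAutoregressiveLM is a hypothetical stand-in for the real transformer, and the random token tensor stands in for a sequence where text, image, and video have already been tokenized into one shared discrete vocabulary. It only illustrates the shift-by-one training objective that applies uniformly across modalities.

import torch
import torch.nn.functional as F

class TinyAutoregressiveLM(torch.nn.Module):
    # Stand-in for a causal transformer: maps each token id to logits
    # over the shared vocabulary. A real model would also attend to
    # all previous positions.
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.head(self.embed(tokens))

vocab_size = 1024                                  # shared text/image/video vocabulary (toy size)
model = TinyAutoregressiveLM(vocab_size)
tokens = torch.randint(0, vocab_size, (1, 16))     # stand-in multimodal token sequence

# The model sees tokens[:, :-1] and is trained to predict tokens[:, 1:],
# regardless of which modality each position originally came from.
logits = model(tokens[:, :-1])
targets = tokens[:, 1:]
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())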
I. Environment Setup
1. Python environment
Python 3.10 or later is recommended.
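A quick way to confirm the interpreter meets this requirement:

python -c "import sys; assert sys.version_info >= (3, 10), sys.version"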
2. pip package installation
pip install torch==2.2.1+cu118 torchvision==0.17.1+cu118 torchaudio==2.2.1 --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
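After installation, a short sanity check confirms that the CUDA build of PyTorch is active (all calls below are standard PyTorch APIs). Note that the attn_implementation="flash_attention_2" option used in the test code further requires the flash-attn package, typically installed with pip install flash-attn --no-build-isolation.

import torch

# Confirm the CUDA build is installed and a GPU is visible.
print(torch.__version__)           # expected: 2.2.1+cu118
print(torch.cuda.is_available())   # expected: True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))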
3. Download the Emu3-Chat model:
git lfs install
git clone https://www.modelscope.cn/BAAI/Emu3-Chat.git
4. Download the Emu3-Gen model:
git lfs install
git clone https://www.modelscope.cn/BAAI/Emu3-Gen.git
5. Download the Emu3-VisionTokenizer model:
git lfs install
git clone https://www.modelscope.cn/BAAI/Emu3-VisionTokenizer.git
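If git lfs is inconvenient, the same repositories can usually be fetched through ModelScope's Python API instead. This is a sketch assuming the modelscope package is installed (pip install modelscope); the cache_dir path is only an example.

from modelscope import snapshot_download

# Download each repository into a local cache directory (example path).
for repo in ("BAAI/Emu3-Chat", "BAAI/Emu3-Gen", "BAAI/Emu3-VisionTokenizer"):
    local_dir = snapshot_download(repo, cache_dir="./models")
    print(repo, "->", local_dir)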
II. Functional Testing
1. Running the test:
(1) Python invocation test
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
from transformers.generation.configuration_utils import GenerationConfig
import torch

from emu3.mllm.processing_emu3 import Emu3Processor


class Emu3ImageDescriber:
    def __init__(self, model_path, vq_path, device="cuda:0"):
        self.device = device

        # Load models and processor
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            device_map=device,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
            trust_remote_code=True,
        ).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, padding_side="left")
        self.image_processor = AutoImageProcessor.from_pretrained(vq_path, trust_remote_code=True)
        self.image_tokenizer = AutoModel.from_pretrained(vq_path, device_map=device, trust_remote_code=True).eval()
        self.processor = Emu3Processor(self.image_processor, self.image_tokenizer, self.tokenizer)

    def describe_image(self, image_path, prompt="Please describe the image", max_new_tokens=1024):
        try:
            # Open image (batched twice, matching the upstream example)
            image = Image.open(image_path)
            text = [prompt] * 2
            images = [image] * 2

            # Process inputs ('U' = understanding mode)
            inputs = self.processor(
                text=text,
                image=images,
                mode='U',
                padding_image=True,
                padding="longest",
                return_tensors="pt",
            )

            # Generation configuration
            generation_config = GenerationConfig(
                pad_token_id=self.tokenizer.pad_token_id,
                bos_token_id=self.tokenizer.bos_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
            )

            # Generate outputs
            outputs = self.model.generate(
                inputs.input_ids.to(self.device),
                generation_config=generation_config,
                max_new_tokens=max_new_tokens,
                attention_mask=inputs.attention_mask.to(self.device),
            )

            # Decode only the newly generated tokens
            outputs = outputs[:, inputs.input_ids.shape[-1]:]
            answers = self.processor.batch_decode(outputs, skip_special_tokens=True)
            return answers
        except Exception as e:
            return str(e)


if __name__ == "__main__":
    # Paths
    EMU_HUB = "BAAI/Emu3-Chat"
    VQ_HUB = "BAAI/Emu3-VisionTokenizer"
    image_path = "assets/demo.png"

    # Initialize and execute
    describer = Emu3ImageDescriber(EMU_HUB, VQ_HUB)
    descriptions = describer.describe_image(image_path)

    # Print results
    for desc in descriptions:
        print(desc)
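(2) Text-to-image generation test (sketch)
The test above exercises Emu3-Chat (vision-language understanding). For completeness, here is a text-to-image sketch for Emu3-Gen, adapted from the project README; helper names such as build_prefix_constrained_fn, the mode='G' flag, and the sampling hyperparameters come from that README and should be verified against the repository before use. The empty negative prompt as the unconditional branch is an assumption here.

from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
from transformers.generation.configuration_utils import GenerationConfig
from transformers.generation import LogitsProcessorList, PrefixConstrainedLogitsProcessor, UnbatchedClassifierFreeGuidanceLogitsProcessor
import torch

from emu3.mllm.processing_emu3 import Emu3Processor

EMU_HUB = "BAAI/Emu3-Gen"
VQ_HUB = "BAAI/Emu3-VisionTokenizer"

# Load the generation model and vision tokenizer, mirroring the setup above.
model = AutoModelForCausalLM.from_pretrained(
    EMU_HUB,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()
processor = Emu3Processor(image_processor, image_tokenizer, tokenizer)

# Prompt and classifier-free guidance scale (example values from the README).
prompt = "a portrait of young girl. masterpiece, film grained, best quality."
classifier_free_guidance = 3.0

kwargs = dict(mode='G', ratio="1:1", image_area=model.config.image_area, return_tensors="pt")
pos_inputs = processor(text=prompt, **kwargs)
neg_inputs = processor(text="", **kwargs)  # unconditional branch for guidance (assumed empty prompt)

generation_config = GenerationConfig(
    use_cache=True,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    max_new_tokens=40960,
    do_sample=True,
    top_k=2048,
)

# Constrain decoding to a valid image-token layout of the requested size.
h, w = pos_inputs.image_size[0]
constrained_fn = processor.build_prefix_constrained_fn(h, w)
logits_processor = LogitsProcessorList([
    UnbatchedClassifierFreeGuidanceLogitsProcessor(
        classifier_free_guidance, model,
        unconditional_ids=neg_inputs.input_ids.to("cuda:0"),
    ),
    PrefixConstrainedLogitsProcessor(constrained_fn, num_beams=1),
])

outputs = model.generate(
    pos_inputs.input_ids.to("cuda:0"),
    generation_config,
    logits_processor=logits_processor,
    attention_mask=pos_inputs.attention_mask.to("cuda:0"),
)

# Decode the generated token stream back into images and save them.
for idx, im in enumerate(processor.decode(outputs[0])):
    if isinstance(im, Image.Image):
        im.save(f"result_{idx}.png")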
To be continued...
For more details, follow: 杰哥新技术