Emu3 is a native multimodal world model released by the Beijing Academy of Artificial Intelligence (BAAI).
What sets Emu3 apart is its ability to process and understand multiple data types, including text, images, and video, with unified input and output across these modalities.
At its core, Emu3 relies on next-token prediction, an autoregressive approach in which the model is trained to predict the next element in a sequence, whether that element belongs to text, an image, or a video (a minimal sketch of this idea follows the repository link below).
To handle large-scale datasets, Emu3 combines tensor parallelism, context parallelism, and data parallelism to make efficient use of compute resources.
Emu3 has broad application potential and can be used in scenarios such as automatic image captioning, text-based image editing, and video synthesis.
GitHub repository: https://github.com/baaivision/Emu3
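To make the next-token-prediction idea concrete, here is a toy sketch, not Emu3 code: TinyAutoregressiveLM is a hypothetical stand-in for the real transformer, and the random token tensor stands in for a sequence where text, image, and video have already been tokenized into one shared discrete vocabulary. It only illustrates the shift-by-one training objective that applies uniformly across modalities.

import torch
import torch.nn.functional as F

class TinyAutoregressiveLM(torch.nn.Module):
    # Stand-in for a causal transformer: maps each token id to logits
    # over the shared vocabulary. A real model would also attend to
    # all previous positions.
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.head(self.embed(tokens))

vocab_size = 1024                                  # shared text/image/video vocabulary (toy size)
model = TinyAutoregressiveLM(vocab_size)
tokens = torch.randint(0, vocab_size, (1, 16))     # stand-in multimodal token sequence

# The model sees tokens[:, :-1] and is trained to predict tokens[:, 1:],
# regardless of which modality each position originally came from.
logits = model(tokens[:, :-1])
targets = tokens[:, 1:]
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())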
I. Environment Setup
1. Python environment
Python 3.10 or later is recommended.
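A quick way to confirm the interpreter meets this requirement:

python -c "import sys; assert sys.version_info >= (3, 10), sys.version"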
2. pip package installation
pip install torch==2.2.1+cu118 torchvision==0.17.1+cu118 torchaudio==2.2.1 --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
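After installation, a short sanity check confirms that the CUDA build of PyTorch is active (all calls below are standard PyTorch APIs). Note that the attn_implementation="flash_attention_2" option used in the test code further requires the flash-attn package, typically installed with pip install flash-attn --no-build-isolation.

import torch

# Confirm the CUDA build is installed and a GPU is visible.
print(torch.__version__)           # expected: 2.2.1+cu118
print(torch.cuda.is_available())   # expected: True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))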
3. Download the Emu3-Chat model:
git lfs install
git clone https://www.modelscope.cn/BAAI/Emu3-Chat.git
4. Download the Emu3-Gen model:
git lfs install
git clone https://www.modelscope.cn/BAAI/Emu3-Gen.git
5. Download the Emu3-VisionTokenizer model:
git lfs install
git clone https://www.modelscope.cn/BAAI/Emu3-VisionTokenizer.git
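If git lfs is inconvenient, the same repositories can usually be fetched through ModelScope's Python API instead. This is a sketch assuming the modelscope package is installed (pip install modelscope); the cache_dir path is only an example.

from modelscope import snapshot_download

# Download each repository into a local cache directory (example path).
for repo in ("BAAI/Emu3-Chat", "BAAI/Emu3-Gen", "BAAI/Emu3-VisionTokenizer"):
    local_dir = snapshot_download(repo, cache_dir="./models")
    print(repo, "->", local_dir)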
II. Functional Testing
1. Running the test:
(1) Python invocation test
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
from transformers.generation.configuration_utils import GenerationConfig
import torch

from emu3.mllm.processing_emu3 import Emu3Processor


class Emu3ImageDescriber:
    def __init__(self, model_path, vq_path, device="cuda:0"):
        self.device = device

        # Load models and processor
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            device_map=device,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
            trust_remote_code=True,
        ).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, padding_side="left")
        self.image_processor = AutoImageProcessor.from_pretrained(vq_path, trust_remote_code=True)
        self.image_tokenizer = AutoModel.from_pretrained(vq_path, device_map=device, trust_remote_code=True).eval()
        self.processor = Emu3Processor(self.image_processor, self.image_tokenizer, self.tokenizer)

    def describe_image(self, image_path, prompt="Please describe the image", max_new_tokens=1024):
        try:
            # Open image (batched twice, matching the upstream example)
            image = Image.open(image_path)
            text = [prompt] * 2
            images = [image] * 2

            # Process inputs ('U' = understanding mode)
            inputs = self.processor(
                text=text,
                image=images,
                mode='U',
                padding_image=True,
                padding="longest",
                return_tensors="pt",
            )

            # Generation configuration
            generation_config = GenerationConfig(
                pad_token_id=self.tokenizer.pad_token_id,
                bos_token_id=self.tokenizer.bos_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
            )

            # Generate outputs
            outputs = self.model.generate(
                inputs.input_ids.to(self.device),
                generation_config=generation_config,
                max_new_tokens=max_new_tokens,
                attention_mask=inputs.attention_mask.to(self.device),
            )

            # Decode only the newly generated tokens
            outputs = outputs[:, inputs.input_ids.shape[-1]:]
            answers = self.processor.batch_decode(outputs, skip_special_tokens=True)
            return answers
        except Exception as e:
            return str(e)


if __name__ == "__main__":
    # Paths
    EMU_HUB = "BAAI/Emu3-Chat"
    VQ_HUB = "BAAI/Emu3-VisionTokenizer"
    image_path = "assets/demo.png"

    # Initialize and execute
    describer = Emu3ImageDescriber(EMU_HUB, VQ_HUB)
    descriptions = describer.describe_image(image_path)

    # Print results
    for desc in descriptions:
        print(desc)
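(2) Text-to-image generation test (sketch)
The test above exercises Emu3-Chat (vision-language understanding). For completeness, here is a text-to-image sketch for Emu3-Gen, adapted from the project README; helper names such as build_prefix_constrained_fn, the mode='G' flag, and the sampling hyperparameters come from that README and should be verified against the repository before use. The empty negative prompt as the unconditional branch is an assumption here.

from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
from transformers.generation.configuration_utils import GenerationConfig
from transformers.generation import LogitsProcessorList, PrefixConstrainedLogitsProcessor, UnbatchedClassifierFreeGuidanceLogitsProcessor
import torch

from emu3.mllm.processing_emu3 import Emu3Processor

EMU_HUB = "BAAI/Emu3-Gen"
VQ_HUB = "BAAI/Emu3-VisionTokenizer"

# Load the generation model and vision tokenizer, mirroring the setup above.
model = AutoModelForCausalLM.from_pretrained(
    EMU_HUB,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()
processor = Emu3Processor(image_processor, image_tokenizer, tokenizer)

# Prompt and classifier-free guidance scale (example values from the README).
prompt = "a portrait of young girl. masterpiece, film grained, best quality."
classifier_free_guidance = 3.0

kwargs = dict(mode='G', ratio="1:1", image_area=model.config.image_area, return_tensors="pt")
pos_inputs = processor(text=prompt, **kwargs)
neg_inputs = processor(text="", **kwargs)  # unconditional branch for guidance (assumed empty prompt)

generation_config = GenerationConfig(
    use_cache=True,
    eos_token_id=model.config.eos_token_id,
    pad_token_id=model.config.pad_token_id,
    max_new_tokens=40960,
    do_sample=True,
    top_k=2048,
)

# Constrain decoding to a valid image-token layout of the requested size.
h, w = pos_inputs.image_size[0]
constrained_fn = processor.build_prefix_constrained_fn(h, w)
logits_processor = LogitsProcessorList([
    UnbatchedClassifierFreeGuidanceLogitsProcessor(
        classifier_free_guidance, model,
        unconditional_ids=neg_inputs.input_ids.to("cuda:0"),
    ),
    PrefixConstrainedLogitsProcessor(constrained_fn, num_beams=1),
])

outputs = model.generate(
    pos_inputs.input_ids.to("cuda:0"),
    generation_config,
    logits_processor=logits_processor,
    attention_mask=pos_inputs.attention_mask.to("cuda:0"),
)

# Decode the generated token stream back into images and save them.
for idx, im in enumerate(processor.decode(outputs[0])):
    if isinstance(im, Image.Image):
        im.save(f"result_{idx}.png")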
To be continued...
For more details, follow: 杰哥新技术