
image extrapolation in arbitrary directions. Even though our
model is not designed for such tasks, it obtains performance
comparable to that of the dedicated models on each task.
2. Related Work
2.1. Image Synthesis
Deep generative models [12, 17, 29, 34, 40, 43, 44, 51] have achieved great success in image synthesis tasks. GAN-based methods demonstrate an impressive capability for yielding high-fidelity samples [4, 17, 27, 42, 51]. In contrast, likelihood-based methods, such as Variational Autoencoders (VAEs) [29, 43], Diffusion Models [12, 24, 40], and Autoregressive Models [34, 44], offer distribution coverage and hence can generate more diverse samples [40, 43, 44].
However, maximizing likelihood directly in pixel space can be challenging. Instead, VQVAE [37, 45] proposes to generate images in a latent space in two stages. The first stage, known as tokenization, compresses images into a discrete latent space, and primarily consists of three components:
• an encoder E that learns to tokenize images $x \in \mathbb{R}^{H \times W \times 3}$ into latent embeddings E(x),
• a codebook $e_k \in \mathbb{R}^D$, $k \in \{1, 2, \cdots, K\}$, which serves as a nearest-neighbor lookup table used to quantize the embeddings into visual tokens, and
• a decoder G which predicts the reconstructed image $\hat{x}$ from the visual tokens e.
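The nearest-neighbor lookup performed by the codebook can be sketched as follows; this is a minimal NumPy illustration, and the function name, shapes, and toy data are ours, not the paper's:

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder embedding to its nearest codebook entry.

    z_e: (N, D) encoder outputs E(x), flattened over spatial positions.
    codebook: (K, D) code vectors e_k.
    Returns the visual token indices (N,) and quantized embeddings (N, D).
    """
    # Squared Euclidean distance from every embedding to every code vector.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    tokens = dists.argmin(axis=1)   # nearest-neighbor lookup
    return tokens, codebook[tokens]

rng = np.random.default_rng(0)
tokens, z_q = quantize(rng.normal(size=(16, 8)), rng.normal(size=(64, 8)))
```

The decoder G then maps the quantized embeddings `z_q` (rather than the continuous encoder outputs) back to pixels.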
In the second stage, it first predicts the latent priors of the visual tokens using a deep autoregressive model, and then uses the decoder from the first stage to map the token sequences into image pixels. Several approaches have followed this paradigm due to its efficacy. DALL-E [35] uses Transformers [46] to improve token prediction in the second stage. VQGAN [15] adds an adversarial loss and a perceptual loss [26, 52] in the first stage to improve image fidelity. A contemporary work to ours, VIM [49], proposes to use a ViT backbone [13] to further improve the tokenization stage. Since these approaches still employ an autoregressive model, the decoding time in the second stage scales with the token sequence length.
2.2. Masked Modeling with Bi-directional Transformers
The Transformer architecture [46] was first proposed in NLP and has recently extended its reach to computer vision [6, 13]. A Transformer consists of multiple self-attention layers, allowing interactions between all pairs of elements in the sequence to be captured. In particular, BERT [11] introduces the masked language modeling (MLM) task for
language representation learning. The bi-directional self-attention used in BERT [11] allows the masked tokens in MLM to be predicted utilizing context from both directions.

Figure 3. Pipeline Overview. MaskGIT follows a two-stage design, with 1) a tokenizer that tokenizes images into visual tokens, and 2) a bidirectional transformer model that performs MVTM, i.e. learns to predict visual tokens masked at random.
In vision, the masked modeling in BERT [11] has been extended to image representation learning [2, 21] with images quantized to discrete tokens. However, few works have successfully applied the same masked modeling to image generation [53] because of the difficulty of performing autoregressive decoding with bi-directional attention. To our knowledge, this paper provides the first evidence demonstrating the efficacy of masked modeling for image generation on the common ImageNet benchmark. Our work is inspired by bi-directional machine translation [16, 19, 20] in NLP, and our novelty lies in the proposed new masking strategy and decoding algorithm which, as substantiated by our experiments, are essential for image generation.
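The contrast between autoregressive and bi-directional attention can be made concrete through their attention masks; the sketch below is ours (a boolean True means position j is visible to position i, one of several conventions used in practice):

```python
import numpy as np

def causal_mask(n):
    # Autoregressive decoding: token i may attend only to positions j <= i,
    # so generation must proceed left to right.
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    # BERT-style MLM: every token attends to every position, so a masked
    # token can draw on context from both directions at once.
    return np.ones((n, n), dtype=bool)
```

With the bidirectional mask, all masked positions can be predicted in a single forward pass, which is what enables the parallel decoding explored in this work.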
3. Method
Our goal is to design a new image synthesis paradigm utilizing parallel decoding and bi-directional generation. We follow the two-stage approach discussed in 2.1, as illustrated in Figure 3. Since our goal is to improve the second stage, we employ the same setup as the VQGAN model [15] for the first stage, and leave potential improvements to future work. For the second stage, we propose to learn a bidirectional transformer by Masked Visual Token Modeling (MVTM). We introduce MVTM training in 3.1 and the sampling procedure in 3.2. We then discuss the key techniques of the mask design in 3.3.
3.1. MVTM in Training
Let $Y = [y_i]_{i=1}^N$ denote the latent tokens obtained by feeding the image to the VQ encoder, where N is the length of the reshaped token matrix, and let $M = [m_i]_{i=1}^N$ denote the corresponding binary mask. During training, we sample a subset of tokens and replace them with a special [MASK] token. The token $y_i$ is replaced with [MASK] if $m_i = 1$; otherwise, when $m_i = 0$, $y_i$ is left intact. The sampling procedure is parameterized by a mask scheduling function $\gamma$ and executes as follows:
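The masking procedure can be sketched as below. This is a minimal illustration under our own assumptions: the cosine form of $\gamma$ is shown only as one possible schedule (the actual mask design is discussed in 3.3), and the [MASK] id is a hypothetical placeholder value:

```python
import math
import numpy as np

MASK_ID = -1  # stand-in id for the special [MASK] token (hypothetical value)

def gamma(r):
    # Mask scheduling function; its form is a design choice (see Sec. 3.3).
    # A cosine schedule is used here purely as an illustration.
    return math.cos(math.pi / 2 * r)

def mask_tokens(Y, rng):
    """Replace a random subset of the tokens Y with [MASK] (where m_i = 1)."""
    N = Y.shape[0]
    r = rng.uniform(0.0, 1.0)                 # sampled ratio fed to gamma
    n_mask = int(math.ceil(gamma(r) * N))     # number of tokens to mask
    m = np.zeros(N, dtype=bool)
    m[rng.choice(N, size=n_mask, replace=False)] = True
    return np.where(m, MASK_ID, Y), m         # masked sequence and mask M

rng = np.random.default_rng(0)
Y = np.arange(1, 11)            # toy token sequence, N = 10
Y_masked, m = mask_tokens(Y, rng)
```

The bidirectional transformer is then trained to predict the original tokens at the masked positions from the unmasked context.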