
image extrapolation in arbitrary directions. Even though our
model is not designed for such tasks, it obtains performance
comparable to that of the dedicated models on each task.
2. Related Work
2.1. Image Synthesis
Deep generative models [12, 17, 29, 34, 40, 43, 44, 51] have achieved great success in image synthesis tasks. GAN-based methods demonstrate an impressive capability for yielding high-fidelity samples [4, 17, 27, 42, 51]. In contrast, likelihood-based methods, such as Variational Autoencoders (VAEs) [29, 43], Diffusion Models [12, 24, 40], and Autoregressive Models [34, 44], offer distribution coverage and hence can generate more diverse samples [40, 43, 44].
However, maximizing likelihood directly in pixel space can be challenging. Instead, VQVAE [37, 45] proposes to generate images in a latent space in two stages. The first stage, known as tokenization, compresses images into a discrete latent space, and primarily consists of three components:
• an encoder E that learns to tokenize images $x \in \mathbb{R}^{H \times W \times 3}$ into latent embeddings E(x),
• a codebook $e_k \in \mathbb{R}^D$, $k \in \{1, 2, \cdots, K\}$, which serves as a nearest-neighbor lookup table used to quantize the embeddings into visual tokens, and
• a decoder G which predicts the reconstructed image $\hat{x}$ from the visual tokens e.
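The nearest-neighbor lookup performed by the codebook can be sketched as follows; this is a minimal NumPy illustration, and the function name, shapes, and toy data are ours, not the paper's:

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder embedding to its nearest codebook entry.

    z_e: (N, D) encoder outputs E(x), flattened over spatial positions.
    codebook: (K, D) code vectors e_k.
    Returns the visual token indices (N,) and quantized embeddings (N, D).
    """
    # Squared Euclidean distance from every embedding to every code vector.
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    tokens = dists.argmin(axis=1)   # nearest-neighbor lookup
    return tokens, codebook[tokens]

rng = np.random.default_rng(0)
tokens, z_q = quantize(rng.normal(size=(16, 8)), rng.normal(size=(64, 8)))
```

The decoder G then maps the quantized embeddings `z_q` (rather than the continuous encoder outputs) back to pixels.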
In the second stage, it first predicts the latent priors of the visual tokens using a deep autoregressive model, and then uses the decoder from the first stage to map the token sequences into image pixels. Several approaches have followed this paradigm due to its efficacy. DALL-E [35] uses Transformers [46] to improve token prediction in the second stage. VQGAN [15] adds an adversarial loss and a perceptual loss [26, 52] in the first stage to improve image fidelity. A contemporary work to ours, VIM [49], proposes to use a ViT backbone [13] to further improve the tokenization stage. Since these approaches still employ an autoregressive model, the decoding time in the second stage scales with the token sequence length.
2.2. Masked Modeling with Bi-directional Transformers
The Transformer architecture [46] was first proposed in NLP and has recently extended its reach to computer vision [6, 13]. A Transformer consists of multiple self-attention layers, allowing interactions between all pairs of elements in the sequence to be captured. In particular, BERT [11] introduces the masked language modeling (MLM) task for
language representation learning. The bi-directional self-attention used in BERT [11] allows the masked tokens in MLM to be predicted utilizing context from both directions.

Figure 3. Pipeline Overview. MaskGIT follows a two-stage design, with 1) a tokenizer that tokenizes images into visual tokens, and 2) a bidirectional transformer model that performs MVTM, i.e. learns to predict visual tokens masked at random.
In vision, the masked modeling in BERT [11] has been extended to image representation learning [2, 21] with images quantized to discrete tokens. However, few works have successfully applied the same masked modeling to image generation [53] because of the difficulty of performing autoregressive decoding with bi-directional attention. To our knowledge, this paper provides the first evidence demonstrating the efficacy of masked modeling for image generation on the common ImageNet benchmark. Our work is inspired by bi-directional machine translation [16, 19, 20] in NLP, and our novelty lies in the proposed new masking strategy and decoding algorithm which, as substantiated by our experiments, are essential for image generation.
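The contrast between autoregressive and bi-directional attention can be made concrete through their attention masks; the sketch below is ours (a boolean True means position j is visible to position i, one of several conventions used in practice):

```python
import numpy as np

def causal_mask(n):
    # Autoregressive decoding: token i may attend only to positions j <= i,
    # so generation must proceed left to right.
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n):
    # BERT-style MLM: every token attends to every position, so a masked
    # token can draw on context from both directions at once.
    return np.ones((n, n), dtype=bool)
```

With the bidirectional mask, all masked positions can be predicted in a single forward pass, which is what enables the parallel decoding explored in this work.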
3. Method
Our goal is to design a new image synthesis paradigm utilizing parallel decoding and bi-directional generation. We follow the two-stage approach discussed in 2.1, as illustrated in Figure 3. Since our goal is to improve the second stage, we employ the same setup as the VQGAN model [15] for the first stage, and leave potential improvements to future work. For the second stage, we propose to learn a bidirectional transformer by Masked Visual Token Modeling (MVTM). We introduce MVTM training in 3.1 and the sampling procedure in 3.2. We then discuss the key techniques of the mask design in 3.3.
3.1. MVTM in Training
Let $Y = [y_i]_{i=1}^N$ denote the latent tokens obtained by feeding the image to the VQ encoder, where N is the length of the reshaped token matrix, and let $M = [m_i]_{i=1}^N$ denote the corresponding binary mask. During training, we sample a subset of tokens and replace them with a special [MASK] token. The token $y_i$ is replaced with [MASK] if $m_i = 1$; otherwise, when $m_i = 0$, $y_i$ is left intact. The sampling procedure is parameterized by a mask scheduling function $\gamma$ and executes as follows:
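The masking procedure can be sketched as below. This is a minimal illustration under our own assumptions: the cosine form of $\gamma$ is shown only as one possible schedule (the actual mask design is discussed in 3.3), and the [MASK] id is a hypothetical placeholder value:

```python
import math
import numpy as np

MASK_ID = -1  # stand-in id for the special [MASK] token (hypothetical value)

def gamma(r):
    # Mask scheduling function; its form is a design choice (see Sec. 3.3).
    # A cosine schedule is used here purely as an illustration.
    return math.cos(math.pi / 2 * r)

def mask_tokens(Y, rng):
    """Replace a random subset of the tokens Y with [MASK] (where m_i = 1)."""
    N = Y.shape[0]
    r = rng.uniform(0.0, 1.0)                 # sampled ratio fed to gamma
    n_mask = int(math.ceil(gamma(r) * N))     # number of tokens to mask
    m = np.zeros(N, dtype=bool)
    m[rng.choice(N, size=n_mask, replace=False)] = True
    return np.where(m, MASK_ID, Y), m         # masked sequence and mask M

rng = np.random.default_rng(0)
Y = np.arange(1, 11)            # toy token sequence, N = 10
Y_masked, m = mask_tokens(Y, rng)
```

The bidirectional transformer is then trained to predict the original tokens at the masked positions from the unmasked context.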