0% found this document useful (0 votes)
47 views25 pages

Shift-Equivariant Vision Transformers

Uploaded by

rafsun.sheikh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
47 views25 pages

Shift-Equivariant Vision Transformers

Uploaded by

rafsun.sheikh
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 25

Making Vision Transformers Truly Shift-Equivariant

Renan A. Rojas-Gomez⋆ Teck-Yian Lim⋆ Minh N. Do⋆ Raymond A. Yeh†



University of Illinois at Urbana-Champaign

Purdue University
{renanar2, tlim11, minhdo}@illinois.edu [email protected]
arXiv:2305.16316v2 [cs.CV] 28 Nov 2023

Abstract of each module. This includes redesigning the tokenization,


self-attention, patch merging, and positional encoding mod-
For computer vision, Vision Transformers (ViTs) have be- ules. Our design principle is to perform an alignment that is
come one of the go-to deep net architectures. Despite being dependent on the input signal, i.e., each module’s behavior
inspired by Convolutional Neural Networks (CNNs), ViTs’ is adaptive to the input, hence we prepend each module’s
output remains sensitive to small spatial shifts in the input, name with (A)daptive. All our adaptive modules are provably
i.e., not shift invariant. To address this shortcoming, we intro- shift-equivariant and realizable in practice.
duce novel data-adaptive designs for each of the modules in Our adaptive design enables truly shift-invariant ViTs for
ViTs, such as tokenization, self-attention, patch merging, and image classification and truly shift-equivariant ViTs for se-
positional encoding. With our proposed modules, we achieve mantic segmentation, i.e., achieving 100% shift consistency.
true shift-equivariance on four well-established ViTs, namely, We conduct extensive experiments on CIFAR10/100 [22],
Swin, SwinV2, CvT, and MViTv2. Empirically, we evaluate and ImageNet for classification; as well as ADE20K [56]
the proposed adaptive models on image classification and for semantic segmentation. We empirically show that our
semantic segmentation tasks. These models achieve com- architectures improve shift consistency and have competi-
petitive performance across three different datasets while tive performance on four well-established ViTs: Swin [28],
maintaining 100% shift consistency. SwinV2 [29], CvT [49], and MViTv2 [24].
Our contributions are as follows:
• We propose a family of ViT modules that are provably
1. Introduction circularly shift-equivariant.
Vision Transformers (ViTs) [13, 14, 24, 28, 29, 49] have • With these proposed modules, we build truly shift-invariant
become a strong alternative to convolutional neural networks and equivariant ViTs, achieving 100% circular shift-
(CNNs) in computer vision, superseding their dominance in consistency measured from end-to-end.
image classification and becoming the state-of-the-art model • Extensive experiments on image classification and seman-
on ImageNet [11]. Unlike the original Transformer [45] pro- tic segmentation demonstrate the effectiveness of our ap-
posed for natural language processing (NLP), ViTs incorpo- proach in terms of performance and shift consistency.
rate suitable inductive biases for computer vision. Consider 2. Related Work
image classification, where an input shift does not change
the underlying image label, i.e., the task is shift-invariant. We briefly discuss ViTs and shift-equivariant CNNs. Addi-
Several ViTs accredited shift-invariance as the motivation tional background concepts are reviewed in Sec. 3.
for the proposed architecture. For instance, Wu et al. [49] Vision transformers. Proposed by Vaswani et al. [45] for
state that their ViT model brings “desirable properties of NLP tasks, the Transformer architecture incorporates a to-
CNNs to the ViT architecture (i.e. shift, scale, and distortion kenizer, positional encoding, and attention mechanism into
invariance).” Similarly, Liu et al. [28] found that “induc- one architecture. Soon after, the Transformer found its suc-
tive bias that encourages certain translation invariance is cess in computer vision by incorporating suitable inductive
still preferable for general-purpose visual modeling.” Sur- biases, such as shift equivariance, giving rise to the area of Vi-
prisingly, despite these proposed architecture changes, ViTs sion Transformers. Seminal works include: ViT [13], which
remain sensitive to spatial shifts. This motivates us to study treats an image as 16 × 16 tokens; Swin [28, 29], which in-
how to make ViTs truly shift-invariant and equivariant. troduces locally shifted windows to the attention mechanism;
In this work, we carefully re-examine each of the building CvT [49], which introduces convolutional layers to ViTs;
blocks in ViTs and propose novel shift-equivariant versions and MViT [14, 24], which introduces a multi-scale pyramid

1
(a) Original tokenization (token). (b) Proposed adaptive tokenization (A-token).

Figure 1. Re-designing ViT’s tokenization towards shift-equivariance: (a) The original patch embedding is sensitive to small input shifts
due to the fixed grid used to split an image into patches. (b) Our adaptive tokenization A-token is a generalization that consistently selects
the group of patches with the highest energy, despite input shifts.

structure. A more in-depth discussion on ViTs can be found equivariance. For readability, the concepts are described in
in recent surveys [15, 18]. In this work, we re-examine ViTs’ 1D. In practice, these are extended to multi-channel images.
modules and present a novel adaptive design that enables Equivariance. Conceptually, equivariance describes how a
truly shift-equivariant ViTs. function’s input and output are related to each other under
Concurrently on ArXiv, Ding et al. [12] propose a a predefined set of transformations. For example, in image
polyphase anchoring method to obtain circular shift invariant segmentation, shift equivariance means that shifting the input
ViTs in image classification. Differently, our method demon- image results in shifting the output mask. In our analysis, we
strates improvements in circular and linear shift consistency consider circular shifts and denote them as
for both image classification and semantic segmentation
SN x [n] = x[(n + 1) mod N ], x ∈ RN

while maintaining competitive task performance. (1)
Invariant and equivariant CNNs. Prior works [1, 55] have
shown that modern CNNs [17, 23, 40, 43] are not shift- to ensure that the shifted signal x remains within its sup-
equivariant due to the usage of pooling layers. To improve port. Following Rojas-Gomez et al. [37], we say a func-
shift-equivariance, anti-aliasing [47] is introduced before tion f : RN 7→ RM is SN , {SM , I}-equivariant, i.e., shift-
downsampling. Specifically, Zhang [55] and Zou et al. [57] equivariant, iff ∃ S ∈ {SM , I} s.t.
propose to use a low-pass filter (LPF) for anti-aliasing.
f (SN x) = Sf (x) ∀x ∈ RN (2)
While anti-aliasing improves shift-equivariance, the over-
all CNN remains not truly shift-equivariant. To address where I denotes the identity mapping. This definition care-
this, Chaman and Dokmanic [3] propose Adaptive Polyphase fully handles the case where N > M . For instance, when
Sampling (APS), which selects the downsampling indices downsampling by a factor of two, an input shift by one
based on the ℓ2 norm of the input’s polyphase compo- should ideally induce an output shift by 0.5, which is not
nents. Rojas-Gomez et al. [37] then improve APS by propos- realizable on the integer grid. This 0.5 has to be rounded up
ing a learnable downsampling module (LPS) that is truly or down, hence a shift SM or a no-shift I, respectively.
shift-equivariant. Michaeli et al. [32] propose the use of Invariance. For classification, a label remains unchanged
polynomial activations to obtain shift-invariant CNNs. In when the image is shifted, i.e., it is shift-invariant. A function
contrast, we present a family of modules that enables truly f : RN 7→ RM is SN , {I}-equivariant (invariant) iff
shift-equivariant ViTs. We emphasize that CNN methods are
not applicable to ViTs due to their distinct architectures. f (SN x) = f (x) ∀x ∈ RN . (3)
Beyond the scope of this work, general equivariance [2,
4, 20, 35, 38, 39, 41, 44, 46, 48, 53] have also been studied. A common way to design a shift-invariant function under
Equivariant networks have also been applied to sets [16, 31, circular P
shifts is via global spatial pooling [25], defined as
34, 36, 51, 54], graphs [9, 10, 19, 26, 27, 30, 33, 42, 52], g(x) = m x[m]. Given a shift-equivariant function f :
spherical images [5, 6, 21], etc. X X X
f (SN x)[m] = Sf (x)[m] = f (x)[m]. (4)
3. Preliminaries m m m

We review the basics before introducing our approach, However, note that ViTs using global spatial pooling after
including the aspects of current ViTs that break shift- extracting features are not shift-invariant, as preceding layers

2
(a) Window-based self-attention (WSA) (b) Proposed adaptive window-based self-attention (A-WSA)

Figure 2. Re-designing window-based self-attention towards shift-equivariance: (a) The window-based self-attention WSA breaks shift
equivariance by selecting windows without considering their input properties. (b) Our proposed adaptive window-based self-attention selects
the best grid of windows based on their average energy, obtaining windows comprised of the same tokens despite input shifts.

(k) ⊤
∈ RW ×D de-

such as tokenization, window-based self-attention, and patch where T̄W = TW k . . . TW (k+1)−1
merging are not shift-equivariant, which we review next. th
notes the k window comprised by W neighboring tokens
Tokenization (token). ViTs split an input x ∈ RN into (W consecutive rows of T ). Note that Eq. (9) uses semi-
non-overlapping patches of length L and project them into a colons (;) as row separators.
latent space to generate tokens. Swin [28, 29] architectures take advantage of WSA to
N decrease the computational cost while adopting a shifting
token(x) = XE ∈ R L ×D (5) scheme (at the window level) to allow long-range connec-
tions. We note that WSA is not shift-equivariant, e.g., any
where E ∈ RL×D is a linear projection and X =
h i⊤ shift that is not a multiple of the window size changes the
N
reshape(x) = X0 . . . X NL −1 ∈ R L ×L is a ma- tokens within each window, as illustrated in Fig. 2a.
trix where each row corresponds to the k th patch of x, i.e., Patch merging (PMerge). Given input tokens T =
⊤
T0 . . . TM −1 ∈ RM ×D and a patch length P , patch

Xk = x[Lk : L(k + 1) − 1] ∈ RL . (6) merging is defined as a linear projection of vectorized token
patches:
Eq. (5) implies that token is not shift-equivariant. Patches
are extracted based on a fixed grid, so different patches will M
PMerge(T ) = T̃ Ẽ ∈ R P ×D̃ (10)
be obtained if the input is shifted, as illustrated in Fig. 1a. h M
i ⊤
Self-Attention (SA). In ViTs, self-attention is defined as with T̃ = vec(T̄P(0) ) . . . vec(T̄P( P −1) ) .
√ ′
SA(T ) = softmax(QK ⊤ / D′ )V ∈ RM ×D (7)
(k)
Here, vec(T̄P ) ∈ RP D is the vectorized version of the
⊤ ⊤
∈ RM ×D denotes input (k)

where T = T0 . . . TM −1 k th patch T̄P = TP k . . . TP (k+1)−1 ∈ RP ×D , and

tokens, and softmax is the softmax normalization along Ẽ ∈ RP D×D̃ is a linear projection.
rows. Queries Q, keys K and values V correspond to: PMerge reduces the number of tokens after applying
Q = T EQ, K = T EK , V = T EV (8) self-attention while increasing the token length, i.e., D̃ > D.
This is similar to the CNN strategy of increasing the number

Q/K/V
∈ RD×D . The term of channels using convolutional layers while decreasing their
with linear projections√ E

softmax QK / D ∈ [0, 1] ′ M ×M
ensures that the out- spatial resolution via pooling. Since patches are selected on
put token is a convex combination of the computed values. a fixed grid, PMerge is not shift-equivariant.
Window-based self-attention (WSA). A crucial limitation of Relative position embedding (RPE). As self-attention is
self-attention is its quadratic computational cost with respect permutation equivariant, spatial information must be explic-
to the number of input tokens M . To alleviate this, window- itly added to the tokens. Typically, RPE adds a position
based self-attention [28] groups tokens into local windows matrix representing the relative distance between queries
and then performs self-attention within each window. Given and keys into self-attention as follows:
input tokens T ∈ RM ×D and a window size W , window-
QK ⊤
 

based self-attention WSA(T ) ∈ RM ×D is defined as: SA(rel) (T ) = softmax √ + E (rel) V (11)
h i D′
(0)  ( M −1)  (Q) (K)
WSA(T ) = SA T̄W ; . . . ; SA T̄WW (9) with E (rel) [i, j] = B (rel) [pi − pj ]. (12)

3
Here, E (rel) ∈ RM ×M is constructed from an embedding we show that the remainder determines the relation between
(Q) (K)
lookup table B (rel) ∈ R2M −1 and the index [pi − pj ] the L token representations of an input and its shifted version,
denotes the distance between the ith query token at position while the quotient causes a circular shift in the token index.
(Q) (K)
pi and the j th key token at position pj . As embeddings The complete proof is deferred to Appendix Sec. A1.
are selected based on the relative distance, RPE allows ViTs Lemma 1 shows that, for any index m, there exists m̂ =
to capture spatial relationships, e.g., knowing whether two (m + 1) mod L such that X (m) and X̂ (m̂) are equal up to a
tokens are spatially nearby. circular shift. In Claim 1, we use this property to demonstrate
the shift-equivariance of our proposed adaptive tokenization.
4. Truly Shift-equivariant ViT
Claim 1. Shift-equivariance of adaptive tokenization.
In the previous section, we identified components of ViTs If F in Eq. (14) is shift-invariant, then A-token is
that are not shift-equivariant. To achieve shift-equivariance in shift-equivariant, i.e., ∃ mq ∈ {0, . . . , L − 1} s.t.
ViTs, we redesign four modules: tokenization, self-attention,  mq
patch merging, and positional embedding. As equivariance A-token SN x = S⌊N/L⌋ A-token(x). (16)
is preserved under compositions, ViTs using these modules
are end-to-end shift-equivariant. Proof. Given m⋆ in Eq. (14), Lemma 1 asserts the existence

Adaptive tokenization (A-token). Standard tokenization of m̂ such that X̂ (m̂) E = X (m ) E up to a circular shift.
splits an input into patches using a regular grid, breaking Since x and SN x have the same L token representations and
shift-equivariance. We propose a data-dependent alternative assuming a shift-invariant F , we show that A-token(SN x)
that selects patches that maximize a shift-invariant function, is equal to X̂ (m̂) E, which is a circularly shifted version of
resulting in the same tokens regardless of input shifts. A-token(x). See Appendix Sec. A1 for the full proof.
Given an input x ∈ RN and a patch length L, Our adap- Adaptive window-based self-attention (A-WSA). WSA’s
tive tokenization is defined as window partitioning is shift-sensitive, as different windows
⋆ N
A-token(x) = X (m ) E ∈ R L ×D (13) are obtained when the input tokens are circularly shifted by
⋆ (m)
a non-multiple of the window size. Instead, we propose an
with m = arg max F (X E). (14) adaptive token shifting method to obtain a consistent window
m∈{0,...,L−1}
partition. By selecting the offset based on the energy of each
N
Here, X (m) = reshape(SN m
x) ∈ R L ×L is the reshaped possible partition, our method generates the same windows
version of the input circularly shifted by m samples, E ∈ regardless of input shifts.
N ⊤
RL×D is a linear projection and F : R L ×D 7→ R is a shift- Given input tokens T = T0 . . . TM −1 ∈ RM ×D

M
invariant function. Note that the token representation of an and a window size W , let vW ∈ R⌊ W ⌋ denote the average
input is only affected by circular shifts up to the patch size ℓp -norm (energy) of each token window:
L. For any shift greater than L − 1, there is a shift smaller
W −1
than L that generates the same tokens (up to a circular shift). 1 X
So, an input has L different token representations. Fig. 1b vW [k] = ∥T(W k+l) mod M ∥p . (17)
W
l=0
illustrates our proposed shift-equivariant tokenization.
Our adaptive tokenization maximizes a shift-invariant Then, the energy of the windows resulting from shifting the
(m) M
function to ensure the same token representation regardless input tokens by m indices corresponds to vW ∈ R⌊ W ⌋ ,
of input shifts. Next, we analyze an essential property of (m)
where vW [k] is the energy of the k th window:
X (m) E to prove that A-token is shift-equivariant.
W −1
(m) 1 X m
Lemma 1. L-periodic shift-equivariance of tokenization. vW [k] = ∥(SM T )(W k+l) mod M ∥p . (18)
W
Let input x ∈ RN have a token representation l=0
| {z
=T(W k+m+l)
}
mod M
X E ∈ R⌊N/L⌋×D . If x̂ = SN x (a shifted input),
(m)

then its token representation X̂ (m) E corresponds to: Based on the window energy in Eq. (18), we define the
adaptive window-based self-attention as
⌊(m+1)/L⌋
X̂ (m) E = S⌊N/L⌋ X ((m+1) mod L) E (15) m⋆ ′
T ∈ RM ×D

A-WSA(T ) = WSA SM (19)
⋆ (m) 
This implies that x and x̂ are characterized by the same with m = arg max G vW (20)
L token representations, up to a circular shift along the m∈{0,...,W −1}

token index (row index of X ((m+1) mod L) E). M


where G : R ⌋ 7→ R is a shift-invariant function. By
⌊W

Proof. By definition, X̂ (m)


= m+1
reshape(SN x).
Ex- choosing windows based on m⋆ , A-WSA generates the same
pressing m + 1 in quotient and remainder for divisor L, group of windows despite input shifts, as shown in Claim 2.

4
Claim 2. If G in Eq. (19) is shift invariant, then A-WSA to encode the distance between the ith query token at po-
(Q) (K)
is shift-equivariant. sition pi and the j th key token at position pj . Here,
B (adapt) ∈ RM is the trainable lookup table comprised by
Proof. Given two groups of tokens related by a circular shift, relative positional embeddings. Note that B (adapt) is smaller
and a shift-invariant function G, shifting each group by its than the original B (rel) ∈ R(2M −1) , since relative distances
maximizer in Eq. (19) induces an offset that is a multiple of are now measured in a circular fashion between M tokens.
W . So, both groups are partitioned in the same windows up Segmentation with equivariant upsampling. Segmentation
to a circular shift. Fig. 2b illustrates this consistent window models with ViT backbones continue to use CNN decoders,
grouping. The proof is deferred to Appendix Sec. A1. e.g., Swin [28] uses UperNet [50]. As explained by Rojas-
Gomez et al. [37], to obtain a shift-equivariant CNN decoder,
Adaptive patch merging (A-PMerge). As reviewed the key is to keep track of the downsampling indices and
in Sec. 3, PMerge consists of a vectorization of P neighbor- use them to put features back to their original positions
ing tokens followed by a projection from RP D to RD̃ . So, during upsampling. Different from CNNs, the proposed ViTs
it can be expressed as a strided convolution with D̃ output involve data-adaptive window selections, i.e., we would also
channels, stride factor P and kernel size P . We use this prop- need to keep track of the window indices and account for
erty to propose a shift-equivariant patch merging. their shifts during upsampling.
Claim 3. PMerge corresponds to a strided convolution
with D̃ output channels, striding P and kernel size P . 5. Experiments
Proof. Expressing the linear projection Ẽ as a convolutional We conduct experiments on image classification and se-
matrix, PMerge is equivalent to a convolution sum with mantic segmentation on four ViT architectures, namely,
kernels comprised by columns of Ẽ. Let
 the input tokens Swin [28], SwinV2 [29], CvT [49], and MViTv2 [24]. For
be expressed as T = t0 . . . tD−1 ∈ RM ×D , where

each task, we analyze their performance under circular and
tj ∈ RM corresponds to the j th element of every input token. standard shifts. For circular shifts, the theory matches the
M
Then, PMerge(T ) ∈ R P ×D̃ can be expressed as: experiments and our approach achieves 100% circular shift
consistency (up to numerical errors). We further conduct
PMerge(T ) = D(P ) ( y0 . . . yD̃−1 )
 
(21) experiments on standard shifts to study how our method per-
D−1
X forms under this theory-to-practice gap, where there is loss
with yk = tj ⊛ h(k,j) ∈ RM (22) of information at the boundaries.
j=0

M 5.1. Image classification under circular shifts


where D(P ) ∈ R P ×M is a striding operator of factor P , ⊛
denotes circular convolution and h(k,j) ∈ RP is a kernel. Experiment setup. We conduct experiments on CIFAR-
Details are deferred to Appendix Sec. A1. 10/100 [22], and ImageNet [11]. For all datasets, images are
Following Claim 3, to attain shift-equivariance, the adap- resized to the resolution used by each model’s original imple-
tive polyphase sampling (APS) by Chaman and Dokmanic mentation (224 × 224 for Swin-T, CVT-13, and MViTv2-T;
[3] is adopted as the striding operator. Let APS(P ) denote 256 × 256 for SwinV2-T). Using the default image size
the adaptive polyphase sampling layer of striding factor P . allows us to use the same architecture across all datasets,
M
Then, A-PMerge(T ) ∈ R P ×D̃ corresponds to: i.e., everything follows the original number of layers and
blocks. To avoid boundary conditions, circular padding is
A-PMerge(T ) = APS(P ) ( y0 . . . yD̃−1 )
 
(23) used in all convolutional layers, and circular shifts are used
for evaluating shift consistency.
While PMerge selects M P embeddings by striding, shift- On CIFAR-10/100, all models were trained for 100
equivariance is achieved by adaptively choosing the best M P epochs on two GPUs with batch size 48. The scheduler set-
tokens based on their ℓ2 norm. tings of each model were scaled accordingly. On ImageNet,
Adaptive RPE. While the original relative distance matrix all models were trained for 300 epochs on eight GPUs using
E (rel) is computed by taking into account linear shifts, this their default batch sizes. Refer to Sec. A3 for full experi-
does not match our circular shift assumption; See Fig. 3 for mental details. For CIFAR-10/100, we report average and
a visualization. To obtain perfect shift-equivariance, relative standard deviation metrics over five seeds. Due to computa-
distances must consider the periodicity induced by circu- tional limitations, we report on a single seed for ImageNet.
lar shifts. Hence, we propose the adaptive relative position Evaluation metric. We report the classification top-1 accu-
matrix E (adapt) ∈ RM ×M defined as: racy on the original dataset without any shifts. To quantify
shift-invariance, we also report the circular shift consistency
h i
(Q) (K)
E (adapt) [i, j] = B (adapt) (pi − pj ) mod M (24)
(C-Cons.) which counts how often the predicted labels are

5
(a) Shifted tokens (b) Original relative distance (c) Proposed relative distance
Figure 3. Shift consistent relative distance: (a) Circularly shifted queries and keys (M = 4). (b) Original relative distance used to build the
RPE matrix: p(Q) [i] − p(K) [j]. Since it does not consider the periodicity of circular shifts, relative distances are not preserved. (c) Proposed
relative distance: (p(Q) [i] − p(K) [j])mod M . Our proposed distance is consistent with circular shifts, leading to a shift equivariant RPE.

Circular Shift Standard Shift


Method CIFAR10 CIFAR-100 CIFAR10 CIFAR-100
Top-1 Acc. C-Cons. Top-1 Acc. C-Cons. Top-1 Acc. S-Cons. Top-1 Acc. S-Cons.
Swin-T 90.15 ± .18 83.30 ± .61 71.01 ± .27 65.32 ± .69 90.11 ± .21 86.35 ± .25 71.12 ± .14 69.39 ± .52
A-Swin-T (Ours) 93.39 ± .12 99.99 ± .01 75.11 ± .10 99.99 ± .01 93.50 ± .19 96.00 ± .08 75.12 ± .28 87.70 ± .57
SwinV2-T 89.08 ± .21 89.16 ± .08 69.78 ± .22 75.23 ± .20 89.08 ± .21 91.68 ± .25 69.67 ± .32 80.42 ± .41
A-SwinV2-T (Ours) 91.64 ± .21 99.99 ± .01 72.73 ± .23 99.96 ± .01 91.91 ± .12 95.81 ± .17 72.98 ± .13 88.74 ± .40
CvT-13 90.06 ± .23 75.80 ± 1.2 66.61 ± .33 50.29 ± 1.68 90.05 ± .20 84.66 ± 1.26 66.06 ± .39 63.03 ± .73
A-CvT-13 (Ours) 93.87 ± .14 100 ± .00 76.19 ± .32 100 ± .00 93.71 ± .10 96.47 ± .21 73.04 ± .23 86.96. ± .55
MViTv2-T 96.00 ± .06 86.55 ± 1.2 80.18 ± .34 74.82 ± .73 96.14 ± .06 91.34. ± 1.26 80.28 ± .38 77.92. ± .93
A-MViTv2-T (Ours) 96.41 ± .22 100 ± .00 81.39 ± .11 100 ± .00 96.61 ± .11 98.36. ± .16 81.17 ± .18 92.95. ± .16

Table 1. CIFAR-10/100 classification results: Top-1 accuracy and shift consistency under circular and linear shifts. Bold numbers indicate
improvement over the corresponding baseline architecture. Mean and standard deviation reported on five randomly initialized models.

identical under two different circular shifts. Given a dataset Circular Shift Standard Shift
Method
D = {I}, C-Cons. computes Top-1 Acc. C-Cons. Top-1 Acc. S-Cons.
1 X h  i Swin-T 78.5 86.68 81.18 92.41
E∆1 ,∆2 1 ŷ(S ∆1 (I)) = ŷ(S ∆2 (I)) (25) A-Swin-T (Ours) 79.35 99.98 81.6 93.24
|D|
I∈D SwinV2-T 78.95 87.68 81.76 93.24
where 1 denotes the indicator function, ŷ(I) the class pre- A-SwinV2-T (Ours) 79.91 99.98 82.10 94.04
diction for I, S the circular shift operator, and ∆1 = CvT-13 77.01 86.87 81.59 92.80
(h1 , w1 ), ∆2 = (h2 , w2 ) horizontal and vertical offsets. A-CvT-13 (Ours) 77.05 100 81.48 93.41
MViTv2-T 77.36 90.03 82.21 93.88
Results. We report performance in Tab. 1 and Tab. 2 for
A-MViTv2-T (Ours) 77.46 100 82.4 94.08
CIFAR-10/100 and ImageNet, respectively. Overall, we ob-
serve that our adaptive ViTs achieve near 100% shift con- Table 2. ImageNet classification results: Top-1 accuracy and shift
sistency in practice. The remaining inconsistency is caused consistency under circular and linear shifts. Bold numbers indicate
by numerical precision limitations and tie-breaking leading improvement over the corresponding baseline architecture.
to a different selection of tokens or shifted windows. Be-
yond consistency improvements, our method also improves on the original dataset (without any shifts). To quantify
classification accuracy across all settings. shift-invariance, we report the standard shift consistency
(S-Cons.), which follows the same principle as C-Cons
5.2. Image classification under standard shifts in Eq. (25), but uses a standard shift instead of a circular
Experiment setup. To study the boundary effect on shift- one. For CIFAR-10/100, we use zero-padding at the bound-
invariance, we further conduct experiments using standard aries. due to the small image size. For ImageNet, follow-
shifts. As these are no longer circular, the image content ing Zhang [55], we perform an image shift followed by a
may change at its borders, i.e., perfect shift consistency center-cropping of size 224 × 224. This produces realistic
is no longer guaranteed. For CIFAR10/100, input images shifts and avoids a particular choice of padding.
were resized to the resolution used by each model’s original Results. Tabs. 1 and 2 report performance under standard
implementation. Default data augmentation and optimizer shifts on CIFAR-10/100 and ImageNet, respectively. Due
settings were used for each model while training epochs and to changes in the boundary content, our method does not
batch size followed those used in the circular shift settings. achieve 100% shift consistency. However, we persistently
Evaluation metric. We report top-1 classification accuracy observe that the adaptive models outperform their respective

6
Input CvT-13 A-CvT-13 (Ours) CvT-13 A-CvT-13 (Ours) CvT-13 A-CvT-13 (Ours)
(224 × 224) Block 1 Error (56 × 56) Block 1 Error (56 × 56) Block 2 Error (28 × 28) Block 2 Error (28 × 28) Block 3 Error (14 × 14) Block 3 Error (14 × 14)

Figure 4. Consistent token representations. Shifting inputs by a small offset leads to large deviations (non-zero errors) in the representations
when using default ViTs (e.g., CvT-13). In contrast, our proposed models (e.g., A-CvT-13) achieve an absolute zero-error across all blocks.

Throughput Relative Module Abs. runtime (ms) Delta (ms)


Model # Params
(images/s) change (%) Tokenization 8.37 −
Swin-T 28M 704.07 − A. Tokenization (Ours) 35.89 +27.52
A-Swin-T (Ours) 28M 633.35 10.04 Patch Merging {S2, S3, S4} 0.47, 0.45, 0.45 −
A. Patch Merging (Ours) 7.68, 4.47, 3.09 +7.21, +4.02, +2.64
SwinV2-T 28M 470.81 −
Window Selection Not applied −
A-SwinV2-T (Ours) 28M 405.01 13.98
A. Window Selection (Ours) 7.63 +7.63
CvT-13 20M 535.5 − RPE 2.84 −
A-CvT-13 (Ours) 20M 492.12 10.69 A. RPE (Ours) 9.91 +7.07
MViTv2-T 24M 439.5 −
A-MViTv2-T (Ours) 24M 352.06 19.9 Table 4. Runtime of adaptive ViT modules: Inference runtime of
our adaptive ViT modules and their default versions. Delta indicates
Table 3. Inference throughput: Absolute inference throughput the absolute time difference w.r.t. the default modules.
(images/s) of our adaptive ViTs and their default versions. Relative
change shows the throughput decrease w.r.t. the default models.
Tab. 3 shows our adaptive models exhibit less than a
20% decrease in throughput w.r.t. the default models, while
baselines in terms of S-Cons. Beyond shift consistency, our improving in shift consistency and classification accuracy
adaptive models achieve higher classification performance in without increasing the number of trainable parameters.
all settings except for CvT on ImageNet. Results demonstrate Modules runtime. We compare the runtime of our adaptive
the practical value of our approach despite the gap in theory. modules to that of the default ones. Tokenization, patch
merging and window selection are evaluated on A-Swin,
5.3. Consistency of tokens to input shifts
while RPE is evaluated on A-MViTv2 (A-Swin windows are
We evaluate the effect of small input shifts in the tokens comprised of the same tokens, so its RPE remains unaltered).
obtained by our adaptive models. We verify the stability of Tab. 4 shows the runtime of our adaptive modules, which
our A-CvT-13 model by applying a circular shift of 1 row slightly increases over the default runtime. This is particu-
and 1 column to the input image, computing its tokens, and larly true for the adaptive RPE, where the main difference
calculating their absolute difference to those of the unshifted lies in the distance interpretation (circular vs. linear). While
image. Fig. 4 shows the absolute token difference of an adaptive patch embedding has the largest increase by operat-
ImageNet test sample at all three blocks of A-CvT-13, each ing on full-size images, subsequent patch merging modules
with a different resolution. Similar to previous work [3], we operate on smaller representations and are more efficient.
illustrate errors for the channels with the highest energy.
In contrast to the large deviations across the default CvT- 5.5. Semantic segmentation under circular shifts
13 caused by the input shift, the representations generated by Experiment setup. We conduct semantic segmentation ex-
our proposed A-CvT-13 model remain unaltered, as theoreti- periments on the ADE20K dataset [56] using A-Swin and
cally shown, leading to a perfectly shift-equivariant model. A-SwinV2 models as backbones and compare them against
their default versions. Following previous work [28], we use
5.4. Throughput and runtime analysis
UperNet [50] as the segmentation decoder. Similar to our
We evaluate the inference throughput, measured in processed classification settings under circular shifts, all convolutional
images per second, of our adaptive ViTs and modules over layers in the UperNet model use circular padding to avoid
100 forward passes (batch size 128, default image size per boundary conditions, and circular shifts are used to measure
model) on a single NVIDIA Quadro RTX 5000 GPU. shift consistency. Models are trained for 160K iterations on
Model inference. We report the throughput of our adaptive a total batch size of 16 using the default augmentation.
ViTs and their default versions. We also measure the relative Evaluation metric. For segmentation performance, we re-
change, which corresponds to the throughput decrease with port the mean intersection over union (mIoU) on the original
respect to the default models. dataset (without any shifts). For shift-equivariance, we re-

7
Image Shifted Image SwinV2 + UperNet Predictions A-SwinV2 + UperNet Predictions (Ours)

Figure 5. Segmentation under standard shifts: Our A-SwinV2 + UperNet model improves robustness to input shifts over the original
model, generating consistent predictions while improving accuracy. Examples of prediction changes due to shifts are highlighted in red.

Circular Shift Standard Shift Configuration Top-1 Acc. C-Cons.


Backbone
mIoU mASCC mIoU mASSC A-Swin-T (Ours) 93.39 ± .13 100
Swin-T 42.93 87.32 44.2 93.37 (i) No A-token 93.66 ± .19 96.29 ± .20
A-Swin-T (Ours) 43.44 100 44.43 93.48 (ii) No A-WSA 93.24 ± .15 95.62 ± .54
(iii) No A-PMerge 91.67 ± .10 94.62 ± .11
SwinV2-T 43.86 88.16 44.26 93.23
A-SwinV2-T (Ours) 44.42 100 46.11 93.59 Swin-T (Default) 90.15 ± .18 83.30 ± .61
Table 5. Semantic segmentation performance: Segmentation Table 6. Ablation study: Effect of our shift-equivariant ViT mod-
accuracy and shift consistency of our adaptive UperNet model ules on classification accuracy and shift consistency. Configurations
equipped with A-Swin and A-SwinV2 backbones. progressively evaluated on Swin-T under circular shifts.

port the mean-Average Segmentation Circular Consistency performance and shift consistency, with a notable improve-
(mASCC) which counts how often the predicted pixel la- ment on SwinV2-T. See Fig. 5 for segmentation results.
bels (after shifting back) are identical under two different
circular shifts. Given a dataset D = {I}, mASCC computes 5.7. Ablation study
" H,W We study the impact of each proposed ViT module by remov-
1 X 1 X h
ing them. These include (i) adaptive tokenization, (ii) adap-
E∆1 ,∆2 1
|D| HW u=1,v=1 tive window-based self-attention, and (iii) adaptive patch
I∈D
i
# merging. We conduct the ablations on our A-Swin-T model
−∆1 ∆1 −∆2 ∆2 trained on CIFAR-10 under circular shifts.
S ŷ(S (I))[u, v] = S ŷ(S (I))[u, v] , (26)
Results are reported in Tab. 6. Our full model improves
shift consistency by more than 3.5% over A-Swin-T without
where H, W correspond to the image height and width, and
A-token, while slightly decreasing classification accuracy
[u, v] indexes the class prediction at pixel (u, v).
by 0.27%. The use of A-WSA improves both classification ac-
Results. Tab. 5 shows classification accuracy and shift con-
curacy and shift consistency. Finally, A-PMerge improves
sistency for UperNet segmenters using Swin-T and SwinV2-
classification accuracy by approximately 1.7% and shift con-
T backbones. Following the theory, our adaptive models
sistency by more than 5%. Overall, all the proposed modules
achieve 100% mASCC (perfect shift consistency), while
are necessary to achieve 100% shift consistency.
improving on segmentation accuracy.
5.6. Semantic segmentation under standard shifts 6. Conclusion
Experiment setup. As in the circular shift scenario, mod- We propose a family of ViTs that are truly shift-invariant
els are trained for 160K iterations with a total batch size of and equivariant. We redesigned four ViT modules, includ-
16 using the default data augmentation. To evaluate shift- ing tokenization, self-attention, patch merging, and posi-
equivariance under standard shifts, we report the mean- tional embedding to guarantee circular shift-invariance and
Average Semantic Segmentation Consistency (mASSC), equivariance theoretically. Using these modules, we made
which computes whether the percentage of predicted pixel Swin, SwinV2, CvT, and MViTv2 versions that are truly
labels (after shifting back) remains under two standard shifts. shift-equivariant. When matching the theoretical setup, these
Note, mASSC ignores the boundary pixels in its computation models exhibit 100% circular shift consistency and superior
as standard shifts lead to changes in boundary content. performance compared to baselines. Furthermore, on stan-
Results. Tab. 5 reports results on the standard shift scenario. dard shifts where image boundaries deviate from the theory,
Due to boundary conditions, perfect shift-equivariance is not our proposed models remain more resilient to shifts with
achieved. Nevertheless, our models improve segmentation task performance on par with/exceeding the baselines.

8
References [17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual
learning for image recognition. In Proc. CVPR, 2016.
[1] A. Azulay and Y. Weiss. Why do deep convolutional 2
networks generalize so poorly to small image transfor-
[18] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan,
mations? JMLR, 2019. 2
and M. Shah. Transformers in vision: A survey. ACM
[2] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and
CSUR, 2022. 2
P. Vandergheynst. Geometric deep learning: going
[19] T. N. Kipf and M. Welling. Semi-supervised classifica-
beyond euclidean data. IEEE SPM, 2017. 2
tion with graph convolutional networks. In Proc. ICLR,
[3] A. Chaman and I. Dokmanic. Truly shift-invariant
2017. 2
convolutional neural networks. In Proc. CVPR, 2021.
[20] D. M. Klee, O. Biza, R. Platt, and R. Walters. Image to
2, 5, 7, 19, 21
sphere: Learning equivariant features for efficient pose
[4] T. Cohen and M. Welling. Group equivariant convolu-
prediction. In Proc. ICLR, 2023. 2
tional networks. In Proc. ICML, 2016. 2
[21] R. Kondor, Z. Lin, and S. Trivedi. Clebsch–Gordan
[5] T. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling.
Nets: a fully Fourier space spherical convolutional neu-
Gauge equivariant convolutional networks and the
ral network. In Proc. NeurIPS, 2018. 2
icosahedral CNN. In Proc. ICML, 2019. 2
[22] A. Krizhevsky, G. Hinton, et al. Learning multiple
[6] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling.
layers of features from tiny images. 2009. 1, 5
Spherical CNNs. In Proc. ICLR, 2018. 2
[7] M. Contributors. MMSegmentation: Openmm- [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Im-
lab semantic segmentation toolbox and bench- ageNet classification with deep convolutional neural
mark. https://github.com/open-mmlab/ networks. In Proc. NeurIPS, 2012. 2
mmsegmentation, 2020. 20 [24] Y. Li, C.-Y. Wu, H. Fan, K. Mangalam, B. Xiong, J. Ma-
[8] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Ran- lik, and C. Feichtenhofer. MViTv2: Improved multi-
daugment: Practical automated data augmentation with scale vision transformers for classification and detec-
a reduced search space. In Proc. CVPR workshop, tion. In Proc. CVPR, 2022. 1, 5
2020. 19 [25] M. Lin, Q. Chen, and S. Yan. Network in network.
[9] P. de Haan, M. Weiler, T. Cohen, and M. Welling. arXiv preprint arXiv:1312.4400, 2013. 2
Gauge equivariant mesh CNNs: Anisotropic convo- [26] I.-J. Liu, R. A. Yeh, and A. G. Schwing. PIC: permuta-
lutions on geometric graphs. In Proc. ICLR, 2021. 2 tion invariant critic for multi-agent deep reinforcement
[10] M. Defferrard, X. Bresson, and P. Vandergheynst. Con- learning. In Proc. CORL, 2020. 2
volutional neural networks on graphs with fast local- [27] I.-J. Liu, Z. Ren, R. A. Yeh, and A. G. Schwing. Se-
ized spectral filtering. In Proc. NeurIPS, 2016. 2 mantic tracklets: An object-centric representation for
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and visual multi-agent reinforcement learning. In Proc.
L. Fei-Fei. ImageNet: A large-scale hierarchical image IROS, 2021. 2
database. In Proc. CVPR, 2009. 1, 5 [28] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin,
[12] P. Ding, D. Soselia, T. Armstrong, J. Su, and F. Huang. and B. Guo. Swin transformer: Hierarchical vision
Reviving shift equivariance in vision transformers. transformer using shifted windows. In Proc. ICCV,
arXiv preprint arXiv:2306.07470, 2023. 2 2021. 1, 3, 5, 7, 18
[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weis- [29] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning,
senborn, X. Zhai, T. Unterthiner, M. Dehghani, Y. Cao, Z. Zhang, L. Dong, et al. Swin transformer
M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and v2: Scaling up capacity and resolution. In Proc. CVPR,
N. Houlsby. An image is worth 16x16 words: Trans- 2022. 1, 3, 5, 18
formers for image recognition at scale. In Proc. ICLR, [30] H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman.
2021. 1 Invariant and equivariant graph networks. In Proc.
[14] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, ICLR, 2019. 2
and C. Feichtenhofer. Multiscale vision transformers. [31] H. Maron, O. Litany, G. Chechik, and E. Fetaya. On
In Proc. CVPR, 2021. 1, 18 learning sets of symmetric elements. In Proc. ICML,
[15] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, 2020. 2
Y. Tang, A. Xiao, C. Xu, Y. Xu, et al. A survey on [32] H. Michaeli, T. Michaeli, and D. Soudry. Alias-free
vision transformer. IEEE TPAMI, 2022. 2 convnets: Fractional shift invariance via polynomial
[16] J. Hartford, D. Graham, K. Leyton-Brown, and S. Ra- activations. In Proc. CVPR, 2023. 2
vanbakhsh. Deep models of interactions across sets. In [33] C. Morris, G. Rattan, S. Kiefer, and S. Ravanbakhsh.
Proc. ICML, 2018. 2 SpeqNets: Sparsity-aware permutation-equivariant

9
graph networks. In K. Chaudhuri, S. Jegelka, L. Song, [50] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun. Unified
C. Szepesvari, G. Niu, and S. Sabato, editors, Proc. perceptual parsing for scene understanding. In Proc.
ICML, 2022. 2 ECCV, 2018. 5, 7
[34] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: [51] R. A. Yeh, Y.-T. Hu, and A. Schwing. Chirality nets
Deep learning on point sets for 3D classification and for human pose regression. In Proc. NeurIPS, 2019. 2
segmentation. In Proc. CVPR, 2017. 2 [52] R. A. Yeh, A. G. Schwing, J. Huang, and K. Murphy.
[35] S. Ravanbakhsh, J. Schneider, and B. Póczos. Equivari- Diverse generation for multi-agent sports games. In
ance through parameter-sharing. In Proc. ICML, 2017. Proc. CVPR, 2019. 2
2 [53] R. A. Yeh, Y.-T. Hu, M. Hasegawa-Johnson, and
[36] S. Ravanbakhsh, J. Schneider, and B. Poczos. Deep A. Schwing. Equivariance discovery by learned
learning with sets and point clouds. In Proc. ICLR parameter-sharing. In Proc. AISTATS, 2022. 2
workshop, 2017. 2 [54] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R.
[37] R. A. Rojas-Gomez, T. Y. Lim, A. G. Schwing, M. N. Salakhutdinov, and A. J. Smola. Deep sets. In Proc.
Do, and R. A. Yeh. Learnable polyphase sampling for NeurIPS, 2017. 2
shift invariant and equivariant convolutional networks. [55] R. Zhang. Making convolutional networks shift-
In Proc. NeurIPS, 2022. 2, 5, 19 invariant again. In Proc. ICML, 2019. 2, 6
[38] D. Romero, E. Bekkers, J. Tomczak, and M. Hoogen- [56] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and
doorn. Attentive group equivariant convolutional net- A. Torralba. Scene parsing through ADE20K dataset.
works. In Proc. ICML, 2020. 2 In Proc. CVPR, 2017. 1, 7
[39] D. W. Romero and S. Lohit. Learning partial equivari- [57] X. Zou, F. Xiao, Z. Yu, and Y. J. Lee. Delving deeper
ances from data. In Proc. NeurIPS, 2022. 2 into anti-aliasing in convnets. In Proc. BMVC, 2020. 2
[40] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-
C. Chen. MobileNetV2: Inverted residuals and linear
bottlenecks. In Proc. CVPR, 2018. 2
[41] M. Shakerinava and S. Ravanbakhsh. Equivariant net-
works for pixelized spheres. In Proc. ICML, 2021.
2
[42] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega,
and P. Vandergheynst. The emerging field of signal
processing on graphs: Extending high-dimensional data
analysis to networks and other irregular domains. IEEE
SPM, 2013. 2
[43] K. Simonyan and A. Zisserman. Very deep convolu-
tional networks for large-scale image recognition. In
Proc. ICLR, 2015. 2
[44] T. van der Ouderaa, D. W. Romero, and M. van der
Wilk. Relaxing equivariance constraints with non-
stationary continuous filters. In Proc. NeurIPS, 2022.
2
[45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.
Attention is all you need. In Proc. NeurIPS, 2017. 1
[46] S. R. Venkataraman, S. Balasubramanian, and R. R.
Sarma. Building deep equivariant capsule networks. In
Proc. ICLR, 2020. 2
[47] M. Vetterli, J. Kovačević, and V. K. Goyal. Founda-
tions of signal processing. Cambridge University Press,
2014. 2
[48] M. Weiler and G. Cesa. General E(2)-equivariant steer-
able CNNs. In Proc. NeurIPS, 2019. 2
[49] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan,
and L. Zhang. CvT: Introducing convolutions to vision
transformers. In Proc. CVPR, 2021. 1, 5

10
Appendix
The appendix is organized as follows:
• Sec. A1 provides the complete proofs for all the claims stated in the main paper.
• Sec. A2 includes additional implementation, runtime details, and memory consumption.
• Sec. A3 fully describes the augmentation settings used for image classification and semantic segmentation.
• Sec. A4 provides additional semantic segmentation qualitative results.
• Sec. A5 reports additional image classification results covering:
a. The robustness of our proposed models to out-of-distribution images (Sec. A5.1).
b. The sensitivity of our ViT models to input shifts of different magnitudes (Sec. A5.2).
c. The use of our proposed adaptive modules on pre-trained ViTs (Sec. A5.3).

A1. Complete proofs


A1.1. Proof of Lemma 1

Lemma 1. L-periodic shift-equivariance of tokenization.


Let input x ∈ RN have a token representation X (m) E ∈ R⌊N/L⌋×D . If x̂ = SN x (a shifted input), then its token
representation X̂ (m) E corresponds to:
⌊(m+1)/L⌋
X̂ (m) E = S⌊N/L⌋ X ((m+1) mod L) E (15)

This implies that x and x̂ are characterized by the same L token representations, up to a circular shift along the token index
(row index of X ((m+1) mod L) E).

Proof. By definition,
m+1
X̂ (m) = reshape(SN
m
x̂) = reshape(SN x) (A27)
h i
(m)
Let the input patches be expressed as X (m) = r0(m) ...
(m)
rL−1 ∈ R⌊N/L⌋×L , where rk ∈ R⌊N/L⌋ is comprised by the
k th element of every input patch.
(m) m
rk [n] = (SN x)[Ln + k] = x[(Ln + k + m) mod N ]. (A28)
(m)
More precisely, given a circularly shifted input with offset m, rk [n] represents the k th element of the nth patch. Based on
this, X̂ (m) ∈ R⌊N/L⌋×L can be expressed as:
h i
X̂ (m) = reshape(SN m+1
x) = r0(m+1) . . . rL−1(m+1)
, (A29)
(m+1)
with rk [n] = x[(Ln + k + m + 1) mod N ]. (A30)

Expressed in terms of its quotient and remainder for divisor L, m + 1 = ⌊ m+1


L ⌋L + (m + 1) mod L. Then, r̂ corresponds to:
    
(m) m+1
r̂k [n] = x L n + ⌊ ⌋ + (m + 1) mod L + k mod N (A31)
L

(m) ⌊ m+1 ⌋ ((m+1) mod L)


It follows that r̂k = S⌊ N L⌋ rk , which implies
L

⌊ m+1 ⌋
X̂ (m) E = S⌊ N L⌋ X ((m+1) mod L) E. (A32)
L

11
A1.2. Proof of Claim 1
Claim 1. Shift-equivariance of adaptive tokenization.
If F in Eq. (14) is shift-invariant, then A-token is shift-equivariant, i.e., ∃ mq ∈ {0, . . . , L − 1} s.t.
 mq
A-token SN x = S⌊N/L⌋ A-token(x). (16)

Proof. Let x̂ = SN x ∈ RN be a circularly shifted version of input x ∈ RN . From Lemma 1, their token representations
satisfy:
⌊ m+1 ⌋
X̂ (m) E = S⌊ N L⌋ X ((m+1) mod L) E. (A33)
L

Based on the selection criterion of A-token and Eq. (A33):


 m+1 
⌊ ⌋
max F (X̂ (m) E) = F S⌊ N L⌋ X ((m+1) mod L) E
max (A34)
m∈{0,...,L−1} m∈{0,...,L−1} L
 
((m+1) mod L)
= max F X E , (A35)
m∈{0,...,L−1}

where the right-hand side in Eq. (A35) derives from the shift-invariance property of F . Note that for any integer m in the range
{0, . . . , L − 1}, the circular shift (m + 1) mod L also lies within the same range. It follows that:
 
F X ((m+1) mod L) E = F X (m) E .

max max (A36)
m∈{0,...,L−1} m∈{0,...,L−1}

Next, let X̂ (m̂) E = A-token(x̂). Then, from Lemma 1:

max F X̂ (m) E) = F X̂ (m̂) E) = F X ((m̂+1) mod L) E) (A37)


m∈{0,...,L−1}

Let X (m ) E = A-token(x). Then, From Eq. (A36) and Eq. (A37):
max F X (m) E) = F X ((m̂+1) mod L) E), (A38)
m∈{0,...,L−1}

which implies that m⋆ = (m̂ + 1) mod L. Finally, from Lemma 1 and Eq. (A38):
(m̂+1) (m̂+1)
⌊ ⌋ ⌊ ⌋
A-token(SN x) = S⌊ N ⌋L X ((m̂+1) mod L) E = S⌊ N ⌋L A-token(x). (A39)
L L

A1.3. Proof of Claim 2


Claim 2. If G in Eq. (19) is shift invariant, then A-WSA is shift-equivariant.

Proof. Let T ∈ RM and T̂ = SM T ∈ RM denote two token representations related by a circular shift. From the definition
(m)
of vW in Eq. (17), let v ∈ RM denote the average ℓp -norm (energy) of each group of W neighboring tokens in T . More
precisely, the k-th component of the energy vector v, denoted as v[k], is the energy of the window comprised by W neighboring
tokens starting at index k.
W −1
1 X
v[k] = ∥T(k+l) mod M ∥p (A40)
W
l=0

(m) M (m) M
Similarly, following Eq. (18), let vW ∈ R⌊ W ⌋ and v̂W ∈ R⌊ W ⌋ denote the energy vectors of non-overlapping windows
obtained after shifting T and T̂ by m indices, respectively. Then, maximizers m⋆ and m̂ satisfy
(m)  (m) 
m⋆ = arg max G vW , m̂ = arg max G v̂W (A41)
m∈{0,...,W −1} m∈{0,...,W −1}

12
In what follows, we prove that their adaptive window-based self-attention outputs are related by a circular shift that is a
multiple of the window size W , i.e., there exists m0 ∈ Z such that
m0 W
A-WSA(SM T ) = SM A-WSA(T ). (A42)

Given the shifted input token representation T̂ = SM T , the energy of each group of W neighboring tokens corresponds to
W −1 W −1
1 X 1 X
v̂[k] = ∥(SM T )(k+l) mod M ∥p = ∥T(k+l+1) mod M ∥p (A43)
W W
l=0 l=0
= v[k + 1] (A44)
(m)
which implies v̂ = SM v. Then, v̂W can be expressed as
(m)
v̂W [k] = v̂[W k + m] = v[W k + m + 1]. (A45)
Expressing m + 1 in terms of its quotient and remainder for divisor W
 
(m) m+1 
v̂M [k] = v W k + ⌊ ⌋ + (m + 1) mod W (A46)
W
⌊ m+1 ⌋ ((m+1) mod W )
= S⌊ MW⌋ vW . (A47)
W

Based on the shift-invariant property of G and the fact that (m + 1) mod W ∈ {0, . . . , W − 1}, A-WSA selection criterion
corresponds to
(m)  ⌊ m+1 ⌋ ((m+1) mod W ) 
max G v̂W = max G S⌊ MW⌋ vW (A48)
m∈{0,...,W −1} m∈{0,...,W −1} W

((m+1) mod W ) 
= max G vW (A49)
m∈{0,...,W −1}
(m) 
= max G vW . (A50)
m∈{0,...,W −1}

Then, from m̂ in Eq. (A41)


 
(m)  (m̂)  ((m̂+1) mod W )
max G v̂W = G v̂W = G vW (A51)
m∈{0,...,W −1}

which implies m⋆ = (m̂ + 1) mod W . It follows that the adaptive self-attention of T̂ can be expressed as:
m̂ m̂+1
A-WSA(T̂ ) = WSA(SM T̂ ) = WSA(SM T) (A52)

 
⌊(m̂+1)/W ⌋W +m
= WSA SM T (A53)
 
⌊(m̂+1)/W ⌋W m⋆
= WSA SM SM T . (A54)
⌊(m̂+1)/W ⌋W
Since SM corresponds to a circular shift by a multiple of the window size W
⌊(m̂+1)/W ⌋W ⋆
m
A-WSA(T̂ ) = SM WSA(SM T) (A55)
where the right-hand side of Eq. (A55) stems from the fact that, for WSA with a window size W , circularly shifting the input
tokens by a multiple of W results in an identical circular shift of the output tokens. Finally, from the definition of adaptive
self-attention in Eq. (19)
m0 W
A-WSA(T̂ ) = SM A-WSA(T ), m0 = ⌊(m̂ + 1)/W ⌋. (A56)

Claim 2 shows that A-WSA induces an offset between token representations that is a multiple of the window size W .
As a result, windows are comprised by the same tokens, despite circular shifts. This way, A-WSA guarantees that an input
token representation and its circularly shifted version are split into the same token windows, leading to a circularly shifted
self-attention output.

13
A1.4. Proof of Claim 3

Claim 3. PMerge corresponds to a strided convolution with D̃ output channels, striding P and kernel size P .

Proof. Let the input tokens be expressed as T = t0 . . . tD−1 ∈ RM ×D , where tj ∈ RM is comprised by the j th
 
th D
 of every inputtoken. This implies that, given the l token Tl ∈ R , Tl [j] = tj [l]. Then, the patch merging output
element
Z = z0 . . . zD̃−1 can be expressed as
M
Z = PMerge(T ) = D(P ) (Y ) ∈ R P ×D̃ , (A57)
M
where D(P ) ∈ R P ×M is a downsampling operator of factor P , and Y = y0 yD̃−1 ∈ RM ×D̃ is the output of a
 
...
convolution sum with kernel size P and D̃ output channels
D−1
X
yk = tj ⊛ h(k,j) ∈ RM (A58)
j=0

h i⊤
where ⊛ denotes circular convolution and h(k,j) = h(k,j)
0 ...
(k,j)
hP −1 ∈ RP denotes a convolutional kernel. Note that
the convolution sum in Eq. (A58) involves D · D̃ kernels, given that k ∈ {0, . . . , D̃ − 1} and j ∈ {0, . . . , D − 1}. Without
loss of generality, assume the number of input tokens M is divisible by the patch length P . Then, due to the linearity of the
M
downsampling operator, zk ∈ R P corresponds to
D−1
X  
zk = D(P ) tj ⊛ h(k,j) . (A59)
j=0

Let H (k,j) ∈ RM ×M be a convolutional matrix representing kernel h(k,j)


 (k,j) (k,j)

h0 ··· hP −1
(k,j) (k,j) 
h0 ··· hP −1 

H (k,j) = 

. (A60)
..
.
 
 
(k,j) (k,j)
··· hP −1 ··· h0

Then, note that each summation term in Eq. (A59) can be expressed as a matrix-vector multiplication
D−1
X
zk = H̃ (k,j) tj (A61)
j=0

M
where H̃ (k,j) = D(P ) H (k,j) ∈ R P ×M corresponds to the convolutional matrix H (k,j) downsampled by a factor P along
the row index
 (k,j) (k,j)

h0 · · · hP −1
(k,j) (k,j)
h0 · · · hP −1
 
(k,j)
 
H̃ = 

..
. (A62)
.

 
(k,j) (k,j)
h0 · · · hP −1

Based on this, the convolution sum in Eq. (A59) can be expressed in matrix-vector form as
 
t0
zk = H̃ (k,0) · · · H̃ (k,D−1)  ...  .
 
(A63)

tD−1

14
Then, from the patch representation of T in Eq. (10), zk can be alternatively expressed as

zk = T̃ ẽk (A64)

ẽD̃−1 ∈ RP D×D̃ . Let vec(T̄P(k) ) ∈ RP D operate in a column-wise manner:


 
where ẽk is the k th column of Ẽ = ẽ0 ···
(k)  ⊤ 
vec(T̄P ) = vec TP k . . . TP (k+1)−1 (A65)
 ⊤
= TP k [0] · · · TP (k+1)−1 [0] TP k [1] · · · TP (k+1)−1 [1] · · · TP (k+1)−1 [D − 1] (A66)
 ⊤
= t0 [P k] · · · t0 [P (k + 1) − 1] t1 [P k] · · · t1 [P (k + 1) − 1] · · · tD−1 [P (k + 1) − 1] . (A67)

Finally, from Eq. (A63), ẽk is equivalent to

ẽk = h(k,0) ; . . . ; h(k,D−1) ∈ RDP


 
(A68)

Claim 3 shows that the original patch merging function PMerge is equivalent to projecting all M overlapping patches of
length P , which can be expressed as a circular convolution, followed by keeping only M P of the resulting tokens. Note that the
selection of such tokens is done via a downsampling operation of factor P , which is not the only way of selecting them. In fact,
there are P different token representations that can be selected, as explained in Sec. 4.
Based on this observation, the proposed A-PMerge introduced in Sec. 4 takes advantage of the polyphase decomposition
to select the token representation in a data-dependent fashion, leading to a shift-equivariant patch merging scheme.

A2. Additional implementation details


We have attached our code in the supplementary materials. We now provide an illustration of how to use our implementation
and verify that the proposed modules are indeed truly shift-equivariant.
A2.1. A-pmerge usage
We show a toy example of how to use the adaptive patch merging layer (A-pmerge) by building a simple image classifier.
The model is comprised by an A-pmerge layer, which includes a linear embedding, followed by a global average pooling
layer and finally a linear classification head. We empirically show that this simple model is shift-invariant.

# A-pmerge based classifier


class ApmergeClassifier(nn.Module):
def __init__(
self,stride,input_resolution,
dim,num_classes=4,conv_padding_mode='circular'
):
super().__init__()
self.stride=stride
# Pooling layer for A-pmerge
pool_layer = partial(
PolyphaseInvariantDown2D,
component_selection=max_p_norm,
antialias_layer=None,
)
# A-pmerge
# No adaptive window selection
# for illustration purposes
self.apmerge = AdaptivePatchMerging(
input_resolution=input_resolution,
dim=dim,
pool_layer=pool_layer,

15
conv_padding_mode=conv_padding_mode,
stride=stride,
window_selection=None,
window_size=None,
)
# Global pooling and head
self.avgpool=nn.AdaptiveAvgPool2d((1,1))
self.fc=nn.Linear(dim*stride,num_classes)

def forward(self,x):
# Reshape
B,C,H,W = x.shape
x = x.permute(0,2,3,1).reshape(B,H*W,C)
# Adaptive patch merge
x = self.apmerge(x)
# Reshape back
x = x.reshape(
B,H//self.stride,W//self.stride,C*self.stride,
).permute(0,3,1,2)
# Global average pooling
x = torch.flatten(self.avgpool(x),1)
# Classification head
x = self.fc(x)
return x

# Input tokens
B,C,H,W = 1,3,8,8
stride = 2
x = torch.randn(B,C,H,W).cuda().double()
# Shifted input
shift = torch.randint(-3,3,(2,))
x_shift = torch.roll(input=x,dims=(2,3),shifts=(shift[0],shift[1]))
# A-pmerge classifier
model = ApmergeClassifier(
stride=stride,
input_resolution=(H,W),
dim=C).cuda().double().eval()
# Predict
y = model(x)
y_shift = model(x_shift)
err = torch.norm(y-y_shift)
assert(torch.allclose(y,y_shift))
# Check circularly shift invariance
print('y: {}'.format(y))
print('y_shift: {}'.format(y_shift))
print("error: {}".format(err))

Out:
y: tensor([[-0.1878,  0.2348,  0.0982, -0.2191]], device='cuda:0',
   dtype=torch.float64, grad_fn=<AddmmBackward0>)
y_shift: tensor([[-0.1878,  0.2348,  0.0982, -0.2191]], device='cuda:0',
   dtype=torch.float64, grad_fn=<AddmmBackward0>)
error: 0.0

A2.2. Simple encoder-decoder usage


To illustrate the use of our adaptive layers for semantic segmentation, we show how the optimal indices returned by A-PMerge
and the optimal offset returned by A-WSA can be used for unpooling. We build a simple encoder-decoder based on an
A-PMerge layer equipped with A-WSA (implemented as poly_win in the example code) to encode input tokens, followed
by an unpooling module that places features back into their original high-resolution positions.
This example depicts the strategy used by our proposed semantic segmentation models, where backbone indices are passed
to the segmentation head to upscale feature maps in a consistent fashion, leading to a shift-equivariant model.

# Simple encoder-decoder
# (get_pool_method, get_unpool_method, set_unpool, unroll_unpool_multistage,
#  AdaptivePatchMerging, and poly_win are provided in the attached code.)
import torch
import torch.nn as nn


class EncDec(nn.Module):
    def __init__(self, dims, stride, win_sel, win_sz, pad):
        super().__init__()
        self.pad = pad
        self.stride = stride
        self.win_sz = win_sz
        # A-pmerge equipped with A-WSA
        pool_layer = get_pool_method(name='max_2_norm', antialias_mode='skip')
        self.apmerge = AdaptivePatchMerging(
            dim=dims,
            pool_layer=pool_layer,
            conv_padding_mode=pad,
            stride=stride,
            window_size=win_sz,
            window_selection=win_sel,
        )
        # Unpool
        unpool_layer = get_unpool_method(unpool=True, pool_method='max_2_norm',
                                         antialias_mode='skip')
        self.unpool = set_unpool(unpool_layer=unpool_layer, stride=stride,
                                 p_ch=dims)

    def forward(self, x, hw_shape):
        B, C, H, W = x.shape
        # Reshape (B, C, H, W) -> (B, H*W, C) token layout
        x = x.permute(0, 2, 3, 1).reshape(B, H*W, C)
        # A-pmerge + A-WSA, returning the selected indices
        x_w, hw_shape, idx = self.apmerge(x=x, hw_shape=hw_shape, ret_indices=True)
        # Keep merging and windowing indices
        idx = idx[:3]
        # Reshape back to (B, C', H', W')
        x_w = x_w.reshape(
            B, H//self.stride, W//self.stride, C*self.stride).permute(0, 3, 1, 2)
        # Unroll and unpool features back to their original positions
        y = unroll_unpool_multistage(
            x=x_w,
            scale_factor=[self.stride],
            unpool_layer=self.unpool,
            idx=[idx],
            unpool_winsel_roll=self.pad,
        )
        return y

# Input
B, C, H, W = 1, 3, 28, 28
shift_max = 6
x = torch.randn(B, C, H, W).cuda().double()
# Offsets
s01 = torch.randint(low=-shift_max, high=shift_max, size=(1, 2)).tolist()[0]
s02 = torch.randint(low=-shift_max, high=shift_max, size=(1, 2)).tolist()[0]
s03 = [s02[0]-s01[0], s02[1]-s01[1]]
# Shifted inputs
x01 = torch.roll(x, shifts=s01, dims=(-1, -2))
x02 = torch.roll(x, shifts=s02, dims=(-1, -2))
# Build encoder-decoder
# Use A-WSA (poly_win)
model = EncDec(dims=3, stride=2, pad='circular', win_sz=7, win_sel=poly_win,
               ).cuda().double().eval()
# Predictions
y01 = model(x01, hw_shape=(H, W))
y02 = model(x02, hw_shape=(H, W))
# Shift y01 by the relative offset to compare against y02
z = torch.roll(y01, shifts=s03, dims=(-1, -2))
err = torch.norm(z-y02)
assert torch.allclose(z, y02)
print("torch.norm(z-y02): {}".format(err))

Out:
torch.norm(z-y02): 0.0

A2.3. Memory Consumption

We compare the computational requirements of our adaptive Vision Transformer models and their default versions in terms of
training and inference GPU memory. We report memory consumption on a single NVIDIA Quadro RTX 5000 GPU, where
both training and inference are performed with batch size 64 and the default image size of each model.
Following previous work [14, 28, 29], we compare the allocated GPU memory required by each model via PyTorch's
torch.cuda.max_memory_allocated() command. This measures the maximum GPU memory occupied by tensors
since the beginning of the executed program. Additionally, we report the maximum reserved GPU memory, as assigned by the
CUDA memory allocator, using the torch.cuda.max_memory_reserved() command. This measures the maximum
memory occupied by tensors plus the extra memory reserved to speed up allocations. Memory is reported in
mebibytes (1 MiB = 2^20 bytes). We also report the relative change, i.e., the memory increase with respect to
the default models.
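
For reference, a minimal sketch of this measurement protocol (the model and input below are placeholders; only the standard PyTorch memory counters are used):

import torch

def peak_memory_mib(model, batch, train=True):
    """Peak allocated / reserved CUDA memory (MiB) for one forward
    (and, if train=True, backward) pass."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    if train:
        model.train()
        model(batch).sum().backward()   # include backward-pass buffers
    else:
        model.eval()
        with torch.no_grad():
            model(batch)
    torch.cuda.synchronize()
    return (torch.cuda.max_memory_allocated() / 2**20,
            torch.cuda.max_memory_reserved() / 2**20)

# e.g., batch size 64 at the model's default image size:
# batch = torch.randn(64, 3, 224, 224, device='cuda')
# print(peak_memory_mib(model, batch, train=True))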
Tab. A1 shows the memory required for training and inference for each of our adaptive models and their default versions.
During training, our adaptive framework increases the allocated memory by at most 16% (MViTv2), while the remaining models
increase it by less than 8%. The same trend holds for the reserved memory, which increases by at most 15%. During inference,
only our adaptive MViTv2 model increases the allocated memory (by 11%), while the rest of our adaptive models do not increase
it at all. In terms of reserved memory, our adaptive models increase the requirement by at most 2%, and in some cases the
reserved memory decreases by 3%.

18
                     Training                                            Inference
Model                Max. Allocated (MiB)   Max. Reserved (MiB)          Max. Allocated (MiB)   Max. Reserved (MiB)
Swin-T               4,935                  5,380                        1,012                  1,352
A-Swin-T (Ours)      5,178 (+4.92%)         5,678 (+5.54%)               1,012 (0%)             1,382 (+2.22%)
SwinV2-T             7,712                  8,016                        1,275                  1,544
A-SwinV2-T (Ours)    8,113 (+5.2%)          8,710 (+8.66%)               1,275 (0%)             1,534 (−0.65%)
CvT-13               5,794                  6,044                        1,587                  1,850
A-CvT-13 (Ours)      6,257 (+7.99%)         6,528 (+8.01%)               1,587 (0%)             1,780 (−3.78%)
MViTv2-T             6,335                  7,352                        1,944                  3,356
A-MViTv2-T (Ours)    7,364 (+16.24%)        8,456 (+15.02%)              2,164 (+11.32%)        3,378 (+0.67%)
Table A1. Memory Consumption: Maximum allocated and reserved memory required by our adaptive ViTs and their default versions.
Training and inference memory consumption is measured on a single NVIDIA Quadro RTX 5000 GPU (batch size 64, default image size
per model) and reported in mebibytes (MiB). The relative change with respect to the default models (%) is shown in parentheses.

                     Train Set                                                       Test Set
Size Preprocessing   (i) Resize (Swin, MViTv2, CvT: 256 × 256; SwinV2: 292 × 292)
                     (ii) Center Crop (Swin, MViTv2, CvT: 224 × 224; SwinV2: 256 × 256)
Augmentation         (i) Random Horizontal Flipping                                  Normalization
                     (ii) Normalization

Table A2. Image Classification preprocessing (circular shifts). Data preprocessing and augmentation used to train and test all default and
adaptive ViTs under circular shifts.

Overall, the training memory consumption marginally increases with respect to the default models, while the inference
memory consumption remains almost unaffected.

A3. Additional experimental details


A3.1. Image classification
Data pre-processing (ImageNet circular shifts, CIFAR10/100). Similarly to previous work on shift-equivariant CNNs
[3, 37], to highlight the fact that perfect shift equivariance is imposed by design and not induced during training, no shift or
resizing augmentation is applied. Tab. A2 includes the data pre-processing used in our ImageNet experiments under circular
shifts, and in all our CIFAR10/100 experiments. As explained in Sec. 5.2, input images are transformed to match the default
image size used by each model. This process is denoted as Size pre-processing in the table.
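For concreteness, the pipeline in Tab. A2 can be sketched with torchvision transforms as follows (the normalization statistics are assumed ImageNet values; sizes shown for Swin, MViTv2, and CvT, with SwinV2 using 292/256 instead):

import torchvision.transforms as T

MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)  # assumed ImageNet stats

train_tf = T.Compose([
    T.Resize((256, 256)),          # SwinV2: (292, 292)
    T.CenterCrop(224),             # SwinV2: 256
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])
test_tf = T.Compose([
    T.Resize((256, 256)),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(MEAN, STD),
])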
Data pre-processing (ImageNet standard shifts). In contrast to the circular shift scenario, each model uses its default data
augmentation configuration in the ImageNet standard shift scenario. Tables A3, A4, and A5 include the augmentation details
for A-Swin/A-SwinV2, A-MViTv2, and A-CvT, respectively. As in the circular shift case, all models use a data pre-processing
step to change the input image size.
Note that all four models use RandAugment [8] as their data augmentation pipeline, which includes: AutoContrast,
Equalize, Invert, Rotate, Posterize, Solarize, ColorIncreasing, ContrastIncreasing,
BrightnessIncreasing, SharpnessIncreasing, Shear-x, Shear-y, Translate-x, and Translate-y.
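For reference, these settings roughly correspond to timm's RandAugment configuration; a hedged sketch (the configuration string and erasing probability below are assumptions for illustration; the exact values follow each model's official implementation):

from timm.data import create_transform

# Swin/SwinV2/CvT-style training transform (Tabs. A3, A5): RandAugment with
# magnitude 9, 2 ops per image, magnitude std 0.5, increasing severity.
# MViTv2 (Tab. A4) would instead use 'rand-m10-n6-mstd0.5-inc1'.
train_tf = create_transform(
    input_size=224,
    is_training=True,
    auto_augment='rand-m9-mstd0.5-inc1',
    re_prob=0.25,                  # RandomErasing probability (assumed)
    interpolation='bicubic',
)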
Computational settings. CIFAR10/100 classification experiments using all four models were trained on two NVIDIA Quadro
RTX 5000 GPUs with a batch size of 48 images for 100 epochs. On the other hand, ImageNet classification experiments were
trained on eight NVIDIA A100 GPUs, where all models used their default effective batch size and numerical precision for 300
epochs.

A3.2. Semantic Segmentation


Data pre-processing (Circular shifts). For Swin+UperNet, both the baseline and adaptive (A-Swin) models were trained
by resizing input images to 1,792 × 448, followed by a random crop of size 448 × 448. Next, the default data
augmentation of the Swin + UperNet model is applied. During testing, input images are resized to 1,792 × 448 and
each dimension is then rounded up to the next multiple of 224. Tab. A6 describes the pre-processing and data augmentation
pipelines, and a sketch of the rounding step is given below.
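The test-time rounding step can be sketched as follows (an illustrative tensor-level helper; the actual pipeline resizes images within MMSeg rather than tensors):

import math
import torch.nn.functional as F

def round_up_resize(x, base=224):
    """Resize x (B, C, H, W) so that each spatial dimension is rounded up
    to the next multiple of `base`."""
    H, W = x.shape[-2:]
    new_hw = (math.ceil(H / base) * base, math.ceil(W / base) * base)
    return F.interpolate(x, size=new_hw, mode='bilinear', align_corners=False)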

                     Train Set                                           Test Set
Size Preprocessing   (i) Resize (Swin: 256 × 256, SwinV2: 292 × 292)
                     (ii) Center Crop (Swin: 224 × 224, SwinV2: 256 × 256)
Augmentation         (i) Random Horizontal Flipping                      Normalization
                     (ii) RandAugment
                          (magnitude: 9,
                           increase severity: True,
                           augmentations per image: 2,
                           standard deviation: 0.5)
                     (iii) Normalization
                     (iv) RandomErasing

Table A3. Swin / SwinV2 Image Classification preprocessing (standard shifts). Data preprocessing and augmentation used to train and
test the default (Swin, SwinV2) and adaptive (A-Swin, A-SwinV2) models.

                     Train Set                                           Test Set
Size Preprocessing   (i) Resize (256 × 256)
                     (ii) Center Crop (224 × 224)
Augmentation         (i) Random Horizontal Flipping                      Normalization
                     (ii) RandAugment
                          (magnitude: 10,
                           increase severity: True,
                           augmentations per image: 6,
                           standard deviation: 0.5)
                     (iii) Normalization
                     (iv) RandomErasing

Table A4. MViTv2 Image Classification preprocessing (standard shifts). Data preprocessing and augmentation used to train and test the
default (MViTv2) and adaptive (A-MViTv2) models.

                     Train Set                                           Test Set
Size Preprocessing   (i) Resize (256 × 256)
                     (ii) Center Crop (224 × 224)
Augmentation         (i) Random Horizontal Flipping                      Normalization
                     (ii) RandAugment
                          (magnitude: 9,
                           increase severity: True,
                           augmentations per image: 2,
                           standard deviation: 0.5)
                     (iii) Normalization
                     (iv) RandomErasing

Table A5. CvT Image Classification preprocessing (standard shifts). Data preprocessing and augmentation used to train and test the
default (CvT) and adaptive (A-CvT) models.

For SwinV2+UperNet, both the baseline and the adaptive (A-SwinV2) models were trained by resizing input images to
2,048 × 512, followed by a random crop of size 512 × 512. Following this, the default data augmentation used in the
Swin + UperNet model is adopted. During testing, input images are resized to 2,048 × 512 and each dimension is then
rounded up to the next multiple of 256. Tab. A7 describes the pre-processing and data augmentation pipelines.
Data pre-processing (Standard Shifts). For the Swin+UperNet baseline and adaptive models, the default pre-processing
pipeline from the official MMseg implementation [7] was used. The same pipeline was used to evaluate the SwinV2+UperNet
baseline and adaptive models. The pre-processing pipeline is detailed in Tab. A8.
Computational settings. ADE20K semantic segmentation experiments using Swin and SwinV2 models, both adaptive and
baseline architectures, were trained on four NVIDIA A100 GPUs with an effective batch size of 16 images for 160,000 iterations.

                     Train Set                           Test Set
Size Preprocessing   (i) Resize: 1,792 × 448             (i) Resize: 1,792 × 448
                     (ii) Random Crop: 448 × 448         (ii) Resize to next multiple of 224
Augmentation         (i) Random Horizontal Flipping      Normalization
                     (ii) Photometric Distortion
                     (iii) Normalization

Table A6. Swin + UperNet Semantic Segmentation preprocessing (circular shifts). Data preprocessing and augmentation used to train
and test the default (Swin+UperNet) and adaptive (A-Swin+UperNet) models.

                     Train Set                           Test Set
Size Preprocessing   (i) Resize: 2,048 × 512             (i) Resize: 2,048 × 512
                     (ii) Random Crop: 512 × 512         (ii) Resize to next multiple of 256
Augmentation         (i) Random Horizontal Flipping      Normalization
                     (ii) Photometric Distortion
                     (iii) Normalization

Table A7. SwinV2 + UperNet Semantic Segmentation preprocessing (circular shifts). Data preprocessing and augmentation used to
train and test the default (SwinV2+UperNet) and adaptive (A-SwinV2+UperNet) models.

                     Train Set                           Test Set
Size Preprocessing   (i) Resize: 2,048 × 512             Resize: 2,048 × 512
                     (ii) Random Crop: 512 × 512
Augmentation         (i) Random Horizontal Flipping      Normalization
                     (ii) Photometric Distortion
                     (iii) Normalization

Table A8. Semantic Segmentation preprocessing (standard shifts). Data preprocessing and augmentation used to train and test the default
and adaptive versions of both the Swin+UperNet and SwinV2+UperNet models.

These settings are consistent with the official MMSeg configuration for the Swin+UperNet model.

A4. Additional semantic segmentation results


Fig. A1 and A2 show examples of semantic segmentation masks predicted by our A-Swin+UperNet and A-SwinV2+UperNet
models, respectively. The illustrations include the masks obtained with their corresponding baselines, showing the improved
robustness of our models against input shifts.

A5. Additional image classification experiments


A5.1. Robustness to out-of-distribution images
We evaluate the image classification performance of our adaptive models on out-of-distribution inputs. Precisely, we measure
the robustness of our proposed A-Swin-T model to images with randomly erased patches and vertically flipped images.
Experiment Setup. Following the evaluation protocol of Chaman and Dokmanic [3], we compare the performance of three
versions of Swin-T: (i) its default model, (ii) our proposed adaptive model (A-Swin-T), and (iii) the default Swin-T architecture
trained on circularly shifted images, denoted as Swin-T DA (data augmented).
All three models are trained on CIFAR-10. Swin-T and A-Swin-T use the default CIFAR-10 data pre-processing
(refer to Sec. A3), while Swin-T DA uses the default pre-processing plus circular shifts with offsets uniformly selected between
−224 and 224 pixels. Since CIFAR-10 samples are resized from 32 × 32 to 224 × 224 pixels during pre-processing, these
offsets effectively correspond to circular shifts between −32 and 32 pixels at the original resolution.
During testing, the side of the erased square patch is randomly selected between 0 and max pixels, where
max ∈ {28, 42, 56, 70}. Since CIFAR-10 samples are rescaled to 224 × 224 pixels, these maximum patch sizes effectively
correspond to {4, 6, 8, 10} pixels at the original 32 × 32 resolution.
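A sketch of this erasing protocol (illustrative; the helper name, zero fill value, and uniform patch location are our assumptions):

import torch

def erase_random_patch(x: torch.Tensor, max_size: int) -> torch.Tensor:
    """Erase one square patch per image, with side sampled uniformly from
    {0, ..., max_size} and a uniformly random location."""
    x = x.clone()
    B, _, H, W = x.shape
    for b in range(B):
        s = int(torch.randint(0, max_size + 1, (1,)))
        if s == 0:
            continue
        top = int(torch.randint(0, H - s + 1, (1,)))
        left = int(torch.randint(0, W - s + 1, (1,)))
        x[b, :, top:top + s, left:left + s] = 0.0
    return x

# e.g., max_size = 56 at 224 x 224 resolution (8 pixels at the original 32 x 32).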
Results. Tab. A9 shows the top-1 classification accuracy and circular shift consistency of the three Swin-T models of interest
on images with randomly erased patches. Across patch sizes, while the default model and Swin-T DA lose at least 1.5% in
shift consistency, our A-Swin-T preserves its perfect shift consistency.

                     max = 0              max = 28             max = 42             max = 56             max = 70
Model                Top-1 Acc.  C-Cons.  Top-1 Acc.  C-Cons.  Top-1 Acc.  C-Cons.  Top-1 Acc.  C-Cons.  Top-1 Acc.  C-Cons.
Swin-T (Default)     90.21       82.69    90.16       81.75    89.71       81.42    89.02       80.4     87.83       79.58
Swin-T DA            92.32       94.3     91.81       94.19    91.6        94.16    90.42       93.01    89.23       92.73
A-Swin-T (Ours)      93.53       100      93.27       100      92.71       100      91.57       100      90.3        100

Table A9. Performance on images with randomly erased patches. Top-1 classification accuracy (%) and shift consistency (%) on
CIFAR-10 test images with randomly erased square patches. max corresponds to the largest possible patch size, as sampled from a uniform
distribution U{0, max}.

                     Unflipped                        Flipped
Model                Top-1 Acc. (%)   C-Cons. (%)     Top-1 Acc. (%)   C-Cons. (%)
Swin-T (Default)     90.21            82.69           50.41            82.24
Swin-T DA            92.32            94.3            51.74            94.67
A-Swin-T (Ours)      93.53            100             52.07            100

Table A10. Performance on flipped images. Top-1 classification accuracy and shift consistency on vertically flipped CIFAR-10 test images.

Model                Offset ∈ {0, ..., 8}   Offset ∈ {0, ..., 16}   Offset ∈ {0, ..., 24}   Offset ∈ {0, ..., 32}
Swin-T               92.00 ± .23            89.65 ± .9              88.93 ± .09             88.13 ± .14
A-Swin-T (Ours)      100                    100                     100                     100
SwinV2-T             92.13 ± .04            90.43 ± .17             89.67 ± .08             88.75 ± .15
A-SwinV2-T (Ours)    100                    100                     100                     100
CvT-13               88.99 ± .1             87.42 ± .16             86.96 ± .1              86.84 ± .06
A-CvT-13 (Ours)      100                    100                     100                     100
MViTv2-T             91.49 ± .04            90.57 ± .11             90.46 ± .08             90.22 ± .1
A-MViTv2-T (Ours)    100                    100                     100                     100

Table A11. Consistency under different shift magnitudes. Shift consistency (%) of our adaptive ViT models under small, medium, large,
and very large shifts. Models trained and evaluated on CIFAR-10 under a circular shift assumption.

Our adaptive model also achieves the best classification accuracy in all cases, improving by more than 1%. This suggests that,
despite not being explicitly trained on this transformation, our A-Swin model is more robust than the default and DA models,
obtaining better accuracy and consistency across scenarios.
On the other hand, Tab. A10 shows the shift consistency and classification accuracy of the three Swin models evaluated
under vertically flipped images. Even in such a challenging case and without any fine-tuning, our A-Swin-T model retains its
perfect shift consistency, improving over the default and data-augmented models by at least 5%. Our adaptive model also
outperforms the default and data-augmented models in terms of classification accuracy, both on flipped and unflipped images,
by a significant margin.

A5.2. Sensitivity to input shifts


To quantify the shift-consistency improvement of our proposed adaptive models over the default ones, we conduct a
fine-grained evaluation across different offset magnitudes.
Experiment setup. We test the shift sensitivity of our four proposed adaptive ViTs on CIFAR-10. Circular shift consistency is
evaluated on images shifted by an offset randomly selected from four different ranges: [0, 56], [0, 112], [0, 168], and [0, 224]
pixels. Note that CIFAR-10 images are resized from 32 × 32 to 224 × 224 pixels. So, the effective intervals correspond to
[0, 8], [0, 16], [0, 24], and [0, 32] pixels, respectively. By gradually increasing the offset magnitude, we can understand the
advantages of our adaptive models over their default versions in a more general manner.
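For reference, a minimal sketch of the circular shift-consistency measurement (helper names are ours; consistency is the fraction of samples whose predicted class agrees under two random circular shifts):

import torch

@torch.no_grad()
def circular_shift_consistency(model, x, max_offset):
    """x: (B, C, H, W) batch; returns the fraction of samples whose top-1
    prediction is identical under two independent random circular shifts."""
    def rand_shift(t):
        dy, dx = torch.randint(0, max_offset + 1, (2,)).tolist()
        return torch.roll(t, shifts=(dy, dx), dims=(2, 3))
    p1 = model(rand_shift(x)).argmax(dim=1)
    p2 = model(rand_shift(x)).argmax(dim=1)
    return (p1 == p2).float().mean().item()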
Results. Tab. A11 shows the circular shift consistency of all four models under different shift magnitudes. While the default
versions monotonically decrease their shift consistency with respect to the offset magnitude, our method shows a perfect shift
consistency across scenarios.
Despite the strong consistency obtained by default ViTs via data augmentation, our adaptive models outperform them by
more than 9% without any fine-tuning. This shows the benefits of our adaptive models, regardless of the shift magnitude.

Model                                        Top-1 Acc. (%)   C-Cons. (%)
CvT-13 (Default)                             90.12            76.54
CvT-13 + Adapt (No Fine-tuning)              57.4             100
CvT-13 + Adapt (10 epoch Fine-tuning)        91.72            100
CvT-13 + Adapt (20 epoch Fine-tuning)        92.35            100
A-CvT-13 (Ours)                              93.87            100

Table A12. Incorporating shift-equivariant modules in pre-trained ViTs. Our shift-equivariant ViT framework allows plugging shift-equivariant
modules into pre-trained models (e.g., CvT-13), improving classification accuracy after a few fine-tuning epochs while preserving perfect
shift consistency. Results shown for CvT-13 trained on CIFAR-10 under a circular shift assumption.

A5.3. Replacing adaptive modules in pre-trained ViTs


We ran additional experiments plugging our proposed adaptive modules into pre-trained default ViTs. Note that, while
replacing default ViT modules with adaptive ones guarantees a truly shift-equivariant model (100% shift consistency), the
classification accuracy is expected to decrease, since the pre-trained weights have not been trained to adaptively select
different token or window representations. Nevertheless, we explore this scenario as a refined initialization strategy and
evaluate its effect on classification accuracy and shift consistency.
Experiment setup. Given a default CvT-13 model trained on CIFAR-10, we replace its default modules with our proposed
adaptive ones. We denote this model as CvT-13 + Adapt. Next, we fine-tune the model for 10 and 20 epochs and evaluate its
top-1 classification accuracy as well as its shift consistency.
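A hedged sketch of this weight-transfer step (the constructor and checkpoint path are placeholders, not the attached code's API):

import torch
import torch.nn as nn

def init_from_default_checkpoint(adaptive_model: nn.Module, ckpt_path: str):
    """Initialize the adaptive model from a default pre-trained checkpoint;
    parameters that do not match keep their original initialization."""
    state = torch.load(ckpt_path, map_location='cpu')
    state = state.get('model', state)   # unwrap common checkpoint formats
    missing, unexpected = adaptive_model.load_state_dict(state, strict=False)
    print(f'missing keys: {len(missing)}, unexpected keys: {len(unexpected)}')
    return adaptive_model

# adaptive_model = ...  # A-CvT-13 architecture (CvT-13 + Adapt)
# init_from_default_checkpoint(adaptive_model, 'cvt13_cifar10_default.pth')  # placeholder path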
Results. Tab. A12 shows the top-1 classification accuracy and shift consistency of CvT-13 + Adapt, as well as of the default
CvT-13 and our proposed A-CvT-13 model. Without any fine-tuning, CvT-13 + Adapt attains perfect shift consistency just
by plugging in our proposed adaptive modules. However, its top-1 classification accuracy decreases by more than 30% with
respect to the default model. After fine-tuning for 10 and 20 epochs, its classification accuracy recovers to 91.72%
and 92.35%, respectively. This implies that CvT-13 + Adapt can outperform the default model in both consistency
and accuracy by plugging in our adaptive components and running a few fine-tuning epochs.
On the other hand, while fine-tuning yields promising results, our adaptive model A-CvT-13, fully trained for
90 epochs using random initialization and the default training settings, still attains the best classification accuracy and shift
consistency.

[Figure A1 layout: Image | Shifted Image | Swin + UperNet Predictions | A-Swin + UperNet Predictions (Ours)]

Figure A1. Swin + UperNet Semantic Segmentation under standard shifts: Semantic segmentation results on the ADE20K dataset
(standard shifts) using Swin backbones. Our A-Swin + UperNet model is more robust to input shifts than the original Swin + UperNet model,
generating consistent predictions while improving accuracy. Examples of prediction changes due to input shifts are boxed in yellow.

[Figure A2 layout: Image | Shifted Image | SwinV2 + UperNet Predictions | A-SwinV2 + UperNet Predictions (Ours)]

Figure A2. SwinV2 + UperNet Semantic Segmentation under standard shifts: Additional semantic segmentation results on the ADE20K
dataset (standard shifts) using SwinV2 backbones. Our adaptive model improves both segmentation accuracy and shift consistency with respect
to the original SwinV2 + UperNet model. Examples of prediction changes due to input shifts are boxed in yellow.
