Shift-Equivariant Vision Transformers
(a) Original tokenization (token). (b) Proposed adaptive tokenization (A-token).
Figure 1. Re-designing ViT’s tokenization towards shift-equivariance: (a) The original patch embedding is sensitive to small input shifts
due to the fixed grid used to split an image into patches. (b) Our adaptive tokenization A-token is a generalization that consistently selects
the group of patches with the highest energy, despite input shifts.
structure. A more in-depth discussion on ViTs can be found in recent surveys [15, 18]. In this work, we re-examine ViTs' modules and present a novel adaptive design that enables truly shift-equivariant ViTs.

Concurrently on arXiv, Ding et al. [12] propose a polyphase anchoring method to obtain circular shift-invariant ViTs for image classification. Differently, our method demonstrates improvements in circular and linear shift consistency for both image classification and semantic segmentation while maintaining competitive task performance.

Invariant and equivariant CNNs. Prior works [1, 55] have shown that modern CNNs [17, 23, 40, 43] are not shift-equivariant due to the usage of pooling layers. To improve shift-equivariance, anti-aliasing [47] is introduced before downsampling. Specifically, Zhang [55] and Zou et al. [57] propose to use a low-pass filter (LPF) for anti-aliasing. While anti-aliasing improves shift-equivariance, the overall CNN remains not truly shift-equivariant. To address this, Chaman and Dokmanic [3] propose Adaptive Polyphase Sampling (APS), which selects the downsampling indices based on the ℓ2 norm of the input's polyphase components. Rojas-Gomez et al. [37] then improve APS by proposing a learnable downsampling module (LPS) that is truly shift-equivariant. Michaeli et al. [32] propose the use of polynomial activations to obtain shift-invariant CNNs. In contrast, we present a family of modules that enables truly shift-equivariant ViTs. We emphasize that CNN methods are not applicable to ViTs due to their distinct architectures.

Beyond the scope of this work, general equivariance [2, 4, 20, 35, 38, 39, 41, 44, 46, 48, 53] has also been studied. Equivariant networks have also been applied to sets [16, 31, 34, 36, 51, 54], graphs [9, 10, 19, 26, 27, 30, 33, 42, 52], spherical images [5, 6, 21], etc.

3. Preliminaries

We review the basics before introducing our approach, including the aspects of current ViTs that break shift-equivariance. For readability, the concepts are described in 1D. In practice, these are extended to multi-channel images.

Equivariance. Conceptually, equivariance describes how a function's input and output are related to each other under a predefined set of transformations. For example, in image segmentation, shift equivariance means that shifting the input image results in shifting the output mask. In our analysis, we consider circular shifts and denote them as

S_N x[n] = x[(n + 1) mod N],  x ∈ R^N,    (1)

to ensure that the shifted signal x remains within its support. Following Rojas-Gomez et al. [37], we say a function f : R^N → R^M is S_N, {S_M, I}-equivariant, i.e., shift-equivariant, iff ∃ S ∈ {S_M, I} s.t.

f(S_N x) = S f(x)  ∀x ∈ R^N,    (2)

where I denotes the identity mapping. This definition carefully handles the case where N > M. For instance, when downsampling by a factor of two, an input shift by one should ideally induce an output shift by 0.5, which is not realizable on the integer grid. This 0.5 has to be rounded up or down, hence a shift S_M or a no-shift I, respectively.

Invariance. For classification, a label remains unchanged when the image is shifted, i.e., it is shift-invariant. A function f : R^N → R^M is S_N, {I}-equivariant (invariant) iff

f(S_N x) = f(x)  ∀x ∈ R^N.    (3)

A common way to design a shift-invariant function under circular shifts is via global spatial pooling [25], defined as g(x) = Σ_m x[m]. Given a shift-equivariant function f:

Σ_m f(S_N x)[m] = Σ_m S f(x)[m] = Σ_m f(x)[m].    (4)

However, note that ViTs using global spatial pooling after extracting features are not shift-invariant, as preceding layers such as tokenization, window-based self-attention, and patch merging are not shift-equivariant, which we review next.
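To make the definitions in Eqs. (1)-(4) concrete before reviewing those modules, here is a minimal NumPy sketch (illustrative only, not the paper's implementation; the moving-average f and the array sizes are assumptions) that checks shift-equivariance of a cyclic filter and shift-invariance of its globally pooled output.

import numpy as np

N = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(N)

def S(x):
    # Circular shift, Eq. (1): (S_N x)[n] = x[(n + 1) mod N]
    return np.roll(x, -1)

def f(x):
    # An example shift-equivariant function: cyclic 3-tap moving average
    return (np.roll(x, 1) + x + np.roll(x, -1)) / 3.0

# Eq. (2): f(S_N x) = S f(x) (here the output shift equals the input shift)
assert np.allclose(f(S(x)), S(f(x)))

# Eqs. (3)-(4): global spatial pooling of a shift-equivariant map is shift-invariant
g = lambda z: z.sum()
assert np.allclose(g(f(S(x))), g(f(x)))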
(a) Window-based self-attention (WSA) (b) Proposed adaptive window-based self-attention (A-WSA)
Figure 2. Re-designing window-based self-attention towards shift-equivariance: (a) The window-based self-attention WSA breaks shift
equivariance by selecting windows without considering their input properties. (b) Our proposed adaptive window-based self-attention selects
the best grid of windows based on their average energy, obtaining windows comprised of the same tokens despite input shifts.
Tokenization (token). ViTs split an input x ∈ R^N into non-overlapping patches of length L and project them into a latent space to generate tokens,

token(x) = X E ∈ R^{N/L × D},    (5)

where E ∈ R^{L×D} is a linear projection and X = reshape(x) = [X_0 . . . X_{N/L−1}]^⊤ ∈ R^{N/L × L} is a matrix where each row corresponds to the k-th patch of x, i.e.,

X_k = x[Lk : L(k + 1) − 1] ∈ R^L.    (6)

Eq. (5) implies that token is not shift-equivariant. Patches are extracted based on a fixed grid, so different patches will be obtained if the input is shifted, as illustrated in Fig. 1a.

Self-Attention (SA). In ViTs, self-attention is defined as

SA(T) = softmax(QK^⊤/√D′) V ∈ R^{M×D′},    (7)

where T = [T_0 . . . T_{M−1}]^⊤ ∈ R^{M×D} denotes input tokens, and softmax is the softmax normalization along rows. Queries Q, keys K and values V correspond to:

Q = T E^Q,  K = T E^K,  V = T E^V,    (8)

with linear projections E^{Q/K/V} ∈ R^{D×D′}. The term softmax(QK^⊤/√D′) ∈ [0, 1]^{M×M} ensures that the output token is a convex combination of the computed values.

Window-based self-attention (WSA). A crucial limitation of self-attention is its quadratic computational cost with respect to the number of input tokens M. To alleviate this, window-based self-attention [28] groups tokens into local windows and then performs self-attention within each window. Given input tokens T ∈ R^{M×D} and a window size W, window-based self-attention WSA(T) ∈ R^{M×D′} is defined as:

WSA(T) = [SA(T̄_W^(0)); . . . ; SA(T̄_W^(M/W − 1))],    (9)

where T̄_W^(k) = [T_{Wk} . . . T_{W(k+1)−1}]^⊤ ∈ R^{W×D} denotes the k-th window comprised by W neighboring tokens (W consecutive rows of T). Note that Eq. (9) uses semicolons (;) as row separators.

Swin [28, 29] architectures take advantage of WSA to decrease the computational cost while adopting a shifting scheme (at the window level) to allow long-range connections. We note that WSA is not shift-equivariant, e.g., any shift that is not a multiple of the window size changes the tokens within each window, as illustrated in Fig. 2a.

Patch merging (PMerge). Given input tokens T = [T_0 . . . T_{M−1}]^⊤ ∈ R^{M×D} and a patch length P, patch merging is defined as a linear projection of vectorized token patches:

PMerge(T) = T̃ Ẽ ∈ R^{M/P × D̃},    (10)

with T̃ = [vec(T̄_P^(0)) . . . vec(T̄_P^(M/P − 1))]^⊤. Here, vec(T̄_P^(k)) ∈ R^{PD} is the vectorized version of the k-th patch T̄_P^(k) = [T_{Pk} . . . T_{P(k+1)−1}]^⊤ ∈ R^{P×D}, and Ẽ ∈ R^{PD×D̃} is a linear projection.

PMerge reduces the number of tokens after applying self-attention while increasing the token length, i.e., D̃ > D. This is similar to the CNN strategy of increasing the number of channels using convolutional layers while decreasing their spatial resolution via pooling. Since patches are selected on a fixed grid, PMerge is not shift-equivariant.

Relative position embedding (RPE). As self-attention is permutation equivariant, spatial information must be explicitly added to the tokens. Typically, RPE adds a position matrix representing the relative distance between queries and keys into self-attention as follows:

SA^(rel)(T) = softmax(QK^⊤/√D′ + E^(rel)) V,    (11)
with E^(rel)[i, j] = B^(rel)[p_i^(Q) − p_j^(K)].    (12)

Here, E^(rel) ∈ R^{M×M} is constructed from an embedding lookup table B^(rel) ∈ R^{2M−1} and the index [p_i^(Q) − p_j^(K)] denotes the distance between the i-th query token at position p_i^(Q) and the j-th key token at position p_j^(K). As embeddings are selected based on the relative distance, RPE allows ViTs to capture spatial relationships, e.g., knowing whether two tokens are spatially nearby.
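Before moving to our adaptive design, a quick check of the shift-sensitivity reviewed above: the sketch below (toy sizes N, L, D and a random projection E; illustrative only) tokenizes a 1D signal with the fixed grid of Eq. (5). A shift by a full patch length only permutes the token index, whereas a one-sample shift produces tokens that are not a permutation of the originals, which is the failure mode in Fig. 1a.

import numpy as np

rng = np.random.default_rng(0)
N, L, D = 12, 4, 5
x = rng.standard_normal(N)
E = rng.standard_normal((L, D))

def token(x):
    # Eq. (5): split x into N/L non-overlapping patches and project with E
    return x.reshape(-1, L) @ E

t = token(x)
t_shift1 = token(np.roll(x, -1))          # shift by 1 sample
t_shiftL = token(np.roll(x, -L))          # shift by one full patch

# Shifting by L samples only permutes the token index (the equivariant case)
assert np.allclose(t_shiftL, np.roll(t, -1, axis=0))
# Shifting by 1 sample yields tokens that are not a permutation of the originals
assert not any(np.allclose(t_shift1, np.roll(t, r, axis=0)) for r in range(N // L))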
4. Truly Shift-equivariant ViT

In the previous section, we identified components of ViTs that are not shift-equivariant. To achieve shift-equivariance in ViTs, we redesign four modules: tokenization, self-attention, patch merging, and positional embedding. As equivariance is preserved under compositions, ViTs using these modules are end-to-end shift-equivariant.

Adaptive tokenization (A-token). Standard tokenization splits an input into patches using a regular grid, breaking shift-equivariance. We propose a data-dependent alternative that selects patches that maximize a shift-invariant function, resulting in the same tokens regardless of input shifts.

Given an input x ∈ R^N and a patch length L, our adaptive tokenization is defined as

A-token(x) = X^(m⋆) E ∈ R^{N/L × D},    (13)
with m⋆ = arg max_{m ∈ {0,...,L−1}} F(X^(m) E).    (14)

Here, X^(m) = reshape(S_N^m x) ∈ R^{N/L × L} is the reshaped version of the input circularly shifted by m samples, E ∈ R^{L×D} is a linear projection and F : R^{N/L × D} → R is a shift-invariant function. Note that the token representation of an input is only affected by circular shifts up to the patch size L. For any shift greater than L − 1, there is a shift smaller than L that generates the same tokens (up to a circular shift). So, an input has L different token representations. Fig. 1b illustrates our proposed shift-equivariant tokenization.

Our adaptive tokenization maximizes a shift-invariant function to ensure the same token representation regardless of input shifts. Next, we analyze an essential property of X^(m) E to prove that A-token is shift-equivariant.

Lemma 1. L-periodic shift-equivariance of tokenization. Let input x ∈ R^N have a token representation X^(m) E ∈ R^{⌊N/L⌋×D}. If x̂ = S_N x (a shifted input), then its token representation X̂^(m) E corresponds to:

X̂^(m) E = S_{⌊N/L⌋}^{⌊(m+1)/L⌋} X^((m+1) mod L) E.    (15)

This implies that x and x̂ are characterized by the same L token representations, up to a circular shift along the token index. Expressing m + 1 in terms of its quotient and remainder for divisor L, we show that the remainder determines the relation between the L token representations of an input and its shifted version, while the quotient causes a circular shift in the token index. The complete proof is deferred to Appendix Sec. A1.

Lemma 1 shows that, for any index m, there exists m̂ = (m + 1) mod L such that X^(m) and X̂^(m̂) are equal up to a circular shift. In Claim 1, we use this property to demonstrate the shift-equivariance of our proposed adaptive tokenization.

Claim 1. Shift-equivariance of adaptive tokenization. If F in Eq. (14) is shift-invariant, then A-token is shift-equivariant, i.e., ∃ m_q ∈ {0, . . . , L − 1} s.t.

A-token(S_N x) = S_{⌊N/L⌋}^{m_q} A-token(x).    (16)

Proof. Given m⋆ in Eq. (14), Lemma 1 asserts the existence of m̂ such that X̂^(m̂) E = X^(m⋆) E up to a circular shift. Since x and S_N x have the same L token representations and assuming a shift-invariant F, we show that A-token(S_N x) is equal to X̂^(m̂) E, which is a circularly shifted version of A-token(x). See Appendix Sec. A1 for the full proof.

Adaptive window-based self-attention (A-WSA). WSA's window partitioning is shift-sensitive, as different windows are obtained when the input tokens are circularly shifted by a non-multiple of the window size. Instead, we propose an adaptive token shifting method to obtain a consistent window partition. By selecting the offset based on the energy of each possible partition, our method generates the same windows regardless of input shifts.

Given input tokens T = [T_0 . . . T_{M−1}]^⊤ ∈ R^{M×D} and a window size W, let v_W ∈ R^{⌊M/W⌋} denote the average ℓ_p-norm (energy) of each token window:

v_W[k] = (1/W) Σ_{l=0}^{W−1} ∥T_{(Wk+l) mod M}∥_p.    (17)

Then, the energy of the windows resulting from shifting the input tokens by m indices corresponds to v_W^(m) ∈ R^{⌊M/W⌋}, where v_W^(m)[k] is the energy of the k-th window:

v_W^(m)[k] = (1/W) Σ_{l=0}^{W−1} ∥(S_M^m T)_{(Wk+l) mod M}∥_p,  where (S_M^m T)_{(Wk+l) mod M} = T_{(Wk+m+l) mod M}.    (18)

Based on the window energy in Eq. (18), we define the adaptive window-based self-attention as

A-WSA(T) = WSA(S_M^{m⋆} T) ∈ R^{M×D′},    (19)
with m⋆ = arg max_{m ∈ {0,...,W−1}} G(v_W^(m)).    (20)
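A minimal sketch of the offset selection in Eqs. (17)-(20), assuming G is the largest window energy (one valid shift-invariant choice; the paper leaves G generic). For a token sequence and a circularly shifted copy, the selected effective shifts differ by a multiple of W, so both are split into windows containing exactly the same tokens, which Claim 2 below formalizes.

import numpy as np

rng = np.random.default_rng(2)
M, W, D = 12, 4, 6          # number of tokens, window size, token dimension
T = rng.standard_normal((M, D))

def select_offset(T):
    # Eqs. (17)-(20): energy of every window for each candidate offset m,
    # scored by a shift-invariant G (here: the largest window energy)
    norms = np.linalg.norm(T, axis=1)
    scores = []
    for m in range(W):
        v_m = np.roll(norms, -m).reshape(-1, W).mean(axis=1)
        scores.append(v_m.max())          # G = max, invariant to window order
    return int(np.argmax(scores))

s = 3                                      # arbitrary circular shift of the tokens
T_hat = np.roll(T, -s, axis=0)
m_star, m_hat = select_offset(T), select_offset(T_hat)

# The effective shifts differ by a multiple of W, so both inputs are split
# into windows made of exactly the same tokens
assert (s + m_hat - m_star) % W == 0
assert any(np.allclose(np.roll(T_hat, -m_hat, axis=0),
                       np.roll(T, -(m_star + j * W), axis=0))
           for j in range(M // W))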
Claim 2. If G in Eq. (19) is shift-invariant, then A-WSA is shift-equivariant.

Proof. Given two groups of tokens related by a circular shift, and a shift-invariant function G, shifting each group by its maximizer in Eq. (19) induces an offset that is a multiple of W. So, both groups are partitioned into the same windows up to a circular shift. Fig. 2b illustrates this consistent window grouping. The proof is deferred to Appendix Sec. A1.

Adaptive patch merging (A-PMerge). As reviewed in Sec. 3, PMerge consists of a vectorization of P neighboring tokens followed by a projection from R^{PD} to R^{D̃}. So, it can be expressed as a strided convolution with D̃ output channels, stride factor P and kernel size P. We use this property to propose a shift-equivariant patch merging.

Claim 3. PMerge corresponds to a strided convolution with D̃ output channels, striding P and kernel size P.

Proof. Expressing the linear projection Ẽ as a convolutional matrix, PMerge is equivalent to a convolution sum with kernels comprised by columns of Ẽ. Let the input tokens be expressed as T = [t_0 . . . t_{D−1}] ∈ R^{M×D}, where t_j ∈ R^M corresponds to the j-th element of every input token. Then, PMerge(T) ∈ R^{M/P × D̃} can be expressed as:

PMerge(T) = D^(P)([y_0 . . . y_{D̃−1}]),    (21)
with y_k = Σ_{j=0}^{D−1} t_j ⊛ h^(k,j) ∈ R^M.    (22)

To make RPE consistent with circular shifts, we replace the relative distance in Eq. (12) with its circular counterpart (p_i^(Q) − p_j^(K)) mod M (see Fig. 3) to encode the distance between the i-th query token at position p_i^(Q) and the j-th key token at position p_j^(K). Here, B^(adapt) ∈ R^M is the trainable lookup table comprised by relative positional embeddings. Note that B^(adapt) is smaller than the original B^(rel) ∈ R^{2M−1}, since relative distances are now measured in a circular fashion between M tokens.

Segmentation with equivariant upsampling. Segmentation models with ViT backbones continue to use CNN decoders, e.g., Swin [28] uses UperNet [50]. As explained by Rojas-Gomez et al. [37], to obtain a shift-equivariant CNN decoder, the key is to keep track of the downsampling indices and use them to put features back to their original positions during upsampling. Different from CNNs, the proposed ViTs involve data-adaptive window selections, i.e., we would also need to keep track of the window indices and account for their shifts during upsampling.

5. Experiments

We conduct experiments on image classification and semantic segmentation on four ViT architectures, namely, Swin [28], SwinV2 [29], CvT [49], and MViTv2 [24]. For each task, we analyze their performance under circular and standard shifts. For circular shifts, the theory matches the experiments and our approach achieves 100% circular shift consistency (up to numerical errors). We further conduct experiments on standard shifts to study how our method performs under this theory-to-practice gap, where there is loss of information at the boundaries.
(a) Shifted tokens (b) Original relative distance (c) Proposed relative distance
Figure 3. Shift-consistent relative distance: (a) Circularly shifted queries and keys (M = 4). (b) Original relative distance used to build the RPE matrix: p^(Q)[i] − p^(K)[j]. Since it does not consider the periodicity of circular shifts, relative distances are not preserved. (c) Proposed relative distance: (p^(Q)[i] − p^(K)[j]) mod M. Our proposed distance is consistent with circular shifts, leading to a shift-equivariant RPE.
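A tiny sketch of the circular relative distance from Fig. 3 (illustrative; p simply stores the grid position of each of the M tokens, and the shift moves every token one slot): the unwrapped difference p^(Q)[i] − p^(K)[j] changes when tokens wrap around, while the proposed (p^(Q)[i] − p^(K)[j]) mod M is preserved, so the looked-up embeddings do not change under circular shifts.

import numpy as np

M = 4
p = np.arange(M)                       # grid position of each token

def rel_index(p_q, p_k, circular):
    d = p_q[:, None] - p_k[None, :]    # p^(Q)[i] - p^(K)[j]
    return d % M if circular else d

p_shift = (p + 1) % M                  # positions after a circular shift of the tokens

# The original (unwrapped) relative distance changes under the shift ...
assert not np.array_equal(rel_index(p, p, False), rel_index(p_shift, p_shift, False))
# ... while the proposed circular distance is preserved
assert np.array_equal(rel_index(p, p, True), rel_index(p_shift, p_shift, True))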
Table 1. CIFAR-10/100 classification results: Top-1 accuracy and shift consistency under circular and linear shifts. Bold numbers indicate improvement over the corresponding baseline architecture. Mean and standard deviation reported on five randomly initialized models.

Table 2. ImageNet classification results: Top-1 accuracy and shift consistency under circular and linear shifts. Bold numbers indicate improvement over the corresponding baseline architecture.

Method | Circular Shift: Top-1 Acc. / C-Cons. | Standard Shift: Top-1 Acc. / S-Cons.
Swin-T | 78.5 / 86.68 | 81.18 / 92.41
A-Swin-T (Ours) | 79.35 / 99.98 | 81.6 / 93.24
SwinV2-T | 78.95 / 87.68 | 81.76 / 93.24
A-SwinV2-T (Ours) | 79.91 / 99.98 | 82.10 / 94.04
CvT-13 | 77.01 / 86.87 | 81.59 / 92.80
A-CvT-13 (Ours) | 77.05 / 100 | 81.48 / 93.41
MViTv2-T | 77.36 / 90.03 | 82.21 / 93.88
A-MViTv2-T (Ours) | 77.46 / 100 | 82.4 / 94.08

identical under two different circular shifts. Given a dataset D = {I}, C-Cons. computes

(1/|D|) Σ_{I∈D} E_{∆1,∆2} [ 1[ŷ(S^∆1(I)) = ŷ(S^∆2(I))] ],    (25)

where 1 denotes the indicator function, ŷ(I) the class prediction for I, S the circular shift operator, and ∆1 = (h1, w1), ∆2 = (h2, w2) horizontal and vertical offsets.

Results. We report performance in Tab. 1 and Tab. 2 for CIFAR-10/100 and ImageNet, respectively. Overall, we observe that our adaptive ViTs achieve near 100% shift consistency in practice. The remaining inconsistency is caused by numerical precision limitations and tie-breaking leading to a different selection of tokens or shifted windows. Beyond consistency improvements, our method also improves classification accuracy across all settings.

5.2. Image classification under standard shifts

Experiment setup. To study the boundary effect on shift-invariance, we further conduct experiments using standard shifts. As these are no longer circular, the image content may change at its borders, i.e., perfect shift consistency is no longer guaranteed. For CIFAR-10/100, input images were resized to the resolution used by each model's original implementation. Default data augmentation and optimizer settings were used for each model, while training epochs and batch size followed those used in the circular shift settings.

Evaluation metric. We report top-1 classification accuracy on the original dataset (without any shifts). To quantify shift-invariance, we report the standard shift consistency (S-Cons.), which follows the same principle as C-Cons. in Eq. (25), but uses a standard shift instead of a circular one. For CIFAR-10/100, we use zero-padding at the boundaries due to the small image size. For ImageNet, following Zhang [55], we perform an image shift followed by a center-cropping of size 224 × 224. This produces realistic shifts and avoids a particular choice of padding.

Results. Tabs. 1 and 2 report performance under standard shifts on CIFAR-10/100 and ImageNet, respectively. Due to changes in the boundary content, our method does not achieve 100% shift consistency. However, we persistently observe that the adaptive models outperform their respective baselines.
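A sketch of how the consistency in Eq. (25) can be estimated in practice (the stand-in model, offsets, and sampling budget below are assumptions, not the paper's evaluation code); the expectation over offsets is approximated by sampling pairs of circular shifts per image.

import torch

def circular_consistency(model, images, num_pairs=8, max_shift=8):
    # Eq. (25): fraction of (image, offset-pair) samples whose class predictions
    # agree under two different circular shifts of the same image
    agree, total = 0, 0
    for x in images:                                       # x: (C, H, W) tensor
        for _ in range(num_pairs):
            d1 = [int(v) for v in torch.randint(0, max_shift, (2,))]
            d2 = [int(v) for v in torch.randint(0, max_shift, (2,))]
            y1 = model(torch.roll(x, shifts=d1, dims=(1, 2))[None]).argmax(1)
            y2 = model(torch.roll(x, shifts=d2, dims=(1, 2))[None]).argmax(1)
            agree += int((y1 == y2).item())
            total += 1
    return agree / total

# Stand-in, circularly shift-invariant "classifier": global average pool + linear head
torch.manual_seed(0)
head = torch.randn(3, 10)
model = lambda x: x.mean(dim=(2, 3)) @ head
images = [torch.randn(3, 32, 32) for _ in range(4)]
print(circular_consistency(model, images))                # 1.0 for this invariant stand-in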
Figure 4. Consistent token representations. Panels show the input (224 × 224) and the Block 1 (56 × 56), Block 2 (28 × 28), and Block 3 (14 × 14) errors for CvT-13 and A-CvT-13 (Ours). Shifting inputs by a small offset leads to large deviations (non-zero errors) in the representations when using default ViTs (e.g., CvT-13). In contrast, our proposed models (e.g., A-CvT-13) achieve an absolute zero error across all blocks.
Image Shifted Image SwinV2 + UperNet Predictions A-SwinV2 + UperNet Predictions (Ours)
Figure 5. Segmentation under standard shifts: Our A-SwinV2 + UperNet model improves robustness to input shifts over the original
model, generating consistent predictions while improving accuracy. Examples of prediction changes due to shifts are highlighted in red.
We report the mean-Average Segmentation Circular Consistency (mASCC), which counts how often the predicted pixel labels (after shifting back) are identical under two different circular shifts. Given a dataset D = {I}, mASCC computes

(1/|D|) Σ_{I∈D} E_{∆1,∆2} [ (1/(HW)) Σ_{u=1,v=1}^{H,W} 1[ S^{−∆1} ŷ(S^{∆1}(I))[u, v] = S^{−∆2} ŷ(S^{∆2}(I))[u, v] ] ],    (26)

where H, W correspond to the image height and width, and [u, v] indexes the class prediction at pixel (u, v).

Results. Tab. 5 shows segmentation accuracy and shift consistency for UperNet segmenters using Swin-T and SwinV2-T backbones. Following the theory, our adaptive models achieve 100% mASCC (perfect shift consistency), while improving on segmentation accuracy.

5.6. Semantic segmentation under standard shifts

Experiment setup. As in the circular shift scenario, models are trained for 160K iterations with a total batch size of 16 using the default data augmentation. To evaluate shift-equivariance under standard shifts, we report the mean-Average Semantic Segmentation Consistency (mASSC), which computes the percentage of predicted pixel labels (after shifting back) that remain identical under two standard shifts. Note that mASSC ignores the boundary pixels in its computation, as standard shifts lead to changes in boundary content.

Results. Tab. 5 reports results on the standard shift scenario. Due to boundary conditions, perfect shift-equivariance is not achieved. Nevertheless, our models improve segmentation performance and shift consistency, with a notable improvement on SwinV2-T. See Fig. 5 for segmentation results.

5.7. Ablation study

We study the impact of each proposed ViT module by removing it. These modules include (i) adaptive tokenization, (ii) adaptive window-based self-attention, and (iii) adaptive patch merging. We conduct the ablations on our A-Swin-T model trained on CIFAR-10 under circular shifts.

Results are reported in Tab. 6. Our full model improves shift consistency by more than 3.5% over A-Swin-T without A-token, while slightly decreasing classification accuracy by 0.27%. The use of A-WSA improves both classification accuracy and shift consistency. Finally, A-PMerge improves classification accuracy by approximately 1.7% and shift consistency by more than 5%. Overall, all the proposed modules are necessary to achieve 100% shift consistency.

6. Conclusion

We propose a family of ViTs that are truly shift-invariant and equivariant. We redesigned four ViT modules, including tokenization, self-attention, patch merging, and positional embedding, to theoretically guarantee circular shift-invariance and equivariance. Using these modules, we built versions of Swin, SwinV2, CvT, and MViTv2 that are truly shift-equivariant. When matching the theoretical setup, these models exhibit 100% circular shift consistency and superior performance compared to baselines. Furthermore, on standard shifts, where image boundaries deviate from the theory, our proposed models remain more resilient to shifts with task performance on par with or exceeding the baselines.
References
[1] A. Azulay and Y. Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? JMLR, 2019. 2
[2] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE SPM, 2017. 2
[3] A. Chaman and I. Dokmanic. Truly shift-invariant convolutional neural networks. In Proc. CVPR, 2021. 2, 5, 7, 19, 21
[4] T. Cohen and M. Welling. Group equivariant convolutional networks. In Proc. ICML, 2016. 2
[5] T. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. In Proc. ICML, 2019. 2
[6] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical CNNs. In Proc. ICLR, 2018. 2
[7] M. Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020. 20
[8] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proc. CVPR workshop, 2020. 19
[9] P. de Haan, M. Weiler, T. Cohen, and M. Welling. Gauge equivariant mesh CNNs: Anisotropic convolutions on geometric graphs. In Proc. ICLR, 2021. 2
[10] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Proc. NeurIPS, 2016. 2
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009. 1, 5
[12] P. Ding, D. Soselia, T. Armstrong, J. Su, and F. Huang. Reviving shift equivariance in vision transformers. arXiv preprint arXiv:2306.07470, 2023. 2
[13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. ICLR, 2021. 1
[14] H. Fan, B. Xiong, K. Mangalam, Y. Li, Z. Yan, J. Malik, and C. Feichtenhofer. Multiscale vision transformers. In Proc. CVPR, 2021. 1, 18
[15] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, et al. A survey on vision transformer. IEEE TPAMI, 2022. 2
[16] J. Hartford, D. Graham, K. Leyton-Brown, and S. Ravanbakhsh. Deep models of interactions across sets. In Proc. ICML, 2018. 2
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016. 2
[18] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah. Transformers in vision: A survey. ACM CSUR, 2022. 2
[19] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proc. ICLR, 2017. 2
[20] D. M. Klee, O. Biza, R. Platt, and R. Walters. Image to sphere: Learning equivariant features for efficient pose prediction. In Proc. ICLR, 2023. 2
[21] R. Kondor, Z. Lin, and S. Trivedi. Clebsch–Gordan Nets: a fully Fourier space spherical convolutional neural network. In Proc. NeurIPS, 2018. 2
[22] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009. 1, 5
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NeurIPS, 2012. 2
[24] Y. Li, C.-Y. Wu, H. Fan, K. Mangalam, B. Xiong, J. Malik, and C. Feichtenhofer. MViTv2: Improved multiscale vision transformers for classification and detection. In Proc. CVPR, 2022. 1, 5
[25] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013. 2
[26] I.-J. Liu, R. A. Yeh, and A. G. Schwing. PIC: permutation invariant critic for multi-agent deep reinforcement learning. In Proc. CORL, 2020. 2
[27] I.-J. Liu, Z. Ren, R. A. Yeh, and A. G. Schwing. Semantic tracklets: An object-centric representation for visual multi-agent reinforcement learning. In Proc. IROS, 2021. 2
[28] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proc. ICCV, 2021. 1, 3, 5, 7, 18
[29] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proc. CVPR, 2022. 1, 3, 5, 18
[30] H. Maron, H. Ben-Hamu, N. Shamir, and Y. Lipman. Invariant and equivariant graph networks. In Proc. ICLR, 2019. 2
[31] H. Maron, O. Litany, G. Chechik, and E. Fetaya. On learning sets of symmetric elements. In Proc. ICML, 2020. 2
[32] H. Michaeli, T. Michaeli, and D. Soudry. Alias-free convnets: Fractional shift invariance via polynomial activations. In Proc. CVPR, 2023. 2
[33] C. Morris, G. Rattan, S. Kiefer, and S. Ravanbakhsh. SpeqNets: Sparsity-aware permutation-equivariant graph networks. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proc. ICML, 2022. 2
[34] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. CVPR, 2017. 2
[35] S. Ravanbakhsh, J. Schneider, and B. Póczos. Equivariance through parameter-sharing. In Proc. ICML, 2017. 2
[36] S. Ravanbakhsh, J. Schneider, and B. Poczos. Deep learning with sets and point clouds. In Proc. ICLR workshop, 2017. 2
[37] R. A. Rojas-Gomez, T. Y. Lim, A. G. Schwing, M. N. Do, and R. A. Yeh. Learnable polyphase sampling for shift invariant and equivariant convolutional networks. In Proc. NeurIPS, 2022. 2, 5, 19
[38] D. Romero, E. Bekkers, J. Tomczak, and M. Hoogendoorn. Attentive group equivariant convolutional networks. In Proc. ICML, 2020. 2
[39] D. W. Romero and S. Lohit. Learning partial equivariances from data. In Proc. NeurIPS, 2022. 2
[40] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proc. CVPR, 2018. 2
[41] M. Shakerinava and S. Ravanbakhsh. Equivariant networks for pixelized spheres. In Proc. ICML, 2021. 2
[42] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE SPM, 2013. 2
[43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015. 2
[44] T. van der Ouderaa, D. W. Romero, and M. van der Wilk. Relaxing equivariance constraints with non-stationary continuous filters. In Proc. NeurIPS, 2022. 2
[45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Proc. NeurIPS, 2017. 1
[46] S. R. Venkataraman, S. Balasubramanian, and R. R. Sarma. Building deep equivariant capsule networks. In Proc. ICLR, 2020. 2
[47] M. Vetterli, J. Kovačević, and V. K. Goyal. Foundations of signal processing. Cambridge University Press, 2014. 2
[48] M. Weiler and G. Cesa. General E(2)-equivariant steerable CNNs. In Proc. NeurIPS, 2019. 2
[49] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang. CvT: Introducing convolutions to vision transformers. In Proc. CVPR, 2021. 1, 5
[50] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun. Unified perceptual parsing for scene understanding. In Proc. ECCV, 2018. 5, 7
[51] R. A. Yeh, Y.-T. Hu, and A. Schwing. Chirality nets for human pose regression. In Proc. NeurIPS, 2019. 2
[52] R. A. Yeh, A. G. Schwing, J. Huang, and K. Murphy. Diverse generation for multi-agent sports games. In Proc. CVPR, 2019. 2
[53] R. A. Yeh, Y.-T. Hu, M. Hasegawa-Johnson, and A. Schwing. Equivariance discovery by learned parameter-sharing. In Proc. AISTATS, 2022. 2
[54] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Proc. NeurIPS, 2017. 2
[55] R. Zhang. Making convolutional networks shift-invariant again. In Proc. ICML, 2019. 2, 6
[56] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In Proc. CVPR, 2017. 1, 7
[57] X. Zou, F. Xiao, Z. Yu, and Y. J. Lee. Delving deeper into anti-aliasing in convnets. In Proc. BMVC, 2020. 2
Appendix
The appendix is organized as follows:
• Sec. A1 provides the complete proofs for all the claims stated in the main paper.
• Sec. A2 includes additional implementation, runtime details, and memory consumption.
• Sec. A3 fully describes the augmentation settings used for image classification and semantic segmentation.
• Sec. A4 provides additional semantic segmentation qualitative results.
• Sec. A5 reports additional image classification results covering:
a. The robustness of our proposed models to out-of-distribution images (Sec. A5.1).
b. The sensitivity of our ViT models to input shifts of different magnitudes (Sec. A5.2).
c. The use of our proposed adaptive modules on pre-trained ViTs (Sec. A5.3).
A1.1. Proof of Lemma 1

Lemma 1. L-periodic shift-equivariance of tokenization. Let input x ∈ R^N have a token representation X^(m) E ∈ R^{⌊N/L⌋×D}. If x̂ = S_N x (a shifted input), then its token representation X̂^(m) E corresponds to:

X̂^(m) E = S_{⌊N/L⌋}^{⌊(m+1)/L⌋} X^((m+1) mod L) E.    (15)

This implies that x and x̂ are characterized by the same L token representations, up to a circular shift along the token index (row index of X^((m+1) mod L) E).

Proof. By definition,

X̂^(m) = reshape(S_N^m x̂) = reshape(S_N^{m+1} x).    (A27)

Let the input patches be expressed as X^(m) = [r_0^(m) . . . r_{L−1}^(m)] ∈ R^{⌊N/L⌋×L}, where r_k^(m) ∈ R^{⌊N/L⌋} is comprised by the k-th element of every input patch,

r_k^(m)[n] = (S_N^m x)[Ln + k] = x[(Ln + k + m) mod N].    (A28)

More precisely, given a circularly shifted input with offset m, r_k^(m)[n] represents the k-th element of the n-th patch. Based on this, X̂^(m) ∈ R^{⌊N/L⌋×L} can be expressed as:

X̂^(m) = reshape(S_N^{m+1} x) = [r_0^(m+1) . . . r_{L−1}^(m+1)],    (A29)
with r_k^(m+1)[n] = x[(Ln + k + m + 1) mod N].    (A30)

X̂^(m) E = S_{⌊N/L⌋}^{⌊(m+1)/L⌋} X^((m+1) mod L) E.    (A32)
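A quick numerical check of the statements above under assumed toy sizes (N divisible by L, random E, and total ℓ2 energy as one shift-invariant choice of F; illustrative only, not the paper's implementation).

import numpy as np

rng = np.random.default_rng(0)
N, L, D = 12, 4, 3
x = rng.standard_normal(N)
E = rng.standard_normal((L, D))
x_hat = np.roll(x, -1)                         # x_hat = S_N x

def X(m, signal):
    # X^(m) = reshape(S_N^m signal): one length-L patch per row
    return np.roll(signal, -m).reshape(-1, L)

# Lemma 1 / Eq. (A32): tokens of the shifted input at offset m equal the tokens
# of the original input at offset (m+1) mod L, rolled by floor((m+1)/L) rows
for m in range(L):
    lhs = X(m, x_hat) @ E
    rhs = np.roll(X((m + 1) % L, x) @ E, -((m + 1) // L), axis=0)
    assert np.allclose(lhs, rhs)

# Claim 1: with a shift-invariant F (total l2 energy here), the A-token outputs
# of x and x_hat agree up to a circular shift of the token index
F = lambda tokens: np.square(tokens).sum()
a_token = lambda s: max((X(m, s) @ E for m in range(L)), key=F)
t, t_hat = a_token(x), a_token(x_hat)
assert any(np.allclose(t_hat, np.roll(t, -r, axis=0)) for r in range(N // L))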
A1.2. Proof of Claim 1

Claim 1. Shift-equivariance of adaptive tokenization. If F in Eq. (14) is shift-invariant, then A-token is shift-equivariant, i.e., ∃ m_q ∈ {0, . . . , L − 1} s.t.

A-token(S_N x) = S_{⌊N/L⌋}^{m_q} A-token(x).    (16)

Proof. Let x̂ = S_N x ∈ R^N be a circularly shifted version of input x ∈ R^N. From Lemma 1, their token representations satisfy:

X̂^(m) E = S_{⌊N/L⌋}^{⌊(m+1)/L⌋} X^((m+1) mod L) E,    (A33)

where the right-hand side in Eq. (A35) derives from the shift-invariance property of F. Note that for any integer m in the range {0, . . . , L − 1}, the circular shift (m + 1) mod L also lies within the same range. It follows that:

max_{m∈{0,...,L−1}} F(X^((m+1) mod L) E) = max_{m∈{0,...,L−1}} F(X^(m) E),    (A36)

which implies that m⋆ = (m̂ + 1) mod L. Finally, from Lemma 1 and Eq. (A38):

A-token(S_N x) = S_{⌊N/L⌋}^{⌊(m̂+1)/L⌋} X^((m̂+1) mod L) E = S_{⌊N/L⌋}^{⌊(m̂+1)/L⌋} A-token(x).    (A39)

A1.3. Proof of Claim 2

Proof. Let T ∈ R^{M×D} and T̂ = S_M T ∈ R^{M×D} denote two token representations related by a circular shift. From the definition of v_W^(m) in Eq. (17), let v ∈ R^M denote the average ℓ_p-norm (energy) of each group of W neighboring tokens in T. More precisely, the k-th component of the energy vector v, denoted as v[k], is the energy of the window comprised by W neighboring tokens starting at index k,

v[k] = (1/W) Σ_{l=0}^{W−1} ∥T_{(k+l) mod M}∥_p.    (A40)

Similarly, following Eq. (18), let v_W^(m) ∈ R^{⌊M/W⌋} and v̂_W^(m) ∈ R^{⌊M/W⌋} denote the energy vectors of non-overlapping windows obtained after shifting T and T̂ by m indices, respectively. Then, maximizers m⋆ and m̂ satisfy

m⋆ = arg max_{m∈{0,...,W−1}} G(v_W^(m)),   m̂ = arg max_{m∈{0,...,W−1}} G(v̂_W^(m)).    (A41)
In what follows, we prove that their adaptive window-based self-attention outputs are related by a circular shift that is a multiple of the window size W, i.e., there exists m_0 ∈ Z such that

A-WSA(S_M T) = S_M^{m_0 W} A-WSA(T).    (A42)

Given the shifted input token representation T̂ = S_M T, the energy of each group of W neighboring tokens corresponds to

v̂[k] = (1/W) Σ_{l=0}^{W−1} ∥(S_M T)_{(k+l) mod M}∥_p = (1/W) Σ_{l=0}^{W−1} ∥T_{(k+l+1) mod M}∥_p    (A43)
     = v[k + 1],    (A44)

which implies v̂ = S_M v. Then, v̂_W^(m) can be expressed as

v̂_W^(m)[k] = v̂[Wk + m] = v[Wk + m + 1].    (A45)

Expressing m + 1 in terms of its quotient and remainder for divisor W,

v̂_W^(m)[k] = v[W(k + ⌊(m+1)/W⌋) + (m + 1) mod W]    (A46)
           = (S_{⌊M/W⌋}^{⌊(m+1)/W⌋} v_W^((m+1) mod W))[k].    (A47)

Based on the shift-invariant property of G and the fact that (m + 1) mod W ∈ {0, . . . , W − 1}, the A-WSA selection criterion corresponds to

max_{m∈{0,...,W−1}} G(v̂_W^(m)) = max_{m∈{0,...,W−1}} G(S_{⌊M/W⌋}^{⌊(m+1)/W⌋} v_W^((m+1) mod W))    (A48)
                               = max_{m∈{0,...,W−1}} G(v_W^((m+1) mod W))    (A49)
                               = max_{m∈{0,...,W−1}} G(v_W^(m)),    (A50)

which implies m⋆ = (m̂ + 1) mod W. It follows that the adaptive self-attention of T̂ can be expressed as:

A-WSA(T̂) = WSA(S_M^{m̂} T̂) = WSA(S_M^{m̂+1} T)    (A52)
          = WSA(S_M^{⌊(m̂+1)/W⌋ W + m⋆} T)    (A53)
          = WSA(S_M^{⌊(m̂+1)/W⌋ W} S_M^{m⋆} T).    (A54)

Since S_M^{⌊(m̂+1)/W⌋ W} corresponds to a circular shift by a multiple of the window size W,

A-WSA(T̂) = S_M^{⌊(m̂+1)/W⌋ W} WSA(S_M^{m⋆} T),    (A55)

where the right-hand side of Eq. (A55) stems from the fact that, for WSA with a window size W, circularly shifting the input tokens by a multiple of W results in an identical circular shift of the output tokens. Finally, from the definition of adaptive self-attention in Eq. (19),

A-WSA(T̂) = S_M^{m_0 W} A-WSA(T),   m_0 = ⌊(m̂ + 1)/W⌋.    (A56)

Claim 2 shows that A-WSA induces an offset between token representations that is a multiple of the window size W. As a result, windows are comprised by the same tokens, despite circular shifts. This way, A-WSA guarantees that an input token representation and its circularly shifted version are split into the same token windows, leading to a circularly shifted self-attention output.
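As a numerical companion to this proof, the sketch below (toy sizes, random projections, and G = max window energy are all assumptions, since the projections and G are left generic here) implements WSA with softmax attention and checks Eq. (A56): the A-WSA outputs of a token sequence and its circular shift are related by a circular shift that is a multiple of W.

import numpy as np

rng = np.random.default_rng(3)
M, W, D = 12, 4, 6
T = rng.standard_normal((M, D))
EQ, EK, EV = (rng.standard_normal((D, D)) for _ in range(3))

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def wsa(T):
    # Window-based self-attention, Eq. (9): plain SA within each window of W tokens
    out = []
    for k in range(M // W):
        Tw = T[k * W:(k + 1) * W]
        Q, K, V = Tw @ EQ, Tw @ EK, Tw @ EV
        out.append(softmax(Q @ K.T / np.sqrt(D)) @ V)
    return np.concatenate(out, axis=0)

def a_wsa(T):
    # Eqs. (17)-(20) with G = max window energy (a shift-invariant choice)
    norms = np.linalg.norm(T, axis=1)
    scores = [np.roll(norms, -m).reshape(-1, W).mean(axis=1).max() for m in range(W)]
    m_star = int(np.argmax(scores))
    return wsa(np.roll(T, -m_star, axis=0))

y = a_wsa(T)
y_hat = a_wsa(np.roll(T, -1, axis=0))          # circularly shifted tokens
# Eq. (A56): outputs related by a circular shift that is a multiple of W
assert any(np.allclose(y_hat, np.roll(y, -j * W, axis=0)) for j in range(M // W))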
A1.4. Proof of Claim 3

Claim 3. PMerge corresponds to a strided convolution with D̃ output channels, striding P and kernel size P.

Proof. Let the input tokens be expressed as T = [t_0 . . . t_{D−1}] ∈ R^{M×D}, where t_j ∈ R^M is comprised by the j-th element of every input token. This implies that, given the l-th token T_l ∈ R^D, T_l[j] = t_j[l]. Then, the patch merging output Z = [z_0 . . . z_{D̃−1}] can be expressed as

Z = PMerge(T) = D^(P)(Y) ∈ R^{M/P × D̃},    (A57)

where D^(P) ∈ R^{M/P × M} is a downsampling operator of factor P, and Y = [y_0 . . . y_{D̃−1}] ∈ R^{M×D̃} is the output of a convolution sum with kernel size P and D̃ output channels,

y_k = Σ_{j=0}^{D−1} t_j ⊛ h^(k,j) ∈ R^M,    (A58)

where ⊛ denotes circular convolution and h^(k,j) = [h_0^(k,j) . . . h_{P−1}^(k,j)]^⊤ ∈ R^P denotes a convolutional kernel. Note that the convolution sum in Eq. (A58) involves D · D̃ kernels, given that k ∈ {0, . . . , D̃ − 1} and j ∈ {0, . . . , D − 1}. Without loss of generality, assume the number of input tokens M is divisible by the patch length P. Then, due to the linearity of the downsampling operator, z_k ∈ R^{M/P} corresponds to

z_k = D^(P)( Σ_{j=0}^{D−1} t_j ⊛ h^(k,j) ).    (A59)

Then, note that each summation term in Eq. (A59) can be expressed as a matrix-vector multiplication,

z_k = Σ_{j=0}^{D−1} H̃^(k,j) t_j,    (A61)

where H̃^(k,j) = D^(P) H^(k,j) ∈ R^{M/P × M} corresponds to the convolutional matrix H^(k,j) downsampled by a factor P along the row index:

H̃^(k,j) = [ h_0^(k,j) · · · h_{P−1}^(k,j)
                             h_0^(k,j) · · · h_{P−1}^(k,j)
                                                 ⋱
                                                     h_0^(k,j) · · · h_{P−1}^(k,j) ],    (A62)

i.e., its r-th row contains the kernel entries starting at column rP, with zeros elsewhere. Based on this, the convolution sum in Eq. (A59) can be expressed in matrix-vector form as

z_k = [H̃^(k,0) · · · H̃^(k,D−1)] [t_0; . . . ; t_{D−1}].    (A63)
Then, from the patch representation of T in Eq. (10), z_k can be alternatively expressed as

z_k = T̃ ẽ_k,    (A64)

where ẽ_k denotes the k-th column of Ẽ.

Claim 3 shows that the original patch merging function PMerge is equivalent to projecting all M overlapping patches of length P, which can be expressed as a circular convolution, followed by keeping only M/P of the resulting tokens. Note that the selection of such tokens is done via a downsampling operation of factor P, which is not the only way of selecting them. In fact, there are P different token representations that can be selected, as explained in Sec. 4.

Based on this observation, the proposed A-PMerge introduced in Sec. 4 takes advantage of the polyphase decomposition to select the token representation in a data-dependent fashion, leading to a shift-equivariant patch merging scheme.
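A numerical sketch of both points above, under assumed toy sizes and a row-major vectorization: first, PMerge as in Eq. (10) matches a circular correlation (convolution with the flipped kernels read off the columns of Ẽ) followed by downsampling by P; second, selecting among the P polyphase token groupings with a shift-invariant score (total ℓ2 energy here; the paper's exact criterion is not shown in this excerpt) yields merged tokens that agree up to a circular shift when the input tokens are circularly shifted, which is the behavior A-PMerge is designed to provide.

import numpy as np

rng = np.random.default_rng(4)
M, D, P, D_tilde = 12, 3, 4, 5
T = rng.standard_normal((M, D))
E_tilde = rng.standard_normal((P * D, D_tilde))

def pmerge(T):
    # Eq. (10): vectorize P neighboring tokens per patch, then project with E_tilde
    patches = [T[k * P:(k + 1) * P].reshape(-1) for k in range(len(T) // P)]
    return np.stack(patches) @ E_tilde

# Claim 3: the same output via a circular correlation per (output, input) channel
# pair, with kernels h^(k,j) read off the columns of E_tilde, then downsampling by P
out = np.zeros((M // P, D_tilde))
for k in range(D_tilde):
    y_k = np.zeros(M)
    for j in range(D):
        h = E_tilde[:, k].reshape(P, D)[:, j]        # kernel h^(k,j), length P
        y_k += np.array([sum(T[(n + l) % M, j] * h[l] for l in range(P))
                         for n in range(M)])
    out[:, k] = y_k[::P]                             # keep every P-th token
assert np.allclose(pmerge(T), out)

# A-PMerge sketch: evaluate all P polyphase groupings, keep the highest-energy one
def a_pmerge(T):
    candidates = [pmerge(np.roll(T, -m, axis=0)) for m in range(P)]
    return candidates[int(np.argmax([np.square(c).sum() for c in candidates]))]

z, z_hat = a_pmerge(T), a_pmerge(np.roll(T, -2, axis=0))
# Shifted input tokens give merged tokens equal up to a circular shift of the rows
assert any(np.allclose(z_hat, np.roll(z, -r, axis=0)) for r in range(M // P))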
conv_padding_mode=conv_padding_mode,
stride=stride,
window_selection=None,
window_size=None,
)
# Global pooling and head
self.avgpool=nn.AdaptiveAvgPool2d((1,1))
self.fc=nn.Linear(dim*stride,num_classes)
def forward(self,x):
# Reshape
B,C,H,W = x.shape
x = x.permute(0,2,3,1).reshape(B,H*W,C)
# Adaptive patch merge
x = self.apmerge(x)
# Reshape back
x = x.reshape(
B,H//self.stride,W//self.stride,C*self.stride,
).permute(0,3,1,2)
# Global average pooling
x = torch.flatten(self.avgpool(x),1)
# Classification head
x = self.fc(x)
return x
# Input tokens
B,C,H,W = 1,3,8,8
stride = 2
x = torch.randn(B,C,H,W).cuda().double()
# Shifted input
shift = torch.randint(-3,3,(2,))
x_shift = torch.roll(input=x,dims=(2,3),shifts=(shift[0],shift[1]))
# A-pmerge classifier
model = ApmergeClassifier(
stride=stride,
input_resolution=(H,W),
dim=C).cuda().double().eval()
# Predict
y = model(x)
y_shift = model(x_shift)
err = torch.norm(y-y_shift)
assert(torch.allclose(y,y_shift))
# Check circular shift invariance
print('y: {}'.format(y))
print('y_shift: {}'.format(y_shift))
print("error: {}".format(err))
Out:
y: tensor([[-0.1878, 0.2348, 0.0982, -0.2191]],
   device='cuda:0', dtype=torch.float64, grad_fn=<AddmmBackward0>)
y_shift: tensor([[-0.1878, 0.2348, 0.0982, -0.2191]],
   device='cuda:0', dtype=torch.float64, grad_fn=<AddmmBackward0>)
error: 0.0
# Simple encoder-decoder
class EncDec(nn.Module):
def __init__(self,dims,stride,win_sel,win_sz,pad):
super().__init__()
self.pad = pad
self.stride = stride
self.win_sz = win_sz
def forward(self,x,hw_shape):
B,C,H,W = x.shape
# Reshape
x = x.permute(0,2,3,1).reshape(B,H*W,C)
# A-pmerge + A-WSA
x_w,hw_shape,idx = self.apmerge(x=x,hw_shape=hw_shape,ret_indices=True)
# Keep merging and windowing indices
idx = idx[:3]
# Reshape back
x_w = x_w.reshape(
B,H//self.stride,W//self.stride,C*self.stride).permute(0,3,1,2)
# Unroll and Unpool
y = unroll_unpool_multistage(
x=x_w,
scale_factor=[self.stride],
unpool_layer=self.unpool,
idx=[idx],
unpool_winsel_roll=self.pad,
)
return y
# Input
B,C,H,W = 1,3,28,28
shift_max = 6
x = torch.randn(B,C,H,W).cuda().double()
# Offsets
s01 = torch.randint(low=-shift_max,high=shift_max,size=(1,2)).tolist()[0]
s02 = torch.randint(low=-shift_max,high=shift_max,size=(1,2)).tolist()[0]
s03 = [s02[0]-s01[0],s02[1]-s01[1]]
# Shifted inputs
x01 = torch.roll(x,shifts=s01,dims=(-1,-2))
x02 = torch.roll(x,shifts=s02,dims=(-1,-2))
# Build encoder-decoder
# Use A-WSA (poly_win)
model = EncDec(dims=3,stride=2,pad='circular',win_sz=7,win_sel=poly_win,
).cuda().double().eval()
# Predictions
y01 = model(x01,hw_shape=(H,W))
y02 = model(x02,hw_shape=(H,W))
# Shift to compare
z = torch.roll(y01,shifts=s03,dims=(-1,-2))
err = torch.norm(z-y02)
assert torch.allclose(z,y02)
print("torch.norm(z-y02): {}".format(err))
Out:
torch.norm(z-y02): 0.0
Model | Training: Max. Allocated / Max. Reserved (MiB) | Inference: Max. Allocated / Max. Reserved (MiB)
Swin-T | 4,935 / 5,380 | 1,012 / 1,352
A-Swin-T (Ours) | 5,178 (+4.92%) / 5,678 (+5.54%) | 1,012 (0%) / 1,382 (+2.22%)
SwinV2-T | 7,712 / 8,016 | 1,275 / 1,544
A-SwinV2-T (Ours) | 8,113 (+5.2%) / 8,710 (+8.66%) | 1,275 (0%) / 1,534 (−0.65%)
CvT-13 | 5,794 / 6,044 | 1,587 / 1,850
A-CvT-13 (Ours) | 6,257 (+7.99%) / 6,528 (+8.01%) | 1,587 (0%) / 1,780 (−3.78%)
MViTv2-T | 6,335 / 7,352 | 1,944 / 3,356
A-MViTv2-T (Ours) | 7,364 (+16.24%) / 8,456 (+15.02%) | 2,164 (+11.32%) / 3,378 (+0.67%)

Table A1. Memory consumption: Maximum allocated and reserved memory required by our adaptive ViTs and their default versions. Training and inference memory consumption is measured on a single NVIDIA Quadro RTX 5000 GPU (batch size 64, default image size per model) and reported in mebibytes (MiB). The relative change with respect to the default models (%) is shown in parentheses.
decreases by 3%.
Overall, the training memory consumption marginally increases with respect to the default models, while the inference
memory consumption remains almost unaffected.
Size Preprocessing (Train and Test Sets): (i) Resize (Swin: 256 × 256, SwinV2: 292 × 292); (ii) Center Crop (Swin: 224 × 224, SwinV2: 256 × 256)
Augmentation (Train Set): (i) Random Horizontal Flipping; (ii) RandAugment (magnitude: 9, increase severity: True, augmentations per image: 2, standard deviation: 0.5); (iii) Normalization; (iv) RandomErasing
Augmentation (Test Set): Normalization
Table A3. Swin / SwinV2 Image Classification preprocessing (standard shifts). Data preprocessing and augmentation used to train and
test the default (Swin, SwinV2) and adaptive (A-Swin, A-SwinV2) models.
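For reference, the train- and test-time pipelines in Tab. A3 roughly correspond to the following torchvision sketch. The normalization statistics and the RandomErasing probability are assumptions not listed in the table, and torchvision's RandAugment does not expose the magnitude-std / increasing-severity options, which come from timm-style RandAugment strings such as 'rand-m9-mstd0.5-inc1'.

from torchvision import transforms

# Approximate Swin preprocessing from Tab. A3 (Swin sizes shown; use 292/256 for
# SwinV2). Values not listed in the table are assumptions.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])   # ImageNet stats (assumed)
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),   # 2 ops per image, magnitude 9
    transforms.ToTensor(),
    normalize,
    transforms.RandomErasing(),                        # probability left at default
])
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])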
Table A4. MViTv2 Image Classification preprocessing (standard shifts). Data preprocessing and augmentation used to train and test the
default (MViTv2) and adaptive (A-MViTv2) models.
Table A5. CvT Image Classification preprocessing (standard shifts). Data preprocessing and augmentation used to train and test the
default (CvT) and adaptive (A-CvT) models.
For Swin-V2+UperNet, both the baseline and the adaptive (A-SwinV2) models were trained by resizing input images to size 2,048 × 512 followed by a random cropping of size 512 × 512. Following this, the default data augmentation used in the Swin + UperNet model is adopted. During testing, input images are resized to 2,048 × 512. Then, each dimension size is rounded up to the next multiple of 256. Tab. A7 describes the pre-processing and data augmentation pipelines.
Data pre-processing (Standard Shifts). For the Swin+UperNet baseline and adaptive models, the default pre-processing pipeline from the official MMSeg implementation [7] was used. The same pipeline was used to evaluate the SwinV2+UperNet baseline and adaptive models. The pre-processing pipeline is detailed in Tab. A8.
Computational settings. ADE20K semantic segmentation experiments using Swin and SwinV2 models, both adaptive and baseline architectures, were trained on four NVIDIA A100 GPUs with an effective batch size of 16 images for 160,000
Size Preprocessing (Train Set): (i) Resize: 1,792 × 448; (ii) Random Crop: 448 × 448
Size Preprocessing (Test Set): (i) Resize: 1,792 × 448; (ii) Resize to next multiple of 224
Augmentation (Train Set): (i) Random Horizontal Flipping; (ii) Photometric Distortion; (iii) Normalization
Augmentation (Test Set): Normalization
Table A6. Swin + UperNet Semantic Segmentation preprocessing (circular shifts). Data preprocessing and augmentation used to train
and test the default (Swin+UperNet) and adaptive (A-Swin+UperNet) models.
Table A7. SwinV2 + UperNet Semantic Segmentation preprocessing (circular shifts). Data preprocessing and augmentation used to
train and test the default (SwinV2+UperNet) and adaptive (A-SwinV2+UperNet) models.
Table A8. Semantic Segmentation preprocessing (standard shifts). Data preprocessing and augmentation used to train and test the default
and adaptive versions of both SwinV2+UperNet and A-SwinV2+UperNet models.
iterations. These settings are consistent with the official MMSeg configuration for the Swin+UperNet model.
Model | max = 0 | max = 28 | max = 42 | max = 56 | max = 70   (each cell: Top-1 Acc. / C-Cons.)
Swin-T (Default) | 90.21 / 82.69 | 90.16 / 81.75 | 89.71 / 81.42 | 89.02 / 80.4 | 87.83 / 79.58
Swin-T DA | 92.32 / 94.3 | 91.81 / 94.19 | 91.6 / 94.16 | 90.42 / 93.01 | 89.23 / 92.73
A-Swin-T (Ours) | 93.53 / 100 | 93.27 / 100 | 92.71 / 100 | 91.57 / 100 | 90.3 / 100
Table A9. Performance on images with randomly erased patches. Top-1 classification accuracy (%) and shift consistency (%) on
CIFAR-10 test images with randomly erased square patches. Max corresponds to the largest possible patch size, as sampled from a uniform
distribution U {0, max}.
Model Offset ∈ {0, . . . , 8} Offset ∈ {0, . . . , 16} Offset ∈ {0, . . . , 24} Offset ∈ {0, . . . , 32}
Swin-T 92.00 ± .23 89.65 ± .9 88.93 ± .09 88.13 ± .14
A-Swin-T (Ours) 100 100 100 100
SwinV2-T 92.13 ± .04 90.43 ± .17 89.67 ± .08 88.75 ± .15
A-SwinV2-T (Ours) 100 100 100 100
CvT-13 88.99 ± .1 87.42 ± 0.16 86.96 ± .1 86.84 ± .06
A-CvT-13 (Ours) 100 100 100 100
MViTv2-T 91.49 ± .04 90.57 ± .11 90.46 ± .08 90.22 ± .1
A-MViTv2-T (Ours) 100 100 100 100
Table A11. Consistency under different shift magnitudes. Shift consistency (%) of our adaptive ViT models under small, medium, large,
and very large shifts. Models trained and evaluated on CIFAR-10 under a circular shift assumption.
consistency by at least 1.5%, our A-Swin-T preserves its perfect shift consistency. Our adaptive model also gets the best
classification accuracy in all cases, improving by more than 1%. This suggests that, despite not being explicitly trained on this
transformation, our A-Swin model is more robust than the default and DA models, obtaining better accuracy and consistency
across scenarios.
On the other hand, Tab. A10 shows the shift consistency and classification accuracy of the three Swin models evaluated
under vertically flipped images. Even in such a challenging case and without any fine-tuning, our A-Swin-T model retains its
perfect shift consistency, improving over the default and data-augmented models by at least 5%. Our adaptive model also
outperforms the default and data-augmented models in terms of classification accuracy, both on flipped and unflipped images,
by a significant margin.
Model Top-1 Acc.(%) C-Cons.(%)
CvT-13 (Default) 90.12 76.54
CvT-13 + Adapt (No Fine-tuning) 57.4 100
CvT-13 + Adapt (10 epoch Fine-tuning) 91.72 100
CvT-13 + Adapt (20 epoch Fine-tuning) 92.35 100
A-CvT-13 (Ours) 93.87 100
Table A12. Incorporating shift-equivariant modules on pre-trained ViTs. Our shift-equivariant ViT framework allows plugging shift-equivariant modules into pre-trained models (e.g., CvT-13), improving classification accuracy after a few fine-tuning epochs while preserving perfect shift consistency. Results shown on CvT-13 trained on CIFAR-10 under a circular shift assumption.
Image Shifted Image Swin + UperNet Predictions A-Swin + UperNet Predictions (Ours)
Figure A1. Swin + UperNet Semantic Segmentation under standard shifts: Semantic segmentation results on the ADE20K dataset
(standard shifts) via Swin backbones: Our A-Swin + UperNet model is more robust to input shifts than the original Swin + UperNet model,
generating consistent predictions while improving accuracy. Examples of prediction changes due to input shifts are boxed in yellow.
Image Shifted Image SwinV2 + UperNet Predictions A-SwinV2 + UperNet Predictions (Ours)
Figure A2. SwinV2 + UperNet Semantic Segmentation under standard shifts: Additional semantic segmentation results on the ADE20K
dataset (standard shifts) via SwinV2 backbones: Our adaptive model improves both segmentation accuracy and shift consistency with respect
to the original SwinV2 + UperNet model. Examples of prediction changes due to input shifts are boxed in yellow.