
Adaptive Tokenization in Vision Transformers

Anonymous Author(s)
Affiliation
Address
email

Abstract

In contrast to transformers for language, Vision Transformers (ViTs) apply a static approach to tokenization, lacking the capability to adapt dynamically to characteristics inherent in the data. Providing ViTs the ability to adapt to the input could improve performance by enabling tokens to be flexible and specific to each task. In this work, we establish a framework for learnable tokenization that is commensurable with the general transformer backbone. We demonstrate our framework by proposing a learnable tokenizer that trains models with unique inherent benefits, including pixel-level granularity for better alignment with dense prediction tasks and strong shape and scale invariance properties, and we show that our tokenizer can be applied as a drop-in replacement for patch-based tokenization in pre-trained ViT models.

1 Introduction

The transformer architecture underpins significant modeling advances for language [6, 11, 17, 23, 24] as well as vision tasks [7, 16]. In the language processing pipeline, tokenization serves as a crucial preprocessing step to strike a balance between effective vocabulary size and redundancy using information-theoretic measures [12, 20, 21]. Unlike their language counterparts, Vision Transformers (ViTs) lack similar data-driven mechanisms for tokenization, instead extracting tokens as fixed, uniformly sized patches. While this strategy has been shown to be effective [5, 14, 22], the approach also has inherent limitations. Although informational content varies between images, patch-based tokenization always yields an equal number of tokens for images of the same size. This is problematic, as higher numbers of tokens empirically indicate stronger predictive performance [7, 14, 25].

Simply put, patch-based tokenization is unable to capture the informational content in the data. This manifests as two distinct phenomena. The first stems from undertokenization, where patches lack the spatial granularity required for effectively modeling the task. Typically, grid-based extraction of patches yields low-resolution feature maps, which are inadequate for dense predictions such as image segmentation. Conversely, overtokenization arises from an inefficient utilization of data redundancy, leading to an excessive number of tokens over regions with minimal informational content and increasing computational complexity without proportionate performance gains. These concepts draw direct parallels to under- and oversegmentation in classical image processing [8], while reflecting the trade-offs faced in optimizing subword tokenization in language.

Several works [] aim to tackle these issues. Most existing works focus on remedial steps, often involving alterations of the general transformer architecture. While innovative, such approaches must be carefully balanced so as to retain the predictive performance and strong multimodal properties [18] of the general transformer architecture. We posit that such approaches indirectly address what is fundamentally a limitation of the general tokenization strategy. In this work, we take a step back to re-evaluate the role of static patch-based tokenization in standard ViTs while maintaining the standard architecture and allowing for modularity and extensions to existing architectures.

Submitted to 38th Conference on Neural Information Processing Systems (NeurIPS 2024). Do not distribute.
In summary, our main contributions include:

• A generalized framework for adaptable visual tokenization in ViTs, providing a richer space of models better equipped for handling issues of under- and overtokenization.
• A highly parallelizable, differentiable, hierarchical graph-based superpixel method for effective visual tokenization.
• A feature extraction framework using kernelized positional encoding that adapts to irregular shapes, generalizes the patch-based tokenization in ViTs, and can be applied as a drop-in replacement in pre-trained models with minimal fine-tuning.

Figure 1: Taxonomy of adaptive tokenization in transformers. Tokenization ranges from decoupled to coupled with respect to the transformer architecture, and from coarse to fine token granularity. To contextualize vision models with LLMs, GPT-3 [2] is included for reference. [The figure places SuperToken [10], ContextCluster [15], ToMe [1], MSViT [9], Quadformer [19], SPiT, and GPT-3 [2] along these two axes.]

Notation: We let H × W = {(y, x) : 1 ≤ y ≤ h, 1 ≤ x ≤ w} denote the coordinates of an image of spatial dimension (h, w), and let I be an index set for the mapping i ↦ (y, x). We consider a C-channel image as a map ξ : I → R^C, generally assuming C = 3. We use the standard vectorization operator vec : R^{d_1 × ... × d_n} → R^{d_1 ⋯ d_n}, and denote function composition by f(g(x)) = (f ∘ g)(x). The transitive closure of a relation R is denoted by R^+.

2 On Tokenization in Vision Transformers


We start with a high-level overview of a generalized framework for visual tokenization. The tokenization strategy in standard ViTs is based on partitioning an image into fixed patches of size ρ×ρ. These patches are vectorized to yield flattened feature representations. Note that this step differs from language processing pipelines, where tokens are numerically represented solely through their learnable embeddings. In contrast, image data is inherently numeric, necessitating a feature extraction mechanism; hence, tokenization in ViTs projects features to a fixed embedding size using a linear transformation. Positional encodings are subsequently added and additional class or register tokens [4] are concatenated, yielding token embeddings of the patches extracted from the image. Notably, this is equivalent to a convolutional layer with kernel size k and stride s such that k = s = ρ.
From this, we observe that the standard ViT tokenization process can be deconstructed into three steps. The tokenization operator T maps an image ξ to a set of spatially connected subregions, determining the overall granularity of the tokenized representation. A feature extractor F is then required to map image features to a common inner product space. Finally, the embedder E projects each region to a fixed embedding size, handling positional encodings and class tokens. This yields a general visual tokenization process represented by the functional composition (E ∘ F ∘ T)(ξ), which includes the patch-based tokenizer in ViTs as a special case.
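
To make the decomposition concrete, the following sketch (an illustrative PyTorch rendering under our own naming, not the authors' implementation) expresses standard ρ×ρ patch tokenization as the composition (E ∘ F ∘ T)(ξ): T partitions the image into a regular grid, F vectorizes each patch, and E applies a shared linear projection with an additive positional encoding.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Standard ViT patch tokenization, written as the composition E(F(T(x))).

    Illustrative sketch: names (rho, embed_dim, grid) are our own, and the
    positional encoding is a simple learnable table.
    """
    def __init__(self, rho: int = 16, in_chans: int = 3,
                 embed_dim: int = 768, grid: int = 14):
        super().__init__()
        self.rho = rho
        # E: linear projection of flattened patches plus positional encoding.
        self.proj = nn.Linear(in_chans * rho * rho, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, grid * grid, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # T and F: partition into non-overlapping rho x rho regions and
        # vectorize each region (equivalent to kernel = stride = rho).
        patches = nn.functional.unfold(x, kernel_size=self.rho, stride=self.rho)
        patches = patches.transpose(1, 2)      # (B, num_patches, C * rho * rho)
        # E: project to the embedding dimension and add positional encodings.
        return self.proj(patches) + self.pos

tokens = PatchTokenizer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```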
Key Properties in Visual Tokenization: Our goal is to construct an adaptive framework for visual tokenization which addresses the limitations of patch-based tokenization. While not exhaustive, we identify four key properties that we see as essential for effective tokenization in vision tasks:

(i) Adaptability: The tokenizer should adapt to the image data and be differentiable for end-to-end learning and optimizing performance across various downstream tasks.
(ii) Granularity: The tokenizer should be capable of delineating regions with pixel-level detail. This allows the model to perform accurately on tasks that demand a high degree of spatial precision.
(iii) Deduplication: The tokenizer should leverage inherent redundancies in visual data to minimize computational demands while maximizing the information density in the representations.
(iv) Modularity: The tokenizer should be independent yet commensurable with standard transformer backbones. This enables different tokenizers to be deployed depending on the downstream task.

3 An Adaptive Tokenizer for ViTs


Given our generalized framework, we propose a method to learn visual tokens that adapt to the contents of the input images while keeping token embeddings commensurable with standard ViTs. Central to our framework is our proposed learnable tokenizer T. We design our feature extractor F and embedder E to maximize modularity and compatibility with general transformer backbones. In essence, the tokenizer T uses a shallow convolutional network ϕ in conjunction with a recurrent graph message passing operator γ, where we combine node aggregation with a proposed transitive pooling operator using parallel connected components. The role of the convolutional network ϕ is to extract local features, such as gradient and texture information, as a measure of the homogeneity of image regions. The graph message passing layer γ then iteratively computes edge weights w between neighboring vertices and aggregates regions by mean transitive pooling up to a preferred level of granularity.
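
As a concrete, hypothetical instantiation of ϕ (the paper only specifies a shallow convolutional network capturing gradient and texture cues; the layer choices below are our own), the following sketch maps an image to one feature vector per pixel, which then serves as the initial vertex features V^(0) = ϕ(ξ) for the lattice graph described next.

```python
import torch
import torch.nn as nn

class PixelFeatures(nn.Module):
    """Shallow convolutional feature extractor: a hypothetical stand-in for phi.

    Preserves spatial resolution so that every pixel becomes a vertex of the
    initial lattice graph G^(0).
    """
    def __init__(self, in_chans: int = 3, feat_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_chans, feat_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.net(image)                  # (B, feat_dim, H, W)
        return feats.flatten(2).transpose(1, 2)  # (B, H*W, feat_dim) = V^(0)

v0 = PixelFeatures()(torch.randn(1, 3, 64, 64))
print(v0.shape)  # torch.Size([1, 4096, 32])
```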
Graph Partitioning: We consider an image as a lattice graph, letting E^(0) ⊂ I × I denote four-way adjacency edges given some image size H × W. A superpixel region is then a set S ⊂ I, and we say that S is connected if for any two pixels a, b ∈ S there exists a sequence of edges ((i_j, i_{j+1}))_{j=1}^{k−1} in E^(0) such that i_1 = a and i_k = b. A set of superpixels forms a partition π of an image if for any two distinct superpixels S, S′ ∈ π the intersection S ∩ S′ = ∅, and the union of all superpixels is equal to the set of all pixel positions in the image, i.e., ⋃_{S ∈ π} S = I.

Let Π(I) ⊂ 2^{2^I} denote the space of all possible partitions of an image. We consider sequences of partitions (π^(t))_{t=0}^T, and say that a partition π^(t) is a refinement of another partition π^(t+1) if for all superpixels S ∈ π^(t) there exists an S′ ∈ π^(t+1) such that S ⊆ S′, and we write π^(t) ⊑ π^(t+1). Our approach is to construct a T-level hierarchical partitioning of the pixel indices H = (π^(t) ∈ Π(I) : π^(t) ⊑ π^(t+1))_{t=0}^T such that each superpixel is connected.

To construct the hierarchical partitioning H, we successively merge regions in parallel to coarsen the partitions π^(t) ↦ π^(t+1). We do this by considering each level of the hierarchy as a graph G^(t), where each vertex v ∈ V^(t) represents a superpixel in the partition π^(t) and each edge (u, v) ∈ E^(t) represents adjacent superpixels, for levels t = 0, ..., T. Two superpixels S, S′ ∈ π are adjacent if there exists at least one pair of pixels (a, b) with a ∈ S and b ∈ S′ which are neighboring pixels in the original image. The initial vertex features are computed by V^(0) = ϕ(ξ), such that the initial lattice graph G^(0) = (V^(0), E^(0)) corresponds to the singleton partition π^(0) = {{i} : i ∈ I}. Consecutive levels are then computed using the message passing operator γ : G^(t) → G^(t+1).
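
For clarity, the snippet below (a plain NumPy sketch under our own conventions) constructs the four-way adjacency edge set E^(0) of the pixel lattice and the corresponding singleton partition π^(0), with pixel indices following the row-major mapping i ↦ (y, x).

```python
import numpy as np

def lattice_edges(h: int, w: int) -> np.ndarray:
    """Four-way adjacency edges E^(0) over pixel indices i = y * w + x."""
    idx = np.arange(h * w).reshape(h, w)
    right = np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1)
    down = np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1)
    return np.concatenate([right, down], axis=0)  # one row per undirected edge

def singleton_partition(h: int, w: int) -> list:
    """pi^(0): every pixel is its own superpixel."""
    return [{i} for i in range(h * w)]

edges = lattice_edges(4, 5)
print(edges.shape)  # (31, 2): 16 horizontal + 15 vertical edges
```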
Message Passing: The message passing is done in three steps: (i) compute edge weights, (ii) select maximally homogeneous neighbors, and (iii) update vertices by transitive pooling and mean aggregation. For (i) we require a functional w : E^(t) → R for the current iteration of the hierarchy. Without loss of generality, we assume a fixed similarity functional, but note that any distance or similarity measure can be applied. To promote compactness and regularize superpixel sizes, we note that regularization can be performed by including self-loops in the edge weights, such that the edge weight functional is given by

    w(u, v) = { sim(u, v),  for u ≠ v;
                J(u),       otherwise,        (1)

where J is a suitable regularizer. We outline and ablate different choices for sim and J in Sec. 6.

Figure 2: Visualization of superpixel aggregation with transitive pooling. The colored outlines indicate the contracted components. Dashed edges indicate non-contracted edges propagated to the next level of the hierarchy.

For step (ii) we propose a greedy parallel update of the graph with respect to w. Let N(v) denote the neighborhood of adjacent vertices of v. We construct an intermediate set of edges, given by

    Ê^(t) = { (v, argmax_{u ∈ N(v)} w(u, v)) : v ∈ V^(t) }.        (2)

The final step (iii) of message passing aggregates the graph by contracting edges and vertices with respect to the intermediate edge set Ê^(t).

Table 1: Classification accuracy (Top-1) from end-to-end pretraining on IN1k.

Model  Adp.  Grad.  # Par.  Im./s    IN-ReaL (224)    IN1k (224)       Caltech (224)    CIFAR100 (224)
                                     kNN     Lin.     kNN     Lin.     kNN     Lin.     kNN     Lin.
B16    ✗     ✗      86.6M   793.04   0.978   0.853    0.737   0.802    0.879   0.879    0.897   0.892
B16    ✗     ✓      86.8M   721.12   0.975   0.854    0.748   0.805    0.885   0.889    0.899   0.899
B16    ✓     ✗      86.6M   690.72   0.954   0.793    0.569   0.760    0.829   0.833    0.634   0.813
B16    ✓     ✓      86.8M   640.59   0.980   0.858    0.752   0.804    0.891   0.888    0.845   0.884

Median throughput estimated over training with 4× MI250X GPUs using float32 precision.

To update the graph in parallel, we apply our proposed transitive pooling operator to ensure that the partitions are still valid before aggregation. The transitive closure Ê^(t)+ explicitly provides π_v^(t+1) = ⋃_{u ∈ N⁺(v)} π_u^(t), where N⁺(v) denotes the connected component of v in Ê^(t). This assures that each partition at level (t+1) is a connected region. After transitive pooling is applied, vertex features are updated via mean aggregation, giving v^(t+1) = (1/|N⁺(v)|) Σ_{u^(t) ∈ N⁺(v)} u^(t).
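
The discrete merge logic of one level γ : G^(t) → G^(t+1) can be sketched as follows. This is our own simplified, non-differentiable NumPy/SciPy rendering: cosine similarity and a constant self-loop weight stand in for the learned functionals sim and J, and the paper's differentiable formulation with mean injection is not reproduced here.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def merge_level(V: np.ndarray, E: np.ndarray, self_weight: float = 0.0):
    """One coarsening step. V: (n, d) vertex features; E: (m, 2) undirected edges.

    Returns new vertex features, new edges, and the assignment of old vertices
    to contracted superpixels.
    """
    n = V.shape[0]
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-8)

    # (i) Edge weights; a constant self-loop weight plays the role of J (Eq. 1).
    u, v = E[:, 0], E[:, 1]
    w = np.einsum("ij,ij->i", Vn[u], Vn[v])   # cosine similarity as sim(u, v)
    best = np.full(n, -np.inf)
    np.maximum.at(best, u, w)
    np.maximum.at(best, v, w)

    # (ii) Greedy parallel selection of the maximally similar neighbor (Eq. 2);
    #      a vertex keeps its self-loop if no neighbor beats the self weight.
    best_nb = np.arange(n)
    for (a, b), weight in zip(E, w):
        if weight == best[a] and weight > self_weight:
            best_nb[a] = b
        if weight == best[b] and weight > self_weight:
            best_nb[b] = a

    # (iii) Transitive pooling: contract connected components of the chosen edges.
    sel = coo_matrix((np.ones(n), (np.arange(n), best_nb)), shape=(n, n))
    k, assign = connected_components(sel, directed=False)

    # Mean aggregation of vertex features over each contracted component.
    V_new = np.zeros((k, V.shape[1]))
    np.add.at(V_new, assign, V)
    V_new /= np.bincount(assign, minlength=k)[:, None]

    # Propagate non-contracted edges to the next level of the hierarchy.
    E_new = np.unique(np.sort(assign[E], axis=1), axis=0)
    E_new = E_new[E_new[:, 0] != E_new[:, 1]]
    return V_new, E_new, assign
```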
To guarantee scale invariance, the message passing layer needs to be invariant to the iterative nature of these operations; hence we avoid local projections at each step. Finally, to ensure that the parameters of T are learnable with differentiable optimization, we apply a mean injection step, where the mean of the tokenized regions is substituted with a prediction from the vertex features.

4 Methodology

5 Experiments and Results


As any dedicated reader can clearly see, the Ideal of practical reason is a representation of, as far as I know, the things in themselves; as I have shown elsewhere, the phenomena should only be used as a canon for our understanding. The paralogisms of practical reason are what first give rise to the architectonic of practical reason. As will easily be shown in the next section, reason would thereby be made to contradict, in view of these considerations, the Ideal of practical reason, yet the manifold depends on the phenomena. Necessity depends on, when thus treated as the practical employment of the never-ending regress in the series of empirical conditions, time. Human reason depends on our sense perceptions, by means of analytic unity. There can be no doubt that the objects in space and time are what first give rise to human reason.

5.1 Classification

We evaluate our adaptive tokenization scheme for ViTs by pretraining end-to-end on ImageNet1k (addcite) for classification, and validate using the ImageNet ReaL (addcite) labels as well as the original ImageNet validation labels. We ablate over the adaptive tokenizer described in Sec. 3 and the standard patch-based tokenizer, as well as the incorporation of gradient features. We also report downstream classification task results by fine-tuning on Caltech256 (addcite) and Cifar100 (addcite). The results in Table 1 show that ViTs can be successfully trained with adaptive tokenization.
Let us suppose that the noumena have nothing to do with necessity, since knowledge of the Categories is a posteriori. Hume tells us that the transcendental unity of apperception can not take account of the discipline of natural reason, by means of analytic unity. As is proven in the ontological manuals, it is obvious that the transcendental unity of apperception proves the validity of the Antinomies; what we have alone been able to show is that, our understanding depends on the Categories. It remains a mystery why the Ideal stands in need of reason. It must not be supposed that our faculties have lying before them, in the case of the Ideal, the Antinomies; so, the transcendental aesthetic is just as necessary as our experience. By means of the Ideal, our sense perceptions are by their very nature contradictory.

5.2 Segmentation

Table 2: Classification accuracy (Top-1) on drop-in tokenizer with fine-tuning.

Model       Tok.   IN-ReaL          IN1k             Caltech†         CUB200†          Stanford Cars†
                   Acc.1   Acc.5    Acc.1   Acc.5    Acc.1   Acc.5    Acc.1   Acc.5    Acc.1   Acc.5
S16 (22M)   ViT    86.10   96.99    80.40   95.17    86.96   96.00    60.96   86.59    24.93   42.12
S16 (22M)   APx    84.89   96.81    80.11   94.98    91.05   97.63    81.45   95.86    31.10   46.68
B16 (87M)   ViT    87.70   97.62    82.63   96.43    90.27   97.86    72.51   92.84    35.13   56.75
B16 (87M)   APx    88.27   97.75    84.21   96.63    93.33   98.42    81.33   96.51    32.70   50.84
B32 (88M)   ViT    76.96   88.58    70.24   91.86    82.36   93.57    57.75   84.24    20.97   30.36
B32 (88M)   APx‡   87.32   97.83    83.73   96.78    93.86   98.77    83.86   97.00    39.80   61.72

† Using frozen backbone and linear probing.
‡ Note that adaptive tokenization results in higher numbers of tokens compared to baseline.

Table 3: Comparison of token granularity by mean number of tokens (#) and scale invariance on IN1k over different image sizes. We adjust the number of layers and apply a thresholding step to have equivalent token granularity with baseline models.

              128×128          224×224          256×256          384×384          512×512
Model  Tok.   #       Acc.1    #       Acc.1    #       Acc.1    #       Acc.1    #        Acc.1
S16    ViT    64.00   71.96    196.00  80.40    256.00  81.17    576.00  78.87    1024.00  74.66
S16    APx    70.06   68.54    198.34  78.64    274.91  79.70    612.25  80.13    1095.64  79.046
B16    ViT    64.00   77.22    196.00  82.63    256.00  83.28    576.00  81.67    1024.00  78.60
B16    APx    70.69   75.30    209.49  83.61    286.62  84.10    519.13  84.55    1039.82  84.82
B32    ViT    16.00   43.98    49.00   70.24    64.00   70.90    144.00  70.90    256.00   66.05
B32    APx    17.53   52.86    73.35   78.28    109.58  80.10    205.24  80.71    261.41   83.63

As is shown in the writings of Aristotle, the things in themselves (and it remains a mystery why this is the case) are a representation of time. Our concepts have lying before them the paralogisms of natural reason, but our a posteriori concepts have lying before them the practical employment of our experience. Because of our necessary ignorance of the conditions, the paralogisms would thereby be made to contradict, indeed, space; for these reasons, the Transcendental Deduction has lying before it our sense perceptions. (Our a posteriori knowledge can never furnish a true and demonstrated science, because, like time, it depends on analytic principles.) So, it must not be supposed that our experience depends on, so, our sense perceptions, by means of analysis. Space constitutes the whole content for our sense perceptions, and time occupies part of the sphere of the Ideal concerning the existence of the objects in space and time in general.

5.3 Interpretability and XAI

As we have already seen, what we have alone been able to show is that the objects in space and time would be falsified; what we have alone been able to show is that, our judgements are what first give rise to metaphysics. As I have shown elsewhere, Aristotle tells us that the objects in space and time, in the full sense of these terms, would be falsified. Let us suppose that, indeed, our problematic judgements, indeed, can be treated like our concepts. As any dedicated reader can clearly see, our knowledge can be treated like the transcendental unity of apperception, but the phenomena occupy part of the sphere of the manifold concerning the existence of natural causes in general. Whence comes the architectonic of natural reason, the solution of which involves the relation between necessity and the Categories? Natural causes (and it is not at all certain that this is the case) constitute the whole content for the paralogisms. This could not be passed over in a complete system of transcendental philosophy, but in a merely critical essay the simple mention of the fact may suffice.

Table 4: Semantic segmentation on ADE20k. NA signifies values not reported by the authors.

Model   Tok.  Method            Im.Sz.  mIoU  mAcc
B16     ViT   iBOT [13]         NA      48.6  59.3
B16     ViT   dBOT [13]         NA      49.5  60.7
Swin-B  ViT   Mask2Former [3]   640     52.4  NA
B16     APx   APx+MLP           512     54.0  82.8
B32     APx   APx+MLP           512     53.8  82.8

Table 5: Results for unsupervised salient segmentation.
ECSSD DUTS DUT-OMRON
Model Postproc. Fβ IoU Acc. Fβ IoU Acc. Fβ IoU Acc.
DINO-B14 None 0.803 0.712 0.918 0.672 0.576 0.903 0.600 0.533 0.880
DINO-B14 BL 0.874 0.772 0.934 0.755 0.624 0.914 0.697 0.618 0.897
DINO-S8 MF 0.894 0.779 0.943 0.789 0.648 0.938 0.733 0.609 0.923
DINO-S8 MF+BL 0.911 0.803 0.951 0.819 0.694 0.949 0.774 0.677 0.939
DINO-S8 Seg+MF 0.921 0.835 0.956 0.829 0.728 0.954 0.756 0.666 0.933
DINO-S8 Seg+MF+BL 0.917 0.800 0.952 0.827 0.687 0.952 0.766 0.665 0.937
SPiT-B16 None 0.903 0.773 0.934 0.771 0.639 0.894 0.711 0.564 0.868
SPiT-B16 GBL 0.924 0.799 0.942 0.779 0.644 0.905 0.719 0.587 0.898

Table 6: Faithfulness of Attributions, w. CI (95%).


ViT-B16 (IN1k)                      RViT-B16 (IN1k)                     SPiT-B16 (IN1k)
Comp ↑          Suff ↓              Comp ↑          Suff ↓              Comp ↑          Suff ↓
LIME/SLIC 0.244 ± 0.004 0.543 ± 0.006 0.236 ± 0.004 0.591 ± 0.007 0.244 ± 0.005 0.520 ± 0.006
Att.Flow  0.160 ± 0.004 0.664 ± 0.006 0.223 ± 0.005 0.685 ± 0.007 0.259 ± 0.006 0.558 ± 0.006
Prot.PCA  0.206 ± 0.005 0.710 ± 0.006 0.209 ± 0.005 0.691 ± 0.007 0.256 ± 0.005 0.592 ± 0.006
Color coding: baseline, weaker than baseline, stronger than baseline.

6 Ablations

Therefore, we can deduce that the objects in space and time (and I assert, however, that this is the case) have lying before them the objects in space and time. Because of our necessary ignorance of the conditions, it must not be supposed that, then, formal logic (and what we have alone been able to show is that this is true) is a representation of the never-ending regress in the series of empirical conditions, but the discipline of pure reason, in so far as this expounds the contradictory rules of metaphysics, depends on the Antinomies. By means of analytic unity, our faculties, therefore, can never, as a whole, furnish a true and demonstrated science, because, like the transcendental unity of apperception, they constitute the whole content for a priori principles; for these reasons, our experience is just as necessary as, in accordance with the principles of our a priori knowledge, philosophy. The objects in space and time abstract from all content of knowledge. Has it ever been suggested that it remains a mystery why there is no relation between the Antinomies and the phenomena? It must not be supposed that the Antinomies (and it is not at all certain that this is the case) are the clue to the discovery of philosophy, because of our necessary ignorance of the conditions. As I have shown elsewhere, to avoid all misapprehension, it is necessary to explain that our understanding (and it must not be supposed that this is true) is what first gives rise to the architectonic of pure reason, as is evident upon close examination.
The Tanimoto similarity is given by

    sim(u, v) = ⟨u, v⟩ / (⟨u, v⟩ + ∥u − v∥²).        (3)
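
For reference, Eq. (3) can be evaluated directly; the helper below is a small sketch (the function name is our own), and since ⟨u, v⟩ + ∥u − v∥² = ∥u∥² + ∥v∥² − ⟨u, v⟩ it coincides with the usual extended-Jaccard form of the Tanimoto similarity.

```python
import torch

def tanimoto(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Tanimoto similarity of Eq. (3), computed along the last dimension."""
    dot = (u * v).sum(dim=-1)
    return dot / (dot + ((u - v) ** 2).sum(dim=-1) + eps)

u, v = torch.randn(4, 32), torch.randn(4, 32)
print(tanimoto(u, v).shape)  # torch.Size([4])
```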

7 Discussion and Related Work

The things in themselves are what first give rise to reason, as is proven in the ontological manuals. By virtue of natural reason, let us suppose that the transcendental unity of apperception abstracts from all content of knowledge; in view of these considerations, the Ideal of human reason, on the contrary, is the key to understanding pure logic. Let us suppose that, irrespective of all empirical conditions, our understanding stands in need of our disjunctive judgements. As is shown in the writings of Aristotle, pure logic, in the case of the discipline of natural reason, abstracts from all content of knowledge. Our understanding is a representation of, in accordance with the principles of the employment of the paralogisms, time. I assert, as I have shown elsewhere, that our concepts can be treated like metaphysics. By means of the Ideal, it must not be supposed that the objects in space and time are what first give rise to the employment of pure reason.

References

[1] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In Inter. Conf. Learn. Represent. (ICLR), 2023.

[2] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Adv. Neural Inf. Process. Sys. (NeurIPS), 2020.

[3] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. 2022.

[4] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In Inter. Conf. Learn. Represent. (ICLR), 2024.

[5] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey A. Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetic, Dustin Tran, Thomas Kipf, Mario Lucic, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, and Neil Houlsby. Scaling vision transformers to 22 billion parameters. Proceedings of Machine Learning Research, 202:7480–7512, 23–29 Jul 2023.

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Conf. North Amer. Ch. Assoc. Comput. Ling. (NAACL), pages 4171–4186. Association for Computational Linguistics, 2019.

[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16×16 words: Transformers for image recognition at scale. In Inter. Conf. Learn. Represent. (ICLR), 2021.

[8] Robert M. Haralick and Linda G. Shapiro. Image segmentation techniques. Computer Vision, Graphics, and Image Processing, 29(1):100–132, 1985. ISSN 0734-189X.

[9] Jakob Drachmann Havtorn, Amelie Royer, Tijmen Blankevoort, and Babak Ehteshami Bejnordi. MSViT: Dynamic mixed-scale tokenization for vision transformers. In IEEE Inter. Conf. Comput. Vis. Wksps. (ICCVW), pages 838–848, 2023. ISBN 9798350307443.

[10] Huaibo Huang, Xiaoqiang Zhou, Jie Cao, Ran He, and Tieniu Tan. Vision transformer with super token sampling, 2022.

[11] Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023.

[12] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Linguistics, 5:339–351, 2017.

[13] Xingbin Liu, Jinghao Zhou, Tao Kong, Xianming Lin, and Rongrong Ji. Exploring target representations for masked autoencoders. In Inter. Conf. Learn. Represent. (ICLR), 2024.

[14] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin transformer V2: Scaling up capacity and resolution. In IEEE/CVF Inter. Conf. Comput. Vis. Pattern Recog. (CVPR), pages 11999–12009, 2022.

[15] Xu Ma, Yuqian Zhou, Huan Wang, Can Qin, Bin Sun, Chang Liu, and Yun Fu. Image as set of points. In Inter. Conf. Learn. Represent. (ICLR), 2023. URL https://openreview.net/forum?id=awnvqZja69.

[16] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision. Trans. Mach. Learn. Res., 2024.

[17] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019.

[18] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Inter. Conf. Mach. Learn. (ICML), volume 139, pages 8748–8763, 2021.

[19] Tomer Ronen, Omer Levy, and Avram Golbert. Vision transformers with mixed-resolution tokenization. In IEEE/CVF Inter. Conf. Comput. Vis. Pattern Recog. (CVPR), pages 4612–4621, 2023.

[20] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Conf. Assoc. Comput. Ling. (ACL), 2016.

[21] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Katrin Erk and Noah A. Smith, editors, Conf. Assoc. Comput. Ling. (ACL), pages 1715–1725, Berlin, Germany, 2016.

[22] Hugo Touvron, Matthieu Cord, and Hervé Jégou. DeiT III: Revenge of the ViT. In European Conf. Comput. Vis. (ECCV), volume 13684, pages 516–533, 2022.

[23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023.

[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Adv. Neural Inf. Process. Sys. (NeurIPS), volume 30, 2017.

[25] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In IEEE Inter. Conf. Comput. Vis. (ICCV), pages 538–547, 2021.

A Appendix / supplemental material

Optionally include supplemental material (complete proofs, additional experiments and plots) in appendix. All such materials SHOULD be included in the main submission.

310 NeurIPS Paper Checklist

311 The checklist is designed to encourage best practices for responsible machine learning research,
312 addressing issues of reproducibility, transparency, research ethics, and societal impact. Do not remove
313 the checklist: The papers not including the checklist will be desk rejected. The checklist should
314 follow the references and follow the (optional) supplemental material. The checklist does NOT count
315 towards the page limit.
316 Please read the checklist guidelines carefully for information on how to answer these questions. For
317 each question in the checklist:
318 • You should answer [Yes] , [No] , or [NA] .
319 • [NA] means either that the question is Not Applicable for that particular paper or the
320 relevant information is Not Available.
321 • Please provide a short (1–2 sentence) justification right after your answer (even for NA).
322 The checklist answers are an integral part of your paper submission. They are visible to the
323 reviewers, area chairs, senior area chairs, and ethics reviewers. You will be asked to also include it
324 (after eventual revisions) with the final version of your paper, and its final version will be published
325 with the paper.
326 The reviewers of your paper will be asked to use the checklist as one of the factors in their evaluation.
327 While "[Yes] " is generally preferable to "[No] ", it is perfectly acceptable to answer "[No] " provided a
328 proper justification is given (e.g., "error bars are not reported because it would be too computationally
329 expensive" or "we were unable to find the license for the dataset we used"). In general, answering
330 "[No] " or "[NA] " is not grounds for rejection. While the questions are phrased in a binary way, we
331 acknowledge that the true answer is often more nuanced, so please just use your best judgment and
332 write a justification to elaborate. All supporting evidence can appear either in the main paper or the
333 supplemental material, provided in appendix. If you answer [Yes] to a question, in the justification
334 please point to the section(s) where related material for the question can be found.
335 IMPORTANT, please:
336 • Delete this instruction block, but keep the section heading “NeurIPS paper checklist",
337 • Keep the checklist subsection headings, questions/answers and guidelines below.
338 • Do not modify the questions and only use the provided macros for your answers.
339 1. Claims
340 Question: Do the main claims made in the abstract and introduction accurately reflect the
341 paper’s contributions and scope?
342 Answer: [TODO]
343 Justification: [TODO]
344 Guidelines:
345 • The answer NA means that the abstract and introduction do not include the claims
346 made in the paper.
347 • The abstract and/or introduction should clearly state the claims made, including the
348 contributions made in the paper and important assumptions and limitations. A No or
349 NA answer to this question will not be perceived well by the reviewers.
350 • The claims made should match theoretical and experimental results, and reflect how
351 much the results can be expected to generalize to other settings.
352 • It is fine to include aspirational goals as motivation as long as it is clear that these goals
353 are not attained by the paper.
354 2. Limitations
355 Question: Does the paper discuss the limitations of the work performed by the authors?
356 Answer: [TODO]
357 Justification: [TODO]
358 Guidelines:
359 • The answer NA means that the paper has no limitation while the answer No means that
360 the paper has limitations, but those are not discussed in the paper.
361 • The authors are encouraged to create a separate "Limitations" section in their paper.
362 • The paper should point out any strong assumptions and how robust the results are to
363 violations of these assumptions (e.g., independence assumptions, noiseless settings,
364 model well-specification, asymptotic approximations only holding locally). The authors

365 should reflect on how these assumptions might be violated in practice and what the
366 implications would be.
367 • The authors should reflect on the scope of the claims made, e.g., if the approach was
368 only tested on a few datasets or with a few runs. In general, empirical results often
369 depend on implicit assumptions, which should be articulated.
370 • The authors should reflect on the factors that influence the performance of the approach.
371 For example, a facial recognition algorithm may perform poorly when image resolution
372 is low or images are taken in low lighting. Or a speech-to-text system might not be
373 used reliably to provide closed captions for online lectures because it fails to handle
374 technical jargon.
375 • The authors should discuss the computational efficiency of the proposed algorithms
376 and how they scale with dataset size.
377 • If applicable, the authors should discuss possible limitations of their approach to
378 address problems of privacy and fairness.
379 • While the authors might fear that complete honesty about limitations might be used by
380 reviewers as grounds for rejection, a worse outcome might be that reviewers discover
381 limitations that aren’t acknowledged in the paper. The authors should use their best
382 judgment and recognize that individual actions in favor of transparency play an impor-
383 tant role in developing norms that preserve the integrity of the community. Reviewers
384 will be specifically instructed to not penalize honesty concerning limitations.
385 3. Theory Assumptions and Proofs
386 Question: For each theoretical result, does the paper provide the full set of assumptions and
387 a complete (and correct) proof?
388 Answer: [TODO]
389 Justification: [TODO]
390 Guidelines:
391 • The answer NA means that the paper does not include theoretical results.
392 • All the theorems, formulas, and proofs in the paper should be numbered and cross-
393 referenced.
394 • All assumptions should be clearly stated or referenced in the statement of any theorems.
395 • The proofs can either appear in the main paper or the supplemental material, but if
396 they appear in the supplemental material, the authors are encouraged to provide a short
397 proof sketch to provide intuition.
398 • Inversely, any informal proof provided in the core of the paper should be complemented
399 by formal proofs provided in appendix or supplemental material.
400 • Theorems and Lemmas that the proof relies upon should be properly referenced.
401 4. Experimental Result Reproducibility
402 Question: Does the paper fully disclose all the information needed to reproduce the main ex-
403 perimental results of the paper to the extent that it affects the main claims and/or conclusions
404 of the paper (regardless of whether the code and data are provided or not)?
405 Answer: [TODO]
406 Justification: [TODO]
407 Guidelines:
408 • The answer NA means that the paper does not include experiments.
409 • If the paper includes experiments, a No answer to this question will not be perceived
410 well by the reviewers: Making the paper reproducible is important, regardless of
411 whether the code and data are provided or not.
412 • If the contribution is a dataset and/or model, the authors should describe the steps taken
413 to make their results reproducible or verifiable.
414 • Depending on the contribution, reproducibility can be accomplished in various ways.
415 For example, if the contribution is a novel architecture, describing the architecture fully
416 might suffice, or if the contribution is a specific model and empirical evaluation, it may
417 be necessary to either make it possible for others to replicate the model with the same
418 dataset, or provide access to the model. In general. releasing code and data is often
419 one good way to accomplish this, but reproducibility can also be provided via detailed
420 instructions for how to replicate the results, access to a hosted model (e.g., in the case
421 of a large language model), releasing of a model checkpoint, or other means that are
422 appropriate to the research performed.

423 • While NeurIPS does not require releasing code, the conference does require all submis-
424 sions to provide some reasonable avenue for reproducibility, which may depend on the
425 nature of the contribution. For example
426 (a) If the contribution is primarily a new algorithm, the paper should make it clear how
427 to reproduce that algorithm.
428 (b) If the contribution is primarily a new model architecture, the paper should describe
429 the architecture clearly and fully.
430 (c) If the contribution is a new model (e.g., a large language model), then there should
431 either be a way to access this model for reproducing the results or a way to reproduce
432 the model (e.g., with an open-source dataset or instructions for how to construct
433 the dataset).
434 (d) We recognize that reproducibility may be tricky in some cases, in which case
435 authors are welcome to describe the particular way they provide for reproducibility.
436 In the case of closed-source models, it may be that access to the model is limited in
437 some way (e.g., to registered users), but it should be possible for other researchers
438 to have some path to reproducing or verifying the results.
439 5. Open access to data and code
440 Question: Does the paper provide open access to the data and code, with sufficient instruc-
441 tions to faithfully reproduce the main experimental results, as described in supplemental
442 material?
443 Answer: [TODO]
444 Justification: [TODO]
445 Guidelines:
446 • The answer NA means that paper does not include experiments requiring code.
447 • Please see the NeurIPS code and data submission guidelines (https://nips.cc/
448 public/guides/CodeSubmissionPolicy) for more details.
449 • While we encourage the release of code and data, we understand that this might not be
450 possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not
451 including code, unless this is central to the contribution (e.g., for a new open-source
452 benchmark).
453 • The instructions should contain the exact command and environment needed to run to
454 reproduce the results. See the NeurIPS code and data submission guidelines (https:
455 //nips.cc/public/guides/CodeSubmissionPolicy) for more details.
456 • The authors should provide instructions on data access and preparation, including how
457 to access the raw data, preprocessed data, intermediate data, and generated data, etc.
458 • The authors should provide scripts to reproduce all experimental results for the new
459 proposed method and baselines. If only a subset of experiments are reproducible, they
460 should state which ones are omitted from the script and why.
461 • At submission time, to preserve anonymity, the authors should release anonymized
462 versions (if applicable).
463 • Providing as much information as possible in supplemental material (appended to the
464 paper) is recommended, but including URLs to data and code is permitted.
465 6. Experimental Setting/Details
466 Question: Does the paper specify all the training and test details (e.g., data splits, hyper-
467 parameters, how they were chosen, type of optimizer, etc.) necessary to understand the
468 results?
469 Answer: [TODO]
470 Justification: [TODO]
471 Guidelines:
472 • The answer NA means that the paper does not include experiments.
473 • The experimental setting should be presented in the core of the paper to a level of detail
474 that is necessary to appreciate the results and make sense of them.
475 • The full details can be provided either with the code, in appendix, or as supplemental
476 material.
477 7. Experiment Statistical Significance
478 Question: Does the paper report error bars suitably and correctly defined or other appropriate
479 information about the statistical significance of the experiments?
480 Answer: [TODO]
481 Justification: [TODO]

482 Guidelines:
483 • The answer NA means that the paper does not include experiments.
484 • The authors should answer "Yes" if the results are accompanied by error bars, confi-
485 dence intervals, or statistical significance tests, at least for the experiments that support
486 the main claims of the paper.
487 • The factors of variability that the error bars are capturing should be clearly stated (for
488 example, train/test split, initialization, random drawing of some parameter, or overall
489 run with given experimental conditions).
490 • The method for calculating the error bars should be explained (closed form formula,
491 call to a library function, bootstrap, etc.)
492 • The assumptions made should be given (e.g., Normally distributed errors).
493 • It should be clear whether the error bar is the standard deviation or the standard error
494 of the mean.
495 • It is OK to report 1-sigma error bars, but one should state it. The authors should
496 preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis
497 of Normality of errors is not verified.
498 • For asymmetric distributions, the authors should be careful not to show in tables or
499 figures symmetric error bars that would yield results that are out of range (e.g. negative
500 error rates).
501 • If error bars are reported in tables or plots, The authors should explain in the text how
502 they were calculated and reference the corresponding figures or tables in the text.
503 8. Experiments Compute Resources
504 Question: For each experiment, does the paper provide sufficient information on the com-
505 puter resources (type of compute workers, memory, time of execution) needed to reproduce
506 the experiments?
507 Answer: [TODO]
508 Justification: [TODO]
509 Guidelines:
510 • The answer NA means that the paper does not include experiments.
511 • The paper should indicate the type of compute workers CPU or GPU, internal cluster,
512 or cloud provider, including relevant memory and storage.
513 • The paper should provide the amount of compute required for each of the individual
514 experimental runs as well as estimate the total compute.
515 • The paper should disclose whether the full research project required more compute
516 than the experiments reported in the paper (e.g., preliminary or failed experiments that
517 didn’t make it into the paper).
518 9. Code Of Ethics
519 Question: Does the research conducted in the paper conform, in every respect, with the
520 NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
521 Answer: [TODO]
522 Justification: [TODO]
523 Guidelines:
524 • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
525 • If the authors answer No, they should explain the special circumstances that require a
526 deviation from the Code of Ethics.
527 • The authors should make sure to preserve anonymity (e.g., if there is a special consid-
528 eration due to laws or regulations in their jurisdiction).
529 10. Broader Impacts
530 Question: Does the paper discuss both potential positive societal impacts and negative
531 societal impacts of the work performed?
532 Answer: [TODO]
533 Justification: [TODO]
534 Guidelines:
535 • The answer NA means that there is no societal impact of the work performed.
536 • If the authors answer NA or No, they should explain why their work has no societal
537 impact or why the paper does not address societal impact.
538 • Examples of negative societal impacts include potential malicious or unintended uses
539 (e.g., disinformation, generating fake profiles, surveillance), fairness considerations

540 (e.g., deployment of technologies that could make decisions that unfairly impact specific
541 groups), privacy considerations, and security considerations.
542 • The conference expects that many papers will be foundational research and not tied
543 to particular applications, let alone deployments. However, if there is a direct path to
544 any negative applications, the authors should point it out. For example, it is legitimate
545 to point out that an improvement in the quality of generative models could be used to
546 generate deepfakes for disinformation. On the other hand, it is not needed to point out
547 that a generic algorithm for optimizing neural networks could enable people to train
548 models that generate Deepfakes faster.
549 • The authors should consider possible harms that could arise when the technology is
550 being used as intended and functioning correctly, harms that could arise when the
551 technology is being used as intended but gives incorrect results, and harms following
552 from (intentional or unintentional) misuse of the technology.
553 • If there are negative societal impacts, the authors could also discuss possible mitigation
554 strategies (e.g., gated release of models, providing defenses in addition to attacks,
555 mechanisms for monitoring misuse, mechanisms to monitor how a system learns from
556 feedback over time, improving the efficiency and accessibility of ML).
557 11. Safeguards
558 Question: Does the paper describe safeguards that have been put in place for responsible
559 release of data or models that have a high risk for misuse (e.g., pretrained language models,
560 image generators, or scraped datasets)?
561 Answer: [TODO]
562 Justification: [TODO]
563 Guidelines:
564 • The answer NA means that the paper poses no such risks.
565 • Released models that have a high risk for misuse or dual-use should be released with
566 necessary safeguards to allow for controlled use of the model, for example by requiring
567 that users adhere to usage guidelines or restrictions to access the model or implementing
568 safety filters.
569 • Datasets that have been scraped from the Internet could pose safety risks. The authors
570 should describe how they avoided releasing unsafe images.
571 • We recognize that providing effective safeguards is challenging, and many papers do
572 not require this, but we encourage authors to take this into account and make a best
573 faith effort.
574 12. Licenses for existing assets
575 Question: Are the creators or original owners of assets (e.g., code, data, models), used in
576 the paper, properly credited and are the license and terms of use explicitly mentioned and
577 properly respected?
578 Answer: [TODO]
579 Justification: [TODO]
580 Guidelines:
581 • The answer NA means that the paper does not use existing assets.
582 • The authors should cite the original paper that produced the code package or dataset.
583 • The authors should state which version of the asset is used and, if possible, include a
584 URL.
585 • The name of the license (e.g., CC-BY 4.0) should be included for each asset.
586 • For scraped data from a particular source (e.g., website), the copyright and terms of
587 service of that source should be provided.
588 • If assets are released, the license, copyright information, and terms of use in the
589 package should be provided. For popular datasets, paperswithcode.com/datasets
590 has curated licenses for some datasets. Their licensing guide can help determine the
591 license of a dataset.
592 • For existing datasets that are re-packaged, both the original license and the license of
593 the derived asset (if it has changed) should be provided.
594 • If this information is not available online, the authors are encouraged to reach out to
595 the asset’s creators.
596 13. New Assets
597 Question: Are new assets introduced in the paper well documented and is the documentation
598 provided alongside the assets?

599 Answer: [TODO]
600 Justification: [TODO]
601 Guidelines:
602 • The answer NA means that the paper does not release new assets.
603 • Researchers should communicate the details of the dataset/code/model as part of their
604 submissions via structured templates. This includes details about training, license,
605 limitations, etc.
606 • The paper should discuss whether and how consent was obtained from people whose
607 asset is used.
608 • At submission time, remember to anonymize your assets (if applicable). You can either
609 create an anonymized URL or include an anonymized zip file.
610 14. Crowdsourcing and Research with Human Subjects
611 Question: For crowdsourcing experiments and research with human subjects, does the paper
612 include the full text of instructions given to participants and screenshots, if applicable, as
613 well as details about compensation (if any)?
614 Answer: [TODO]
615 Justification: [TODO]
616 Guidelines:
617 • The answer NA means that the paper does not involve crowdsourcing nor research with
618 human subjects.
619 • Including this information in the supplemental material is fine, but if the main contribu-
620 tion of the paper involves human subjects, then as much detail as possible should be
621 included in the main paper.
622 • According to the NeurIPS Code of Ethics, workers involved in data collection, curation,
623 or other labor should be paid at least the minimum wage in the country of the data
624 collector.
625 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human
626 Subjects
627 Question: Does the paper describe potential risks incurred by study participants, whether
628 such risks were disclosed to the subjects, and whether Institutional Review Board (IRB)
629 approvals (or an equivalent approval/review based on the requirements of your country or
630 institution) were obtained?
631 Answer: [TODO]
632 Justification: [TODO]
633 Guidelines:
634 • The answer NA means that the paper does not involve crowdsourcing nor research with
635 human subjects.
636 • Depending on the country in which research is conducted, IRB approval (or equivalent)
637 may be required for any human subjects research. If you obtained IRB approval, you
638 should clearly state this in the paper.
639 • We recognize that the procedures for this may vary significantly between institutions
640 and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the
641 guidelines for their institution.
642 • For initial submissions, do not include any information that would break anonymity (if
643 applicable), such as the institution conducting the review.

