Vision Transformer (ViT)
Overview
The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander
Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. It’s the first paper that
successfully trains a Transformer encoder on ImageNet, attaining very good results compared to
familiar convolutional architectures.
While the Transformer architecture has become the de-facto standard for natural language
processing tasks, its applications to computer vision remain limited. In vision, attention is either
applied in conjunction with convolutional networks, or used to replace certain components of
convolutional networks while keeping their overall structure in place. We show that this reliance
on CNNs is not necessary and a pure transformer applied directly to sequences of image patches
can perform very well on image classification tasks. When pre-trained on large amounts of data
and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-
100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art
convolutional networks while requiring substantially fewer computational resources to train.
Following the original Vision Transformer, several follow-up works have been released:
DeiT (Data-efficient Image Transformers) by Facebook AI. DeiT models are distilled vision
transformers. The authors of DeiT also released more efficiently trained ViT models, which
you can directly plug into ViTModel or ViTForImageClassification. There are 4 variants
available (in 3 different sizes): facebook/deit-tiny-patch16-224, facebook/deit-small-patch16-
224, facebook/deit-base-patch16-224 and facebook/deit-base-patch16-384. Note that one
should use DeiTImageProcessor in order to prepare images for the model.
DINO (a method for self-supervised training of Vision Transformers) by Facebook AI. Vision
Transformers trained using the DINO method show very interesting properties not seen
with convolutional models. They are capable of segmenting objects, without having ever
been trained to do so. DINO checkpoints can be found on the hub.
This model was contributed by nielsr. The original code (written in JAX) can be found here.
Note that we converted the weights from Ross Wightman’s timm library, who already converted
the weights from JAX to PyTorch. Credits go to him!
Usage tips
To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as a representation of the entire image, which can be used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
As the Vision Transformer expects each image to be of the same size (resolution), one can
use ViTImageProcessor to resize (or rescale) and normalize images for the model.
Both the patch resolution and image resolution used during pre-training or fine-tuning are
reflected in the name of each checkpoint. For example, google/vit-base-patch16-224
refers to a base-sized architecture with patch resolution of 16x16 and fine-tuning resolution
of 224x224. All checkpoints can be found on the hub.
The Vision Transformer was pre-trained using a resolution of 224x224. During fine-tuning, it
is often beneficial to use a higher resolution than pre-training (Touvron et al., 2019),
(Kolesnikov et al., 2020). In order to fine-tune at higher resolution, the authors perform 2D
interpolation of the pre-trained position embeddings, according to their location in the
original image.
The best results are obtained with supervised pre-training, which is not the case in NLP. The
authors also performed an experiment with a self-supervised pre-training objective,
namely masked patch prediction (inspired by masked language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training.
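To illustrate the higher-resolution tip above, here is a minimal sketch (the 480x480 target size and the image URL are arbitrary choices for the example) that runs a 224x224 checkpoint at a higher resolution by interpolating the position embeddings:

>>> from transformers import ViTImageProcessor, ViTForImageClassification
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # resize to a higher resolution than the 224x224 used during pre-training
>>> processor = ViTImageProcessor.from_pretrained(
...     "google/vit-base-patch16-224", size={"height": 480, "width": 480}
... )
>>> model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
...     # interpolate the pre-trained position embeddings to the new resolution
...     logits = model(**inputs, interpolate_pos_encoding=True).logits
>>> predicted_label = logits.argmax(-1).item()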
Using Scaled Dot Product Attention (SDPA)
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.
SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may
also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be
used.
For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16
or torch.bfloat16 ).
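For example, a minimal loading sketch (the checkpoint name and device placement are illustrative):

from transformers import ViTForImageClassification
import torch

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
model = model.to("cuda")  # assumes a CUDA-capable GPU is available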
On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with float32 and the google/vit-base-patch16-224 model, we saw the following speedups during inference:

| Batch size | Average inference time (ms), eager mode | Average inference time (ms), SDPA | Speedup, SDPA / eager (x) |
|---|---|---|---|
| 1 | 7 | 6 | 1.17 |
| 2 | 8 | 6 | 1.33 |
| 4 | 8 | 6 | 1.33 |
| 8 | 8 | 6 | 1.33 |
Resources
Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found
here. A list of official Hugging Face and community (indicated by 🌎) resources to help you get
started with ViT. If you’re interested in submitting a resource to be included here, please feel free
to open a Pull Request and we’ll review it! The resource should ideally demonstrate something
new instead of duplicating an existing resource.
Image Classification
A blog post on how to Fine-Tune ViT for Image Classification with Hugging Face
Transformers
A blog post on Image Classification with Hugging Face Transformers and Keras
A notebook on how to Fine-tune the Vision Transformer on CIFAR-10 with the Hugging Face
Trainer
A notebook on how to Fine-tune the Vision Transformer on CIFAR-10 with PyTorch Lightning
⚗️ Optimization
A blog post on how to Accelerate Vision Transformer (ViT) with Quantization using Optimum
⚡️ Inference
🚀 Deploy
A blog post on Deploying TensorFlow Vision Models in Hugging Face with TF Serving
ViTConfig
Parameters
• hidden_size ( int , optional, defaults to 768) — Dimensionality of the encoder layers and the
pooler layer.
• num_attention_heads ( int , optional, defaults to 12) — Number of attention heads for each
attention layer in the Transformer encoder.
This is the configuration class to store the configuration of a ViTModel. It is used to instantiate a ViT model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the
ViT google/vit-base-patch16-224 architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model
outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import ViTConfig, ViTModel
>>> configuration = ViTConfig()
>>> # Initializing a model (with random weights) from the vit-base-patch16-224 style configuration
>>> model = ViTModel(configuration)
ViTFeatureExtractor
( *args, **kwargs )

__call__ ( images, **kwargs )
ViTImageProcessor
Parameters
• do_resize ( bool , optional, defaults to True ) — Whether to resize the image’s (height, width)
dimensions to the specified (size["height"], size["width"]) . Can be overridden by
the do_resize parameter in the preprocess method.
• size ( dict , optional, defaults to {"height": 224, "width": 224} ) — Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.
• do_rescale ( bool , optional, defaults to True ) — Whether to rescale the image by the specified scale rescale_factor . Can be overridden by the do_rescale parameter in the preprocess method.
• rescale_factor ( int or float , optional, defaults to 1/255 ) — Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.

preprocess
Parameters
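For a quick sense of end-to-end preprocessing, here is a small usage sketch (the checkpoint name and image URL are illustrative):

>>> from transformers import ViTImageProcessor
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> inputs = processor(images=image, return_tensors="pt")
>>> list(inputs["pixel_values"].shape)
[1, 3, 224, 224]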
ViTImageProcessorFast
Parameters
• do_resize ( bool , optional, defaults to True ) — Whether to resize the image’s (height, width)
dimensions to the specified (size["height"], size["width"]) . Can be overridden by
the do_resize parameter in the preprocess method.
• size ( dict , optional, defaults to {"height": 224, "width": 224} ) — Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.
• do_rescale ( bool , optional, defaults to True ) — Whether to rescale the image by the specified scale rescale_factor . Can be overridden by the do_rescale parameter in the preprocess method.
• rescale_factor ( int or float , optional, defaults to 1/255 ) — Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.

preprocess
Parameters

• do_rescale ( bool , optional, defaults to self.do_rescale ) — Whether to rescale the image values between [0 - 1].
• rescale_factor ( float , optional, defaults to self.rescale_factor ) — Rescale factor to use if rescaling the image.
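As a brief illustration (a sketch, assuming a torchvision-backed installation where the fast processor is available), the fast variant can be loaded directly or requested through AutoImageProcessor:

>>> from transformers import AutoImageProcessor, ViTImageProcessorFast

>>> processor = ViTImageProcessorFast.from_pretrained("google/vit-base-patch16-224")
>>> # equivalently, ask AutoImageProcessor for the fast implementation
>>> processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224", use_fast=True)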
ViTModel
Parameters
• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.
The bare ViT Model transformer outputting raw hidden-states without any specific head
on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch
Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.
forward
Parameters
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
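A minimal usage sketch (the image URL and checkpoint name are illustrative; the in21k checkpoint is the backbone-only variant):

>>> from transformers import AutoImageProcessor, ViTModel
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 197, 768]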
ViTForMaskedImageModeling
( config: ViTConfig )
Parameters
• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.
ViT Model with a decoder on top for masked image modeling, as proposed in SimMIM.
Note that we provide a script to pre-train this model on custom data in our examples
directory.
forward
Parameters
• output_hidden_states ( bool , optional) — Whether or not to return the hidden states of all layers.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
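A usage sketch (the image URL and checkpoint name are illustrative; the mask here is random):

>>> from transformers import AutoImageProcessor, ViTForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create a random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
>>> list(reconstructed_pixel_values.shape)
[1, 3, 224, 224]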
ViTForImageClassification
( config: ViTConfig )
Parameters
• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.
ViT Model transformer with an image classification head on top (a linear layer on top of
the final hidden state of the [CLS] token) e.g. for ImageNet.
Note that it’s possible to fine-tune ViT on higher resolution images than the ones it has
been trained on, by setting interpolate_pos_encoding to True in the forward of
the model. This will interpolate the pre-trained position embeddings to the higher
resolution.
forward
Parameters
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
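A usage sketch (the image URL is chosen for illustration; the checkpoint is fine-tuned on ImageNet-1k):

>>> from transformers import AutoImageProcessor, ViTForImageClassification
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # the model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
Egyptian cat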
TFViTModel
Parameters
• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.
The bare ViT Model transformer outputting raw hidden-states without any specific head
on top.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer
to the TF 2.0 documentation for all matter related to general usage and behavior.
•a list of varying length with one or several input Tensors IN THE ORDER given in
the docstring: model([pixel_values, attention_mask]) or
model([pixel_values, attention_mask, token_type_ids])
•a dictionary with one or several input Tensors associated to the input names
given in the docstring: model({"pixel_values": pixel_values,
"token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to
worry about any of this, as you can just pass inputs like you would to any other Python
function!
call
Parameters
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
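A usage sketch (the image URL and checkpoint name are illustrative):

>>> from transformers import AutoImageProcessor, TFViTModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(images=image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 197, 768]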
TFViTForImageClassification
Parameters
• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.
ViT Model transformer with an image classification head on top (a linear layer on top of
the final hidden state of the [CLS] token) e.g. for ImageNet.
Note that it’s possible to fine-tune ViT on higher resolution images than the ones it has
been trained on, by setting interpolate_pos_encoding to True in the forward of
the model. This will interpolate the pre-trained position embeddings to the higher
resolution.
This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer
to the TF 2.0 documentation for all matter related to general usage and behavior.
•a list of varying length with one or several input Tensors IN THE ORDER given in
the docstring: model([pixel_values, attention_mask]) or
model([pixel_values, attention_mask, token_type_ids])
•a dictionary with one or several input Tensors associated to the input names
given in the docstring: model({"pixel_values": pixel_values,
"token_type_ids": token_type_ids})
Note that when creating models and layers with subclassing then you don’t need to
worry about any of this, as you can just pass inputs like you would to any other Python
function!
call
Parameters
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
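A usage sketch that leads to the printed output below (the image URL and checkpoint name are illustrative):

>>> from transformers import AutoImageProcessor, TFViTForImageClassification
>>> import tensorflow as tf
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = TFViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(images=image, return_tensors="tf")
>>> logits = model(**inputs).logits

>>> # the model predicts one of the 1000 ImageNet classes
>>> predicted_label = int(tf.math.argmax(logits, axis=-1))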
>>> print(model.config.id2label[predicted_label])
Egyptian cat
FlaxViTModel
Parameters
• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.
Note that this only specifies the dtype of the computation and does not influence the
dtype of model parameters.
If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().
The bare ViT Model transformer outputting raw hidden-states without any specific head
on top.
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading, saving and converting weights from PyTorch models).
This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and
refer to the Flax documentation for all matter related to general usage and behavior.
__call__
Returns: transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False ) comprising various elements depending on the configuration ( <class 'transformers.models.vit.configuration_vit.ViTConfig'> ) and inputs.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Examples:
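A usage sketch (the image URL and checkpoint name are illustrative):

>>> from transformers import AutoImageProcessor, FlaxViTModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = FlaxViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state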
FlaxViTForImageClassification
Parameters
• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.
Note that this only specifies the dtype of the computation and does not influence the
dtype of model parameters.
If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().
ViT Model transformer with an image classification head on top (a linear layer on top of
the final hidden state of the [CLS] token) e.g. for ImageNet.
This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading, saving and converting weights from PyTorch models).
This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and
refer to the Flax documentation for all matter related to general usage and behavior.
__call__
Returns: transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False ) comprising various elements depending on the configuration ( <class 'transformers.models.vit.configuration_vit.ViTConfig'> ) and inputs.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
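A usage sketch (the image URL and checkpoint name are illustrative):

>>> from transformers import AutoImageProcessor, FlaxViTForImageClassification
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = FlaxViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(images=image, return_tensors="np")
>>> logits = model(**inputs).logits

>>> # the model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])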