
Vision Transformer (ViT)

Overview

The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander
Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. It’s the first paper that
successfully trains a Transformer encoder on ImageNet, attaining very good results compared to
familiar convolutional architectures.

The abstract from the paper is the following:

While the Transformer architecture has become the de-facto standard for natural language
processing tasks, its applications to computer vision remain limited. In vision, attention is either
applied in conjunction with convolutional networks, or used to replace certain components of
convolutional networks while keeping their overall structure in place. We show that this reliance
on CNNs is not necessary and a pure transformer applied directly to sequences of image patches
can perform very well on image classification tasks. When pre-trained on large amounts of data
and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-
100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art
convolutional networks while requiring substantially fewer computational resources to train.


ViT architecture. Taken from the original paper.

Following the original Vision Transformer, some follow-up works have been made:

DeiT (Data-efficient Image Transformers) by Facebook AI. DeiT models are distilled vision
transformers. The authors of DeiT also released more efficiently trained ViT models, which
you can directly plug into ViTModel or ViTForImageClassification. There are 4 variants
available (in 3 different sizes): facebook/deit-tiny-patch16-224, facebook/deit-small-patch16-
224, facebook/deit-base-patch16-224 and facebook/deit-base-patch16-384. Note that one
should use DeiTImageProcessor in order to prepare images for the model.

BEiT (BERT pre-training of Image Transformers) by Microsoft Research. BEiT models outperform supervised pre-trained vision transformers using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.

DINO (a method for self-supervised training of Vision Transformers) by Facebook AI. Vision
Transformers trained using the DINO method show very interesting properties not seen
with convolutional models. They are capable of segmenting objects, without having ever
been trained to do so. DINO checkpoints can be found on the hub.

MAE (Masked Autoencoders) by Facebook AI. By pre-training Vision Transformers to reconstruct pixel values for a high portion (75%) of masked patches (using an asymmetric encoder-decoder architecture), the authors show that this simple method outperforms supervised pre-training after fine-tuning.


This model was contributed by nielsr. The original code (written in JAX) can be found here.

Note that we converted the weights from Ross Wightman’s timm library, who already converted
the weights from JAX to PyTorch. Credits go to him!

Usage tips

To feed images to the Transformer encoder, each image is split into a sequence of fixed-size
non-overlapping patches, which are then linearly embedded. A [CLS] token is added to
serve as representation of an entire image, which can be used for classification. The
authors also add absolute position embeddings, and feed the resulting sequence of vectors
to a standard Transformer encoder.
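
As a quick sanity check, the sequence length seen by the encoder follows directly from the image and patch sizes. The short sketch below (plain Python, no library calls) reproduces the 197-token sequence that also shows up in the ViTModel example further down:

# Sequence length of the base ViT encoder for a 224x224 image split into 16x16 patches
image_size, patch_size, hidden_size = 224, 16, 768

num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
seq_len = num_patches + 1                      # + 1 for the [CLS] token -> 197

print(num_patches, seq_len, hidden_size)       # 196 197 768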

As the Vision Transformer expects each image to be of the same size (resolution), one can
use ViTImageProcessor to resize (or rescale) and normalize images for the model.
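
For instance, a minimal sketch of preparing a single image this way (the local file name is only a placeholder; any RGB image works):

from PIL import Image
from transformers import ViTImageProcessor

image = Image.open("cat.png").convert("RGB")  # placeholder path used for illustration

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = processor(images=image, return_tensors="pt")

print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])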

Both the patch resolution and image resolution used during pre-training or fine-tuning are
reflected in the name of each checkpoint. For example, google/vit-base-patch16-224
refers to a base-sized architecture with patch resolution of 16x16 and fine-tuning resolution
of 224x224. All checkpoints can be found on the hub.

The available checkpoints are either (1) pre-trained on ImageNet-21k (a collection of 14 million images and 21k classes) only, or (2) also fine-tuned on ImageNet (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).

The Vision Transformer was pre-trained using a resolution of 224x224. During fine-tuning, it
is often beneficial to use a higher resolution than pre-training (Touvron et al., 2019),
(Kolesnikov et al., 2020). In order to fine-tune at higher resolution, the authors perform 2D
interpolation of the pre-trained position embeddings, according to their location in the
original image.
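
A minimal sketch of running the model at a higher resolution than it was pre-trained on (384x384 is only an illustrative choice, and the random tensor stands in for a properly preprocessed image):

import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# 384x384 input -> (384 / 16) ** 2 = 576 patches instead of 196
pixel_values = torch.randn(1, 3, 384, 384)

with torch.no_grad():
    # interpolate_pos_encoding=True resizes the pre-trained position embeddings to the new grid
    logits = model(pixel_values, interpolate_pos_encoding=True).logits

print(logits.shape)  # torch.Size([1, 1000])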

The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed an experiment with a self-supervised pre-training objective, namely masked patch prediction (inspired by masked language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training.

Using Scaled Dot Product Attention (SDPA)


PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional . This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may
also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be
used.

import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224", attn_implementation="sdpa", torch_dtype=torch.float16)
...

For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16
or torch.bfloat16 ).

On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with float32 and
google/vit-base-patch16-224 model, we saw the following speedups during inference.

| Batch size | Average inference time (ms), eager mode | Average inference time (ms), SDPA | Speedup, SDPA / eager (x) |
|---|---|---|---|
| 1 | 7 | 6 | 1.17 |
| 2 | 8 | 6 | 1.33 |
| 4 | 8 | 6 | 1.33 |
| 8 | 8 | 6 | 1.33 |
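
A rough sketch of how such a comparison could be reproduced locally is shown below; it is not the exact benchmark script, and absolute numbers will vary with hardware, batch size and PyTorch version:

import time
import torch
from transformers import ViTForImageClassification

def avg_inference_ms(attn_implementation: str, batch_size: int = 8, steps: int = 20) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = ViTForImageClassification.from_pretrained(
        "google/vit-base-patch16-224", attn_implementation=attn_implementation
    ).to(device)
    model.eval()
    pixel_values = torch.randn(batch_size, 3, 224, 224, device=device)
    with torch.no_grad():
        for _ in range(3):  # warm-up passes
            model(pixel_values)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(steps):
            model(pixel_values)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps * 1000.0

print("eager:", avg_inference_ms("eager"), "ms")
print("sdpa :", avg_inference_ms("sdpa"), "ms")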

Resources

Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found
here. A list of official Hugging Face and community (indicated by 🌎) resources to help you get
started with ViT. If you’re interested in submitting a resource to be included here, please feel free
to open a Pull Request and we’ll review it! The resource should ideally demonstrate something
new instead of duplicating an existing resource.


ViTForImageClassification is supported by:

Image Classification

A blog post on how to Fine-Tune ViT for Image Classification with Hugging Face
Transformers

A blog post on Image Classification with Hugging Face Transformers and Keras

A notebook on Fine-tuning for Image Classification with Hugging Face Transformers

A notebook on how to Fine-tune the Vision Transformer on CIFAR-10 with the Hugging Face
Trainer

A notebook on how to Fine-tune the Vision Transformer on CIFAR-10 with PyTorch Lightning

⚗️ Optimization

A blog post on how to Accelerate Vision Transformer (ViT) with Quantization using Optimum

⚡️ Inference

A notebook on Quick demo: Vision Transformer (ViT) by Google Brain

🚀 Deploy

A blog post on Deploying Tensorflow Vision Models in Hugging Face with TF Serving

A blog post on Deploying Hugging Face ViT on Vertex AI

A blog post on Deploying Hugging Face ViT on Kubernetes with TF Serving

ViTConfig

class transformers.ViTConfig <>

( hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size =


3072, hidden_act = 'gelu', hidden_dropout_prob = 0.0, attention_probs_dropout_prob = 0.0,
initializer_range = 0.02, layer_norm_eps = 1e-12, image_size = 224, patch_size = 16,
num_channels = 3, qkv_bias = True, encoder_stride = 16, **kwargs )

Parameters


• hidden_size ( int , optional, defaults to 768) — Dimensionality of the encoder layers and the
pooler layer.

• num_hidden_layers ( int , optional, defaults to 12) — Number of hidden layers in the


Transformer encoder.

• num_attention_heads ( int , optional, defaults to 12) — Number of attention heads for each
attention layer in the Transformer encoder.

• intermediate_size ( int , optional, defaults to 3072) — Dimensionality of the “intermediate”


(i.e., feed-forward) layer in the Transformer encoder.

• hidden_act ( str or function , optional, defaults to "gelu" ) — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu" , "relu" , "selu" and "gelu_new" are supported.

This is the configuration class to store the configuration of a ViTModel. It is used to instantiate a ViT model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the
ViT google/vit-base-patch16-224 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model
outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import ViTConfig, ViTModel

>>> # Initializing a ViT vit-base-patch16-224 style configuration
>>> configuration = ViTConfig()

>>> # Initializing a model (with random weights) from the vit-base-patch16-224 style configuration
>>> model = ViTModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

ViTFeatureExtractor

class transformers.ViTFeatureExtractor <>


( *args, **kwargs )

__call__ <>

( images, **kwargs )

Preprocess an image or a batch of images.

ViTImageProcessor

class transformers.ViTImageProcessor <>

( do_resize: bool = True, size: typing.Optional[typing.Dict[str, int]] = None, resample:


Resampling = <Resampling.BILINEAR: 2>, do_rescale: bool = True, rescale_factor:
typing.Union[int, float] = 0.00392156862745098, do_normalize: bool = True, image_mean:
typing.Union[float, typing.List[float], NoneType] = None, image_std: typing.Union[float,
typing.List[float], NoneType] = None, do_convert_rgb: typing.Optional[bool] = None,
**kwargs )

Parameters

• do_resize ( bool , optional, defaults to True ) — Whether to resize the image’s (height, width)
dimensions to the specified (size["height"], size["width"]) . Can be overridden by
the do_resize parameter in the preprocess method.

• size ( dict , optional, defaults to {"height": 224, "width": 224} ) — Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.

• resample ( PILImageResampling , optional, defaults to Resampling.BILINEAR ) —


Resampling filter to use if resizing the image. Can be overridden by the resample parameter
in the preprocess method.

• do_rescale ( bool , optional, defaults to True ) — Whether to rescale the image by the specified scale rescale_factor . Can be overridden by the do_rescale parameter in the preprocess method.

• rescale_factor ( int or float , optional, defaults to 1/255 ) — Scale factor to use if rescaling the image.

Constructs a ViT image processor.

preprocess <>


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray,


ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')],
typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]], do_resize:
typing.Optional[bool] = None, size: typing.Dict[str, int] = None, resample: Resampling =
None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] =
None, do_normalize: typing.Optional[bool] = None, image_mean: typing.Union[float,
typing.List[float], NoneType] = None, image_std: typing.Union[float, typing.List[float],
NoneType] = None, return_tensors: typing.Union[str,
transformers.utils.generic.TensorType, NoneType] = None, data_format: typing.Union[str,
transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>,
input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension,
NoneType] = None, do_convert_rgb: typing.Optional[bool] = None )

Parameters

• images ( ImageInput ) — Image to preprocess. Expects a single or batch of images with


pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1,
set do_rescale=False .

• do_resize ( bool , optional, defaults to self.do_resize ) — Whether to resize the image.

• size ( Dict[str, int] , optional, defaults to self.size ) — Dictionary in the format


{"height": h, "width": w} specifying the size of the output image after resizing.

• resample ( PILImageResampling filter, optional, defaults to self.resample ) — PILImageResampling filter to use if resizing the image, e.g. PILImageResampling.BILINEAR . Only has an effect if do_resize is set to True .

• do_rescale ( bool , optional, defaults to self.do_rescale ) — Whether to rescale the image values between [0 - 1].

Preprocess an image or batch of images.
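
As a small illustration of the images/do_rescale note above: for inputs whose pixel values are already in [0, 1] (for example a float numpy array), rescaling should be disabled. A minimal sketch with a synthetic image:

import numpy as np
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# a synthetic float image with values already in [0, 1], channels-last (H, W, C)
image = np.random.rand(224, 224, 3).astype(np.float32)

# skip the 1/255 rescaling, but still resize and normalize
inputs = processor(images=image, do_rescale=False, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])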

ViTImageProcessorFast

class transformers.ViTImageProcessorFast <>

( do_resize: bool = True, size: typing.Optional[typing.Dict[str, int]] = None, resample:


Resampling = <Resampling.BILINEAR: 2>, do_rescale: bool = True, rescale_factor:
typing.Union[int, float] = 0.00392156862745098, do_normalize: bool = True, image_mean:
typing.Union[float, typing.List[float], NoneType] = None, image_std: typing.Union[float,
typing.List[float], NoneType] = None, do_convert_rgb: typing.Optional[bool] = None,
**kwargs )

Parameters


• do_resize ( bool , optional, defaults to True ) — Whether to resize the image’s (height, width)
dimensions to the specified (size["height"], size["width"]) . Can be overridden by
the do_resize parameter in the preprocess method.

• size ( dict , optional, defaults to {"height": 224, "width": 224} ) — Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.

• resample ( PILImageResampling , optional, defaults to Resampling.BILINEAR ) —


Resampling filter to use if resizing the image. Can be overridden by the resample parameter
in the preprocess method.

• do_rescale ( bool , optional, defaults to True ) — Whether to rescale the image by the specified scale rescale_factor . Can be overridden by the do_rescale parameter in the preprocess method.

• rescale_factor ( int or float , optional, defaults to 1/255 ) — Scale factor to use if rescaling the image.

Constructs a ViT image processor.

preprocess <>

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray,


ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')],
typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]], do_resize:
typing.Optional[bool] = None, size: typing.Dict[str, int] = None, resample: Resampling =
None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] =
None, do_normalize: typing.Optional[bool] = None, image_mean: typing.Union[float,
typing.List[float], NoneType] = None, image_std: typing.Union[float, typing.List[float],
NoneType] = None, return_tensors: typing.Union[str,
transformers.utils.generic.TensorType, NoneType] = 'pt', data_format: typing.Union[str,
transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>,
input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension,
NoneType] = None, do_convert_rgb: typing.Optional[bool] = None, **kwargs )

Parameters

• images ( ImageInput ) — Image to preprocess. Expects a single or batch of images with


pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1,
set do_rescale=False .

• do_resize ( bool , optional, defaults to self.do_resize ) — Whether to resize the image.

• size ( Dict[str, int] , optional, defaults to self.size ) — Dictionary in the format


{"height": h, "width": w} specifying the size of the output image after resizing.

• resample ( PILImageResampling filter, optional, defaults to self.resample ) — PILImageResampling filter to use if resizing the image, e.g. PILImageResampling.BILINEAR . Only has an effect if do_resize is set to True .

• do_rescale ( bool , optional, defaults to self.do_rescale ) — Whether to rescale the image values between [0 - 1].

• rescale_factor ( float , optional, defaults to self.rescale_factor ) — Rescale factor to use if rescaling the image.

Preprocess an image or batch of images.

do_convert_rgb ( bool , optional): Whether to convert the image to RGB.
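
A short sketch of the fast processor, which mirrors ViTImageProcessor but defaults to returning PyTorch tensors ( return_tensors = 'pt' in the preprocess signature above) and relies on torchvision being installed; the file name is only a placeholder:

from PIL import Image
from transformers import ViTImageProcessorFast

processor = ViTImageProcessorFast.from_pretrained("google/vit-base-patch16-224")

# placeholder path used for illustration
image = Image.open("cat.png").convert("RGB")

inputs = processor(images=image)  # defaults to return_tensors="pt"
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])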

Pytorch

ViTModel

class transformers.ViTModel <>

( config: ViTConfig, add_pooling_layer: bool = True, use_mask_token: bool = False )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

The bare ViT Model transformer outputting raw hidden-states without any specific head
on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch
Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.

forward <>

( pixel_values: typing.Optional[torch.Tensor] = None, bool_masked_pos:


typing.Optional[torch.BoolTensor] = None, head_mask: typing.Optional[torch.Tensor]
= None, output_attentions: typing.Optional[bool] = None, output_hidden_states:
typing.Optional[bool] = None, interpolate_pos_encoding: typing.Optional[bool] =
None, return_dict: typing.Optional[bool] = None ) →
transformers.modeling_outputs.BaseModelOutputWithPooling or
tuple(torch.FloatTensor)

Parameters


• pixel_values ( torch.FloatTensor of shape (batch_size, num_channels,


height, width) ) — Pixel values. Pixel values can be obtained using
AutoImageProcessor. See ViTImageProcessor.call() for details.

• head_mask ( torch.FloatTensor of shape (num_heads,) or (num_layers,


num_heads) , optional) — Mask to nullify selected heads of the self-attention
modules. Mask values selected in [0, 1] :

•1 indicates the head is not masked,


•0 indicates the head is masked.
• output_attentions ( bool , optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

The ViTModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.

Example:

>>> from transformers import AutoImageProcessor, ViTModel
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 197, 768]


ViTForMaskedImageModeling

class transformers.ViTForMaskedImageModeling <>

( config: ViTConfig )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model with a decoder on top for masked image modeling, as proposed in SimMIM.

Note that we provide a script to pre-train this model on custom data in our examples
directory.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module


and refer to the PyTorch documentation for all matter related to general usage and
behavior.

forward <>

( pixel_values: typing.Optional[torch.Tensor] = None, bool_masked_pos:


typing.Optional[torch.BoolTensor] = None, head_mask: typing.Optional[torch.Tensor]
= None, output_attentions: typing.Optional[bool] = None, output_hidden_states:
typing.Optional[bool] = None, interpolate_pos_encoding: typing.Optional[bool] =
None, return_dict: typing.Optional[bool] = None ) →
transformers.modeling_outputs.MaskedImageModelingOutput or
tuple(torch.FloatTensor)

Parameters

• pixel_values ( torch.FloatTensor of shape (batch_size, num_channels,


height, width) ) — Pixel values. Pixel values can be obtained using
AutoImageProcessor. See ViTImageProcessor.call() for details.

• head_mask ( torch.FloatTensor of shape (num_heads,) or (num_layers,


num_heads) , optional) — Mask to nullify selected heads of the self-attention
modules. Mask values selected in [0, 1] :

•1 indicates the head is not masked,


•0 indicates the head is masked.
• output_attentions ( bool , optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

• output_hidden_states ( bool , optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

The ViTForMaskedImageModeling forward method, overrides the __call__ special


method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.

Examples:

>>> from transformers import AutoImageProcessor, ViTForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
>>> list(reconstructed_pixel_values.shape)
[1, 3, 224, 224]

ViTForImageClassification

class transformers.ViTForImageClassification <>

( config: ViTConfig )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model transformer with an image classification head on top (a linear layer on top of
the final hidden state of the [CLS] token) e.g. for ImageNet.

Note that it’s possible to fine-tune ViT on higher resolution images than the ones it has
been trained on, by setting interpolate_pos_encoding to True in the forward of
the model. This will interpolate the pre-trained position embeddings to the higher
resolution.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module


and refer to the PyTorch documentation for all matter related to general usage and
behavior.

forward <>

( pixel_values: typing.Optional[torch.Tensor] = None, head_mask:


typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.Tensor] = None,
output_attentions: typing.Optional[bool] = None, output_hidden_states:
typing.Optional[bool] = None, interpolate_pos_encoding: typing.Optional[bool] =
None, return_dict: typing.Optional[bool] = None ) →
transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

Parameters

• pixel_values ( torch.FloatTensor of shape (batch_size, num_channels,


height, width) ) — Pixel values. Pixel values can be obtained using
AutoImageProcessor. See ViTImageProcessor.call() for details.

• head_mask ( torch.FloatTensor of shape (num_heads,) or (num_layers,


num_heads) , optional) — Mask to nullify selected heads of the self-attention
modules. Mask values selected in [0, 1] :

•1 indicates the head is not masked,


•0 indicates the head is masked.
• output_attentions ( bool , optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

• output_hidden_states ( bool , optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

The ViTForImageClassification forward method, overrides the __call__ special


method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.

Example:

>>> from transformers import AutoImageProcessor, ViTForImageClassification
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
Egyptian cat

TensorFlow

TFViTModel

class transformers.TFViTModel <>

( config: ViTConfig, *inputs, add_pooling_layer = True, **kwargs )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

The bare ViT Model transformer outputting raw hidden-states without any specific head
on top.

This model inherits from TFPreTrainedModel. Check the superclass documentation for
the generic methods the library implements for all its model (such as downloading or
saving, resizing the input embeddings, pruning heads etc.)

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer
to the TF 2.0 documentation for all matter related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

•having all inputs as keyword arguments (like PyTorch models), or


•having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format
when passing inputs to models and layers. Because of this support, when using
methods like model.fit() things should “just work” for you - just pass your inputs
and labels in any format that model.fit() supports! If, however, you want to use the
second format outside of Keras methods like fit() and predict() , such as when
creating your own layers or models with the Keras Functional API, there are three
possibilities you can use to gather all the input Tensors in the first positional argument:

•a single Tensor with pixel_values only and nothing else:


model(pixel_values)

•a list of varying length with one or several input Tensors IN THE ORDER given in
the docstring: model([pixel_values, attention_mask]) or
model([pixel_values, attention_mask, token_type_ids])


•a dictionary with one or several input Tensors associated to the input names
given in the docstring: model({"pixel_values": pixel_values,
"token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing then you don’t need to
worry about any of this, as you can just pass inputs like you would to any other Python
function!

call <>

( pixel_values: TFModelInputType | None = None, head_mask: np.ndarray | tf.Tensor |


None = None, output_attentions: Optional[bool] = None, output_hidden_states:
Optional[bool] = None, interpolate_pos_encoding: Optional[bool] = None,
return_dict: Optional[bool] = None, training: bool = False ) →
transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or
tuple(tf.Tensor)

Parameters

• pixel_values ( np.ndarray , tf.Tensor , List[tf.Tensor] , Dict[str, tf.Tensor] or Dict[str, np.ndarray] and each example must have the shape (batch_size, num_channels, height, width) ) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.call() for details.

• head_mask ( np.ndarray or tf.Tensor of shape (num_heads,) or


(num_layers, num_heads) , optional) — Mask to nullify selected heads of the self-
attention modules. Mask values selected in [0, 1] :

•1 indicates the head is not masked,


•0 indicates the head is masked.
• output_attentions ( bool , optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.

The TFViTModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.


Example:

>>> from transformers import AutoImageProcessor, TFViTModel
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(image, return_tensors="tf")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 197, 768]

TFViTForImageClassification

class transformers.TFViTForImageClassification <>

( config: ViTConfig, *inputs, **kwargs )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model transformer with an image classification head on top (a linear layer on top of
the final hidden state of the [CLS] token) e.g. for ImageNet.

Note that it’s possible to fine-tune ViT on higher resolution images than the ones it has
been trained on, by setting interpolate_pos_encoding to True in the forward of
the model. This will interpolate the pre-trained position embeddings to the higher
resolution.


This model inherits from TFPreTrainedModel. Check the superclass documentation for
the generic methods the library implements for all its model (such as downloading or
saving, resizing the input embeddings, pruning heads etc.)

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer
to the TF 2.0 documentation for all matter related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

•having all inputs as keyword arguments (like PyTorch models), or


•having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format
when passing inputs to models and layers. Because of this support, when using
methods like model.fit() things should “just work” for you - just pass your inputs
and labels in any format that model.fit() supports! If, however, you want to use the
second format outside of Keras methods like fit() and predict() , such as when
creating your own layers or models with the Keras Functional API, there are three
possibilities you can use to gather all the input Tensors in the first positional argument:

•a single Tensor with pixel_values only and nothing else:


model(pixel_values)

•a list of varying length with one or several input Tensors IN THE ORDER given in
the docstring: model([pixel_values, attention_mask]) or
model([pixel_values, attention_mask, token_type_ids])

•a dictionary with one or several input Tensors associated to the input names
given in the docstring: model({"pixel_values": pixel_values,
"token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing then you don’t need to
worry about any of this, as you can just pass inputs like you would to any other Python
function!

call <>

( pixel_values: TFModelInputType | None = None, head_mask: np.ndarray | tf.Tensor |


None = None, output_attentions: Optional[bool] = None, output_hidden_states:
Optional[bool] = None, interpolate_pos_encoding: Optional[bool] = None,
return_dict: Optional[bool] = None, labels: np.ndarray | tf.Tensor | None = None,
training: Optional[bool] = False ) →
transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)


Parameters

• pixel_values ( np.ndarray , tf.Tensor , List[tf.Tensor] , Dict[str, tf.Tensor] or Dict[str, np.ndarray] and each example must have the shape (batch_size, num_channels, height, width) ) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.call() for details.

• head_mask ( np.ndarray or tf.Tensor of shape (num_heads,) or


(num_layers, num_heads) , optional) — Mask to nullify selected heads of the self-
attention modules. Mask values selected in [0, 1] :

•1 indicates the head is not masked,


•0 indicates the head is masked.
• output_attentions ( bool , optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.

The TFViTForImageClassification forward method, overrides the __call__ special


method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.

Example:

>>> from transformers import AutoImageProcessor, TFViTForImageClassification
>>> import tensorflow as tf
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = TFViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(image, return_tensors="tf")
>>> logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = int(tf.math.argmax(logits, axis=-1))
>>> print(model.config.id2label[predicted_label])
Egyptian cat

JAX

FlaxViTModel

class transformers.FlaxViTModel <>

( config: ViTConfig, input_shape = None, seed: int = 0, dtype: dtype = <class


'jax.numpy.float32'>, _do_init: bool = True, **kwargs )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

• dtype ( jax.numpy.dtype , optional, defaults to jax.numpy.float32 ) — The data


type of the computation. Can be one of jax.numpy.float32 , jax.numpy.float16
(on GPUs) and jax.numpy.bfloat16 (on TPUs).

This can be used to enable mixed-precision training or half-precision inference on GPUs


or TPUs. If specified all the computation will be performed with the given dtype .

Note that this only specifies the dtype of the computation and does not influence the
dtype of model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

The bare ViT Model transformer outputting raw hidden-states without any specific head
on top.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for
the generic methods the library implements for all its model (such as downloading,
saving and converting weights from PyTorch models)

This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and
refer to the Flax documentation for all matter related to general usage and behavior.


Finally, this model supports inherent JAX features such as:

• Just-In-Time (JIT) compilation


• Automatic Differentiation
• Vectorization
• Parallelization

__call__ <>

( pixel_values, params: dict = None, dropout_rng: <function PRNGKey at


0x7f50727b7640> = None, train: bool = False, output_attentions:
typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None,
return_dict: typing.Optional[bool] = None ) →
transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or
tuple(torch.FloatTensor)

Returns: transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when
config.return_dict=False ) comprising various elements depending on the
configuration ( <class
'transformers.models.vit.configuration_vit.ViTConfig'> ) and inputs.

•last_hidden_state ( jnp.ndarray of shape (batch_size, sequence_length, hidden_size) ) — Sequence of hidden-states at the output of the last layer of the model.

•pooler_output ( jnp.ndarray of shape (batch_size, hidden_size) ) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function.

The FlaxViTPreTrainedModel forward method, overrides the __call__ special


method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes


care of running the pre and post processing steps while the latter silently ignores
them.

Examples:

>>> from transformers import AutoImageProcessor, FlaxViTModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = FlaxViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state

FlaxViTForImageClassification

class transformers.FlaxViTForImageClassification <>

( config: ViTConfig, input_shape = None, seed: int = 0, dtype: dtype = <class


'jax.numpy.float32'>, _do_init: bool = True, **kwargs )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

• dtype ( jax.numpy.dtype , optional, defaults to jax.numpy.float32 ) — The data


type of the computation. Can be one of jax.numpy.float32 , jax.numpy.float16
(on GPUs) and jax.numpy.bfloat16 (on TPUs).

This can be used to enable mixed-precision training or half-precision inference on GPUs


or TPUs. If specified all the computation will be performed with the given dtype .

Note that this only specifies the dtype of the computation and does not influence the
dtype of model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

ViT Model transformer with an image classification head on top (a linear layer on top of
the final hidden state of the [CLS] token) e.g. for ImageNet.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for
the generic methods the library implements for all its model (such as downloading,
saving and converting weights from PyTorch models)

This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and
refer to the Flax documentation for all matter related to general usage and behavior.

Finally, this model supports inherent JAX features such as:

• Just-In-Time (JIT) compilation


• Automatic Differentiation
• Vectorization
• Parallelization

__call__ <>

( pixel_values, params: dict = None, dropout_rng: <function PRNGKey at


0x7f50727b7640> = None, train: bool = False, output_attentions:
typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None,
return_dict: typing.Optional[bool] = None ) →
transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or
tuple(torch.FloatTensor)

Returns: transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when
config.return_dict=False ) comprising various elements depending on the
configuration ( <class
'transformers.models.vit.configuration_vit.ViTConfig'> ) and inputs.

•logits ( jnp.ndarray of shape (batch_size, config.num_labels) ) — Classification (or regression if config.num_labels==1) scores (before SoftMax).

•hidden_states ( tuple(jnp.ndarray) , optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True ) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size) .

The FlaxViTPreTrainedModel forward method, overrides the __call__ special


method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.

Example:

>>> from transformers import AutoImageProcessor, FlaxViTForImageClassification
>>> from PIL import Image
>>> import jax
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = FlaxViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> logits = outputs.logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = jax.numpy.argmax(logits, axis=-1)
>>> print("Predicted class:", model.config.id2label[predicted_class_idx.item()])
