
Vision Transformer (ViT)

Overview

The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander
Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. It’s the first paper that
successfully trains a Transformer encoder on ImageNet, attaining very good results compared to
familiar convolutional architectures.

The abstract from the paper is the following:

While the Transformer architecture has become the de-facto standard for natural language
processing tasks, its applications to computer vision remain limited. In vision, attention is either
applied in conjunction with convolutional networks, or used to replace certain components of
convolutional networks while keeping their overall structure in place. We show that this reliance
on CNNs is not necessary and a pure transformer applied directly to sequences of image patches
can perform very well on image classification tasks. When pre-trained on large amounts of data
and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-
100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art
convolutional networks while requiring substantially fewer computational resources to train.


ViT architecture. Taken from the original paper.

Following the original Vision Transformer, some follow-up works have been made:

DeiT (Data-efficient Image Transformers) by Facebook AI. DeiT models are distilled vision
transformers. The authors of DeiT also released more efficiently trained ViT models, which
you can directly plug into ViTModel or ViTForImageClassification. There are 4 variants
available (in 3 different sizes): facebook/deit-tiny-patch16-224, facebook/deit-small-patch16-
224, facebook/deit-base-patch16-224 and facebook/deit-base-patch16-384. Note that one
should use DeiTImageProcessor in order to prepare images for the model.

BEiT (BERT pre-training of Image Transformers) by Microsoft Research. BEiT models outperform supervised pre-trained vision transformers using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.

DINO (a method for self-supervised training of Vision Transformers) by Facebook AI. Vision
Transformers trained using the DINO method show very interesting properties not seen
with convolutional models. They are capable of segmenting objects, without having ever
been trained to do so. DINO checkpoints can be found on the hub.

MAE (Masked Autoencoders) by Facebook AI. By pre-training Vision Transformers to reconstruct pixel values for a high portion (75%) of masked patches (using an asymmetric encoder-decoder architecture), the authors show that this simple method outperforms supervised pre-training after fine-tuning.


This model was contributed by nielsr. The original code (written in JAX) can be found here.

Note that we converted the weights from Ross Wightman’s timm library, who already converted
the weights from JAX to PyTorch. Credits go to him!

Usage tips

To feed images to the Transformer encoder, each image is split into a sequence of fixed-size
non-overlapping patches, which are then linearly embedded. A [CLS] token is added to
serve as representation of an entire image, which can be used for classification. The
authors also add absolute position embeddings, and feed the resulting sequence of vectors
to a standard Transformer encoder.
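
As a quick sanity check, the sequence length seen by the encoder follows directly from the image and patch sizes. The short sketch below (plain Python, no library calls) reproduces the 197-token sequence that also shows up in the ViTModel example further down:

# Sequence length of the base ViT encoder for a 224x224 image split into 16x16 patches
image_size, patch_size, hidden_size = 224, 16, 768

num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches
seq_len = num_patches + 1                      # + 1 for the [CLS] token -> 197

print(num_patches, seq_len, hidden_size)       # 196 197 768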

As the Vision Transformer expects each image to be of the same size (resolution), one can
use ViTImageProcessor to resize (or rescale) and normalize images for the model.
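
For instance, a minimal sketch of preparing a single image this way (the local file name is only a placeholder; any RGB image works):

from PIL import Image
from transformers import ViTImageProcessor

image = Image.open("cat.png").convert("RGB")  # placeholder path used for illustration

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = processor(images=image, return_tensors="pt")

print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])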

Both the patch resolution and image resolution used during pre-training or fine-tuning are
reflected in the name of each checkpoint. For example, google/vit-base-patch16-224
refers to a base-sized architecture with patch resolution of 16x16 and fine-tuning resolution
of 224x224. All checkpoints can be found on the hub.

The available checkpoints are either (1) pre-trained on ImageNet-21k (a collection of 14 million images and 21k classes) only, or (2) also fine-tuned on ImageNet (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).

The Vision Transformer was pre-trained using a resolution of 224x224. During fine-tuning, it
is often beneficial to use a higher resolution than pre-training (Touvron et al., 2019),
(Kolesnikov et al., 2020). In order to fine-tune at higher resolution, the authors perform 2D
interpolation of the pre-trained position embeddings, according to their location in the
original image.
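
A minimal sketch of running the model at a higher resolution than it was pre-trained on (384x384 is only an illustrative choice, and the random tensor stands in for a properly preprocessed image):

import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# 384x384 input -> (384 / 16) ** 2 = 576 patches instead of 196
pixel_values = torch.randn(1, 3, 384, 384)

with torch.no_grad():
    # interpolate_pos_encoding=True resizes the pre-trained position embeddings to the new grid
    logits = model(pixel_values, interpolate_pos_encoding=True).logits

print(logits.shape)  # torch.Size([1, 1000])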

The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed an experiment with a self-supervised pre-training objective, namely masked patch prediction (inspired by masked language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% over training from scratch, but still 4% behind supervised pre-training.

Using Scaled Dot Product Attention (SDPA)


PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional . This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may
also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be
used.

import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224", attn_implementation="sdpa", torch_dtype=torch.float16)
...

For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16
or torch.bfloat16 ).

On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with float32 and
google/vit-base-patch16-224 model, we saw the following speedups during inference.

| Batch size | Average inference time (ms), eager mode | Average inference time (ms), SDPA | Speedup, SDPA / eager (x) |
|---|---|---|---|
| 1 | 7 | 6 | 1.17 |
| 2 | 8 | 6 | 1.33 |
| 4 | 8 | 6 | 1.33 |
| 8 | 8 | 6 | 1.33 |
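
A rough sketch of how such a comparison could be reproduced locally is shown below; it is not the exact benchmark script, and absolute numbers will vary with hardware, batch size and PyTorch version:

import time
import torch
from transformers import ViTForImageClassification

def avg_inference_ms(attn_implementation: str, batch_size: int = 8, steps: int = 20) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = ViTForImageClassification.from_pretrained(
        "google/vit-base-patch16-224", attn_implementation=attn_implementation
    ).to(device)
    model.eval()
    pixel_values = torch.randn(batch_size, 3, 224, 224, device=device)
    with torch.no_grad():
        for _ in range(3):  # warm-up passes
            model(pixel_values)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(steps):
            model(pixel_values)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps * 1000.0

print("eager:", avg_inference_ms("eager"), "ms")
print("sdpa :", avg_inference_ms("sdpa"), "ms")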

Resources

Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found
here. A list of official Hugging Face and community (indicated by 🌎) resources to help you get
started with ViT. If you’re interested in submitting a resource to be included here, please feel free
to open a Pull Request and we’ll review it! The resource should ideally demonstrate something
new instead of duplicating an existing resource.


ViTForImageClassification is supported by:

Image Classification

A blog post on how to Fine-Tune ViT for Image Classification with Hugging Face
Transformers

A blog post on Image Classification with Hugging Face Transformers and Keras

A notebook on Fine-tuning for Image Classification with Hugging Face Transformers

A notebook on how to Fine-tune the Vision Transformer on CIFAR-10 with the Hugging Face
Trainer

A notebook on how to Fine-tune the Vision Transformer on CIFAR-10 with PyTorch Lightning

⚗️ Optimization

A blog post on how to Accelerate Vision Transformer (ViT) with Quantization using Optimum

⚡️ Inference

A notebook on Quick demo: Vision Transformer (ViT) by Google Brain

🚀 Deploy

A blog post on Deploying Tensorflow Vision Models in Hugging Face with TF Serving

A blog post on Deploying Hugging Face ViT on Vertex AI

A blog post on Deploying Hugging Face ViT on Kubernetes with TF Serving

ViTConfig

class transformers.ViTConfig <>

( hidden_size = 768, num_hidden_layers = 12, num_attention_heads = 12, intermediate_size =


3072, hidden_act = 'gelu', hidden_dropout_prob = 0.0, attention_probs_dropout_prob = 0.0,
initializer_range = 0.02, layer_norm_eps = 1e-12, image_size = 224, patch_size = 16,
num_channels = 3, qkv_bias = True, encoder_stride = 16, **kwargs )

Parameters


• hidden_size ( int , optional, defaults to 768) — Dimensionality of the encoder layers and the
pooler layer.

• num_hidden_layers ( int , optional, defaults to 12) — Number of hidden layers in the


Transformer encoder.

• num_attention_heads ( int , optional, defaults to 12) — Number of attention heads for each
attention layer in the Transformer encoder.

• intermediate_size ( int , optional, defaults to 3072) — Dimensionality of the “intermediate”


(i.e., feed-forward) layer in the Transformer encoder.

• hidden_act ( str or function , optional, defaults to "gelu" ) — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu" , "relu" , "selu" and "gelu_new" are supported.

This is the configuration class to store the configuration of a ViTModel. It is used to instantiate a ViT model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the
ViT google/vit-base-patch16-224 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model
outputs. Read the documentation from PretrainedConfig for more information.

Example:

>>> from transformers import ViTConfig, ViTModel

>>> # Initializing a ViT vit-base-patch16-224 style configuration
>>> configuration = ViTConfig()

>>> # Initializing a model (with random weights) from the vit-base-patch16-224 style configuration
>>> model = ViTModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

ViTFeatureExtractor

class transformers.ViTFeatureExtractor <>


( *args, **kwargs )

__call__ <>

( images, **kwargs )

Preprocess an image or a batch of images.

ViTImageProcessor

class transformers.ViTImageProcessor <>

( do_resize: bool = True, size: typing.Optional[typing.Dict[str, int]] = None, resample:


Resampling = <Resampling.BILINEAR: 2>, do_rescale: bool = True, rescale_factor:
typing.Union[int, float] = 0.00392156862745098, do_normalize: bool = True, image_mean:
typing.Union[float, typing.List[float], NoneType] = None, image_std: typing.Union[float,
typing.List[float], NoneType] = None, do_convert_rgb: typing.Optional[bool] = None,
**kwargs )

Parameters

• do_resize ( bool , optional, defaults to True ) — Whether to resize the image’s (height, width)
dimensions to the specified (size["height"], size["width"]) . Can be overridden by
the do_resize parameter in the preprocess method.

• size ( dict , optional, defaults to {"height": 224, "width": 224} ) — Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.

• resample ( PILImageResampling , optional, defaults to Resampling.BILINEAR ) —


Resampling filter to use if resizing the image. Can be overridden by the resample parameter
in the preprocess method.

• do_rescale ( bool , optional, defaults to True ) — Whether to rescale the image by the specified scale rescale_factor . Can be overridden by the do_rescale parameter in the preprocess method.

• rescale_factor ( int or float , optional, defaults to 1/255 ) — Scale factor to use if rescaling the image.

Constructs a ViT image processor.

preprocess <>


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray,


ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')],
typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]], do_resize:
typing.Optional[bool] = None, size: typing.Dict[str, int] = None, resample: Resampling =
None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] =
None, do_normalize: typing.Optional[bool] = None, image_mean: typing.Union[float,
typing.List[float], NoneType] = None, image_std: typing.Union[float, typing.List[float],
NoneType] = None, return_tensors: typing.Union[str,
transformers.utils.generic.TensorType, NoneType] = None, data_format: typing.Union[str,
transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>,
input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension,
NoneType] = None, do_convert_rgb: typing.Optional[bool] = None )

Parameters

• images ( ImageInput ) — Image to preprocess. Expects a single or batch of images with


pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1,
set do_rescale=False .

• do_resize ( bool , optional, defaults to self.do_resize ) — Whether to resize the image.

• size ( Dict[str, int] , optional, defaults to self.size ) — Dictionary in the format


{"height": h, "width": w} specifying the size of the output image after resizing.

• resample ( PILImageResampling filter, optional, defaults to self.resample ) — PILImageResampling filter to use if resizing the image, e.g. PILImageResampling.BILINEAR . Only has an effect if do_resize is set to True .

• do_rescale ( bool , optional, defaults to self.do_rescale ) — Whether to rescale the image values between [0 - 1].

Preprocess an image or batch of images.
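
As a small illustration of the images/do_rescale note above: for inputs whose pixel values are already in [0, 1] (for example a float numpy array), rescaling should be disabled. A minimal sketch with a synthetic image:

import numpy as np
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

# a synthetic float image with values already in [0, 1], channels-last (H, W, C)
image = np.random.rand(224, 224, 3).astype(np.float32)

# skip the 1/255 rescaling, but still resize and normalize
inputs = processor(images=image, do_rescale=False, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])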

ViTImageProcessorFast

class transformers.ViTImageProcessorFast <>

( do_resize: bool = True, size: typing.Optional[typing.Dict[str, int]] = None, resample:


Resampling = <Resampling.BILINEAR: 2>, do_rescale: bool = True, rescale_factor:
typing.Union[int, float] = 0.00392156862745098, do_normalize: bool = True, image_mean:
typing.Union[float, typing.List[float], NoneType] = None, image_std: typing.Union[float,
typing.List[float], NoneType] = None, do_convert_rgb: typing.Optional[bool] = None,
**kwargs )

Parameters


• do_resize ( bool , optional, defaults to True ) — Whether to resize the image’s (height, width)
dimensions to the specified (size["height"], size["width"]) . Can be overridden by
the do_resize parameter in the preprocess method.

• size ( dict , optional, defaults to {"height": 224, "width": 224} ) — Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.

• resample ( PILImageResampling , optional, defaults to Resampling.BILINEAR ) —


Resampling filter to use if resizing the image. Can be overridden by the resample parameter
in the preprocess method.

• do_rescale ( bool , optional, defaults to True ) — Whether to rescale the image by the specified scale rescale_factor . Can be overridden by the do_rescale parameter in the preprocess method.

• rescale_factor ( int or float , optional, defaults to 1/255 ) — Scale factor to use if rescaling the image.

Constructs a ViT image processor.

preprocess <>

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray,


ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')],
typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]], do_resize:
typing.Optional[bool] = None, size: typing.Dict[str, int] = None, resample: Resampling =
None, do_rescale: typing.Optional[bool] = None, rescale_factor: typing.Optional[float] =
None, do_normalize: typing.Optional[bool] = None, image_mean: typing.Union[float,
typing.List[float], NoneType] = None, image_std: typing.Union[float, typing.List[float],
NoneType] = None, return_tensors: typing.Union[str,
transformers.utils.generic.TensorType, NoneType] = 'pt', data_format: typing.Union[str,
transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'>,
input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension,
NoneType] = None, do_convert_rgb: typing.Optional[bool] = None, **kwargs )

Parameters

• images ( ImageInput ) — Image to preprocess. Expects a single or batch of images with


pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1,
set do_rescale=False .

• do_resize ( bool , optional, defaults to self.do_resize ) — Whether to resize the image.

• size ( Dict[str, int] , optional, defaults to self.size ) — Dictionary in the format


{"height": h, "width": w} specifying the size of the output image after resizing.

• resample ( PILImageResampling filter, optional, defaults to self.resample ) — PILImageResampling filter to use if resizing the image, e.g. PILImageResampling.BILINEAR . Only has an effect if do_resize is set to True .

• do_rescale ( bool , optional, defaults to self.do_rescale ) — Whether to rescale the image values between [0 - 1].

• rescale_factor ( float , optional, defaults to self.rescale_factor ) — Rescale factor to use if rescaling the image.

Preprocess an image or batch of images.

do_convert_rgb ( bool , optional): Whether to convert the image to RGB.
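
A short sketch of the fast processor, which mirrors ViTImageProcessor but defaults to returning PyTorch tensors ( return_tensors = 'pt' in the preprocess signature above) and relies on torchvision being installed; the file name is only a placeholder:

from PIL import Image
from transformers import ViTImageProcessorFast

processor = ViTImageProcessorFast.from_pretrained("google/vit-base-patch16-224")

# placeholder path used for illustration
image = Image.open("cat.png").convert("RGB")

inputs = processor(images=image)  # defaults to return_tensors="pt"
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])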

Pytorch

ViTModel

class transformers.ViTModel <>

( config: ViTConfig, add_pooling_layer: bool = True, use_mask_token: bool = False )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

The bare ViT Model transformer outputting raw hidden-states without any specific head
on top. This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch
Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.

forward <>

( pixel_values: typing.Optional[torch.Tensor] = None, bool_masked_pos:


typing.Optional[torch.BoolTensor] = None, head_mask: typing.Optional[torch.Tensor]
= None, output_attentions: typing.Optional[bool] = None, output_hidden_states:
typing.Optional[bool] = None, interpolate_pos_encoding: typing.Optional[bool] =
None, return_dict: typing.Optional[bool] = None ) →
transformers.modeling_outputs.BaseModelOutputWithPooling or
tuple(torch.FloatTensor)

Parameters


• pixel_values ( torch.FloatTensor of shape (batch_size, num_channels,


height, width) ) — Pixel values. Pixel values can be obtained using
AutoImageProcessor. See ViTImageProcessor.call() for details.

• head_mask ( torch.FloatTensor of shape (num_heads,) or (num_layers,


num_heads) , optional) — Mask to nullify selected heads of the self-attention
modules. Mask values selected in [0, 1] :

•1 indicates the head is not masked,


•0 indicates the head is masked.
• output_attentions ( bool , optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

The ViTModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.

Example:

>>> from transformers import AutoImageProcessor, ViTModel
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 197, 768]


ViTForMaskedImageModeling

class transformers.ViTForMaskedImageModeling <>

( config: ViTConfig )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model with a decoder on top for masked image modeling, as proposed in SimMIM.

Note that we provide a script to pre-train this model on custom data in our examples
directory.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module


and refer to the PyTorch documentation for all matter related to general usage and
behavior.

forward <>

( pixel_values: typing.Optional[torch.Tensor] = None, bool_masked_pos:


typing.Optional[torch.BoolTensor] = None, head_mask: typing.Optional[torch.Tensor]
= None, output_attentions: typing.Optional[bool] = None, output_hidden_states:
typing.Optional[bool] = None, interpolate_pos_encoding: typing.Optional[bool] =
None, return_dict: typing.Optional[bool] = None ) →
transformers.modeling_outputs.MaskedImageModelingOutput or
tuple(torch.FloatTensor)

Parameters

• pixel_values ( torch.FloatTensor of shape (batch_size, num_channels,


height, width) ) — Pixel values. Pixel values can be obtained using
AutoImageProcessor. See ViTImageProcessor.call() for details.

• head_mask ( torch.FloatTensor of shape (num_heads,) or (num_layers,


num_heads) , optional) — Mask to nullify selected heads of the self-attention
modules. Mask values selected in [0, 1] :

•1 indicates the head is not masked,


•0 indicates the head is masked.
• output_attentions ( bool , optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

• output_hidden_states ( bool , optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

The ViTForMaskedImageModeling forward method, overrides the __call__ special


method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.

Examples:

>>> from transformers import AutoImageProcessor, ViTForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
>>> list(reconstructed_pixel_values.shape)
[1, 3, 224, 224]

ViTForImageClassification

class transformers.ViTForImageClassification <>

( config: ViTConfig )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model transformer with an image classification head on top (a linear layer on top of
the final hidden state of the [CLS] token) e.g. for ImageNet.

Note that it’s possible to fine-tune ViT on higher resolution images than the ones it has
been trained on, by setting interpolate_pos_encoding to True in the forward of
the model. This will interpolate the pre-trained position embeddings to the higher
resolution.

This model is a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module


and refer to the PyTorch documentation for all matter related to general usage and
behavior.

forward <>

( pixel_values: typing.Optional[torch.Tensor] = None, head_mask:


typing.Optional[torch.Tensor] = None, labels: typing.Optional[torch.Tensor] = None,
output_attentions: typing.Optional[bool] = None, output_hidden_states:
typing.Optional[bool] = None, interpolate_pos_encoding: typing.Optional[bool] =
None, return_dict: typing.Optional[bool] = None ) →
transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

Parameters

• pixel_values ( torch.FloatTensor of shape (batch_size, num_channels,


height, width) ) — Pixel values. Pixel values can be obtained using
AutoImageProcessor. See ViTImageProcessor.call() for details.

• head_mask ( torch.FloatTensor of shape (num_heads,) or (num_layers,


num_heads) , optional) — Mask to nullify selected heads of the self-attention
modules. Mask values selected in [0, 1] :

•1 indicates the head is not masked,


•0 indicates the head is masked.
• output_attentions ( bool , optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

• output_hidden_states ( bool , optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

The ViTForImageClassification forward method, overrides the __call__ special


method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.

Example:

>>> from transformers import AutoImageProcessor, ViTForImageClassification
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
Egyptian cat

TensorFlow

TFViTModel

class transformers.TFViTModel <>

( config: ViTConfig, *inputs, add_pooling_layer = True, **kwargs )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

The bare ViT Model transformer outputting raw hidden-states without any specific head
on top.

This model inherits from TFPreTrainedModel. Check the superclass documentation for
the generic methods the library implements for all its model (such as downloading or
saving, resizing the input embeddings, pruning heads etc.)

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer
to the TF 2.0 documentation for all matter related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

•having all inputs as keyword arguments (like PyTorch models), or


•having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format
when passing inputs to models and layers. Because of this support, when using
methods like model.fit() things should “just work” for you - just pass your inputs
and labels in any format that model.fit() supports! If, however, you want to use the
second format outside of Keras methods like fit() and predict() , such as when
creating your own layers or models with the Keras Functional API, there are three
possibilities you can use to gather all the input Tensors in the first positional argument:

•a single Tensor with pixel_values only and nothing else:


model(pixel_values)

•a list of varying length with one or several input Tensors IN THE ORDER given in
the docstring: model([pixel_values, attention_mask]) or
model([pixel_values, attention_mask, token_type_ids])


•a dictionary with one or several input Tensors associated to the input names
given in the docstring: model({"pixel_values": pixel_values,
"token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing then you don’t need to
worry about any of this, as you can just pass inputs like you would to any other Python
function!

call <>

( pixel_values: TFModelInputType | None = None, head_mask: np.ndarray | tf.Tensor |


None = None, output_attentions: Optional[bool] = None, output_hidden_states:
Optional[bool] = None, interpolate_pos_encoding: Optional[bool] = None,
return_dict: Optional[bool] = None, training: bool = False ) →
transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or
tuple(tf.Tensor)

Parameters

• pixel_values ( np.ndarray , tf.Tensor , List[tf.Tensor] , Dict[str, tf.Tensor] or Dict[str, np.ndarray] and each example must have the shape (batch_size, num_channels, height, width) ) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.call() for details.

• head_mask ( np.ndarray or tf.Tensor of shape (num_heads,) or


(num_layers, num_heads) , optional) — Mask to nullify selected heads of the self-
attention modules. Mask values selected in [0, 1] :

•1 indicates the head is not masked,


•0 indicates the head is masked.
• output_attentions ( bool , optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.

The TFViTModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.


Example:

>>> from transformers import AutoImageProcessor, TFViTModel
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(image, return_tensors="tf")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 197, 768]

TFViTForImageClassification

class transformers.TFViTForImageClassification <>

( config: ViTConfig, *inputs, **kwargs )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model transformer with an image classification head on top (a linear layer on top of
the final hidden state of the [CLS] token) e.g. for ImageNet.

Note that it’s possible to fine-tune ViT on higher resolution images than the ones it has
been trained on, by setting interpolate_pos_encoding to True in the forward of
the model. This will interpolate the pre-trained position embeddings to the higher
resolution.


This model inherits from TFPreTrainedModel. Check the superclass documentation for
the generic methods the library implements for all its model (such as downloading or
saving, resizing the input embeddings, pruning heads etc.)

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer
to the TF 2.0 documentation for all matter related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

•having all inputs as keyword arguments (like PyTorch models), or


•having all inputs as a list, tuple or dict in the first positional argument.
The reason the second format is supported is that Keras methods prefer this format
when passing inputs to models and layers. Because of this support, when using
methods like model.fit() things should “just work” for you - just pass your inputs
and labels in any format that model.fit() supports! If, however, you want to use the
second format outside of Keras methods like fit() and predict() , such as when
creating your own layers or models with the Keras Functional API, there are three
possibilities you can use to gather all the input Tensors in the first positional argument:

•a single Tensor with pixel_values only and nothing else:


model(pixel_values)

•a list of varying length with one or several input Tensors IN THE ORDER given in
the docstring: model([pixel_values, attention_mask]) or
model([pixel_values, attention_mask, token_type_ids])

•a dictionary with one or several input Tensors associated to the input names
given in the docstring: model({"pixel_values": pixel_values,
"token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing then you don’t need to
worry about any of this, as you can just pass inputs like you would to any other Python
function!

call <>

( pixel_values: TFModelInputType | None = None, head_mask: np.ndarray | tf.Tensor |


None = None, output_attentions: Optional[bool] = None, output_hidden_states:
Optional[bool] = None, interpolate_pos_encoding: Optional[bool] = None,
return_dict: Optional[bool] = None, labels: np.ndarray | tf.Tensor | None = None,
training: Optional[bool] = False ) →
transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)


Parameters

• pixel_values ( np.ndarray , tf.Tensor , List[tf.Tensor] , Dict[str, tf.Tensor] or Dict[str, np.ndarray] and each example must have the shape (batch_size, num_channels, height, width) ) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.call() for details.

• head_mask ( np.ndarray or tf.Tensor of shape (num_heads,) or


(num_layers, num_heads) , optional) — Mask to nullify selected heads of the self-
attention modules. Mask values selected in [0, 1] :

•1 indicates the head is not masked,


•0 indicates the head is masked.
• output_attentions ( bool , optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail. This argument can be used only in eager mode, in graph mode the value in the config will be used instead.

The TFViTForImageClassification forward method, overrides the __call__ special


method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.

Example:

>>> from transformers import AutoImageProcessor, TFViTForImageClassification
>>> import tensorflow as tf
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image", trust_remote_code=True)
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = TFViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(image, return_tensors="tf")
>>> logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = int(tf.math.argmax(logits, axis=-1))
>>> print(model.config.id2label[predicted_label])
Egyptian cat

JAX

FlaxViTModel

class transformers.FlaxViTModel <>

( config: ViTConfig, input_shape = None, seed: int = 0, dtype: dtype = <class


'jax.numpy.float32'>, _do_init: bool = True, **kwargs )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

• dtype ( jax.numpy.dtype , optional, defaults to jax.numpy.float32 ) — The data


type of the computation. Can be one of jax.numpy.float32 , jax.numpy.float16
(on GPUs) and jax.numpy.bfloat16 (on TPUs).

This can be used to enable mixed-precision training or half-precision inference on GPUs


or TPUs. If specified all the computation will be performed with the given dtype .

Note that this only specifies the dtype of the computation and does not influence the
dtype of model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

The bare ViT Model transformer outputting raw hidden-states without any specific head
on top.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for
the generic methods the library implements for all its model (such as downloading,
saving and converting weights from PyTorch models)

This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and
refer to the Flax documentation for all matter related to general usage and behavior.


Finally, this model supports inherent JAX features such as:

• Just-In-Time (JIT) compilation


• Automatic Differentiation
• Vectorization
• Parallelization

__call__ <>

( pixel_values, params: dict = None, dropout_rng: <function PRNGKey at


0x7f50727b7640> = None, train: bool = False, output_attentions:
typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None,
return_dict: typing.Optional[bool] = None ) →
transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or
tuple(torch.FloatTensor)

Returns: transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or a tuple of
torch.FloatTensor (if return_dict=False is passed or when
config.return_dict=False ) comprising various elements depending on the
configuration ( <class
'transformers.models.vit.configuration_vit.ViTConfig'> ) and inputs.

•last_hidden_state ( jnp.ndarray of shape (batch_size, sequence_length, hidden_size) ) — Sequence of hidden-states at the output of the last layer of the model.

•pooler_output ( jnp.ndarray of shape (batch_size, hidden_size) ) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function.

The FlaxViTPreTrainedModel forward method, overrides the __call__ special


method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes


care of running the pre and post processing steps while the latter silently ignores
them.

Examples:

>>> from transformers import AutoImageProcessor, FlaxViTModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = FlaxViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state

FlaxViTForImageClassification

class transformers.FlaxViTForImageClassification <>

( config: ViTConfig, input_shape = None, seed: int = 0, dtype: dtype = <class


'jax.numpy.float32'>, _do_init: bool = True, **kwargs )

Parameters

• config (ViTConfig) — Model configuration class with all the parameters of the model.
Initializing with a config file does not load the weights associated with the model, only
the configuration. Check out the from_pretrained() method to load the model weights.

• dtype ( jax.numpy.dtype , optional, defaults to jax.numpy.float32 ) — The data


type of the computation. Can be one of jax.numpy.float32 , jax.numpy.float16
(on GPUs) and jax.numpy.bfloat16 (on TPUs).

This can be used to enable mixed-precision training or half-precision inference on GPUs


or TPUs. If specified all the computation will be performed with the given dtype .

Note that this only specifies the dtype of the computation and does not influence the
dtype of model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

ViT Model transformer with an image classification head on top (a linear layer on top of
the final hidden state of the [CLS] token) e.g. for ImageNet.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for
the generic methods the library implements for all its model (such as downloading,
saving and converting weights from PyTorch models)

This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and
refer to the Flax documentation for all matter related to general usage and behavior.

Finally, this model supports inherent JAX features such as:

• Just-In-Time (JIT) compilation


• Automatic Differentiation
• Vectorization
• Parallelization

__call__ <>

( pixel_values, params: dict = None, dropout_rng: <function PRNGKey at


0x7f50727b7640> = None, train: bool = False, output_attentions:
typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None,
return_dict: typing.Optional[bool] = None ) →
transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or
tuple(torch.FloatTensor)

Returns: transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when
config.return_dict=False ) comprising various elements depending on the
configuration ( <class
'transformers.models.vit.configuration_vit.ViTConfig'> ) and inputs.

•logits ( jnp.ndarray of shape (batch_size, config.num_labels) ) — Classification (or regression if config.num_labels==1) scores (before SoftMax).

•hidden_states ( tuple(jnp.ndarray) , optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True ) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size) .

The FlaxViTPreTrainedModel forward method, overrides the __call__ special


method.

Although the recipe for forward pass needs to be defined within this function, one
should call the Module instance afterwards instead of this since the former takes
care of running the pre and post processing steps while the latter silently ignores
them.

Example:

>>> from transformers import AutoImageProcessor, FlaxViTForImageClassification
>>> from PIL import Image
>>> import jax
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = FlaxViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> logits = outputs.logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = jax.numpy.argmax(logits, axis=-1)
>>> print("Predicted class:", model.config.id2label[predicted_class_idx.item()])
