ViT
Time: 2025-04-26 20:09:15
### Vision Transformer (ViT) Model Introduction
Vision Transformers (ViTs)[^1] apply the transformer architecture, originally designed for natural language processing, to image processing. Unlike Convolutional Neural Networks (CNNs), which rely on local receptive fields and hierarchical feature extraction through convolutions, ViTs treat an image as a sequence of patches.
In a typical setup, the input image is divided into fixed-size patches. Each patch is flattened and linearly embedded, and the resulting token sequence is processed by a stack of layers that combine multi-head self-attention with position-wise feed-forward networks. Because every patch can attend to every other patch, ViTs capture global dependencies within an image effectively. The cost of self-attention, however, grows quadratically with the number of patches, so ViTs typically need more computation than CNNs[^2].
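The patch-splitting and embedding step can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function names `patchify` and `embed_patches`, the random projection matrix, and the 224×224 input with 16×16 patches are assumptions chosen for the example. A real ViT learns the projection and also adds a class token and position embeddings.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch_size, cols, patch_size, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch_size * patch_size * c)


def embed_patches(patches: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Linearly project each flattened patch to the model dimension."""
    return patches @ projection


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))         # hypothetical input image
patches = patchify(image, patch_size=16)  # 14 x 14 = 196 patches of 16 * 16 * 3 values
# Random stand-in for the learned projection weights (assumption)
projection = rng.normal(scale=0.02, size=(patches.shape[1], 768))
tokens = embed_patches(patches, projection)

print(patches.shape, tokens.shape)
```

With 196 patch tokens, each attention head computes 196 × 196 = 38,416 pairwise scores, which is why the cost rises quickly as resolution increases or patch size shrinks.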
To address this issue, various optimizations have been proposed. These include lightweight architectures that borrow the inductive biases of CNNs, and quantization techniques that keep the original architecture while reducing resource consumption without significant performance loss.
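As one illustration of the quantization idea, the sketch below applies symmetric per-tensor int8 quantization to a weight matrix with NumPy. It is a generic textbook scheme, not the specific method of any cited work; the helper names `quantize_int8` and `dequantize` and the random weight matrix are hypothetical.

```python
import numpy as np


def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: map floats to int8 with one scale."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 values."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
# Random stand-in for the weights of one ViT linear layer (assumption)
weights = rng.normal(scale=0.02, size=(768, 768)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("bytes before:", weights.nbytes, "after:", q.nbytes)
print("max abs error:", float(np.max(np.abs(weights - restored))))
```

Storing int8 instead of float32 cuts the memory for this matrix by a factor of four, and the rounding error per weight is bounded by about half the scale. Whether model accuracy survives depends on the model and the quantization scheme, so it should be measured rather than assumed.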
#### Code Example for Using Pre-trained ViT Models
The following example loads a pre-trained ViT from the Hugging Face `transformers` library and extracts features from an image:
```python
from transformers import ViTImageProcessor, ViTModel
import torch
from PIL import Image
import requests

# Download a sample image from the COCO validation set
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# ViTImageProcessor supersedes the deprecated ViTFeatureExtractor
image_processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTModel.from_pretrained('google/vit-base-patch16-224')

# Resize and normalize the image, returning PyTorch tensors
inputs = image_processor(images=image, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Shape (1, 197, 768): one [CLS] token plus 14 x 14 patch tokens
last_hidden_states = outputs.last_hidden_state
```
This snippet loads a pre-trained ViT model together with its matching image processor (called a feature extractor in older `transformers` releases), preprocesses an image, passes it through the network, and obtains the output features.
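To make the output features concrete, the following sketch separates the class token from the patch tokens in an array shaped like `last_hidden_state`. It uses a random NumPy array as a stand-in so that it runs without downloading the model. The shape (1, 197, 768) matches `google/vit-base-patch16-224` at 224×224 resolution, and the helper `split_tokens` is a hypothetical name. The same indexing should also work on the PyTorch tensor returned by the model.

```python
import numpy as np


def split_tokens(hidden, grid_size: int = 14):
    """Split (batch, 1 + grid_size**2, dim) features into [CLS] and patch tokens.

    Returns the class-token embedding (batch, dim), the mean-pooled patch
    embedding (batch, dim), and the patch tokens as a (batch, grid, grid, dim) map.
    """
    batch, num_tokens, dim = hidden.shape
    if num_tokens != 1 + grid_size * grid_size:
        raise ValueError("unexpected number of tokens for this grid size")
    cls_embedding = hidden[:, 0, :]    # the first token is [CLS]
    patch_tokens = hidden[:, 1:, :]    # the remaining tokens are image patches
    pooled = patch_tokens.mean(1)      # average over the patch axis
    patch_map = patch_tokens.reshape(batch, grid_size, grid_size, dim)
    return cls_embedding, pooled, patch_map


# Random stand-in for outputs.last_hidden_state (assumption)
hidden = np.random.default_rng(0).normal(size=(1, 197, 768))
cls_embedding, pooled, patch_map = split_tokens(hidden)
print(cls_embedding.shape, pooled.shape, patch_map.shape)
```

The class-token embedding or the pooled patch embedding is a common choice of image-level feature, while the patch map keeps the spatial layout for dense tasks.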