CLIP

CLIP is a multimodal vision and language model motivated by overcoming the fixed number of object categories when training a computer vision model. CLIP learns about images directly from raw text by jointly training on 400M (image, text) pairs. Pretraining at this scale enables zero-shot transfer to downstream tasks. CLIP uses an image encoder and a text encoder to extract visual features and text features. Both sets of features are projected into a latent space with the same number of dimensions, and their dot product gives a similarity score.
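The projection-and-dot-product step described above can be illustrated with a toy, dependency-free sketch. The feature vectors and projection matrices here are made-up values, and `project`, `l2_normalize`, and `similarity` are hypothetical helper names, not part of the Transformers API. The sketch also L2-normalizes both embeddings before the dot product, so the score is a cosine similarity; the real model additionally scales the result by a learned temperature, which is omitted here.

```python
import math

def project(features, weights):
    """Linear projection: multiply a feature vector by an (in_dim x out_dim) matrix."""
    out_dim = len(weights[0])
    return [sum(f * row[j] for f, row in zip(features, weights)) for j in range(out_dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def similarity(image_features, text_features, image_proj, text_proj):
    """Project both modalities into the shared space and take their dot product."""
    image_embed = l2_normalize(project(image_features, image_proj))
    text_embed = l2_normalize(project(text_features, text_proj))
    return sum(a * b for a, b in zip(image_embed, text_embed))

# Toy inputs (assumed values): the encoders output different sizes (3 vs. 2),
# and each projection maps its features into the same 2-dimensional space.
image_features = [1.0, 0.0, 2.0]
text_features = [0.5, 1.5]
image_proj = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 -> 2
text_proj = [[1.0, 0.0], [0.0, 1.0]]               # 2 -> 2

score = similarity(image_features, text_features, image_proj, text_proj)
print(f"similarity: {score:.4f}")
```

Because both embeddings are unit length, the score always falls between -1 and 1, which makes scores for different (image, text) pairs comparable.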

You can find all the original CLIP checkpoints under the OpenAI organization.

Tip

Click on the CLIP models in the right sidebar for more examples of how to apply CLIP to different image and language tasks.

The example below demonstrates how to calculate similarity scores between multiple text descriptions and an image with [Pipeline] or the [AutoModel] class.

import torch
from transformers import pipeline

clip = pipeline(
   task="zero-shot-image-classification",
   model="openai/clip-vit-base-patch32",
   torch_dtype=torch.bfloat16,
   device=0
)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Load the model in bfloat16 with SDPA attention, plus the matching processor
model = AutoModel.from_pretrained("openai/clip-vit-base-patch32", torch_dtype=torch.bfloat16, attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Download the test image and define the candidate text descriptions
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Tokenize the labels and preprocess the image in a single call
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
# logits_per_image has shape (num_images, num_labels)
logits_per_image = outputs.logits_per_image
# Softmax over the labels turns the similarity scores into probabilities
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
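The final steps of the example above (softmax over the labels, then argmax) can be tried without downloading the model. The sketch below is a plain-Python illustration that assumes some made-up logits for a single image; the numbers are hypothetical and are not real model output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Hypothetical image-to-text similarity logits for one image (assumed values)
logits_per_image = [24.5, 19.3, 17.1]

probs = softmax(logits_per_image)
most_likely_idx = max(range(len(labels)), key=probs.__getitem__)
print(f"Most likely label: {labels[most_likely_idx]} with probability: {probs[most_likely_idx]:.3f}")
```

The probabilities are relative to the candidate labels supplied, so adding or removing a label changes every probability even though the underlying similarity scores stay the same.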

Notes

Use [CLIPImageProcessor] to resize (or rescale) and normalize images for the model.
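As a rough illustration of the rescale and normalize steps, the sketch below applies them to a single RGB pixel in plain Python. The mean and standard deviation are assumed to be the values shipped with the original OpenAI checkpoints, so check the checkpoint's preprocessor configuration before relying on them. `rescale_and_normalize` is a hypothetical helper, not a library function, and resizing and cropping are left out.

```python
# Assumed per-channel statistics for the OpenAI CLIP checkpoints (verify against
# the checkpoint's preprocessor config before relying on them)
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def rescale_and_normalize(pixel, mean=CLIP_MEAN, std=CLIP_STD, rescale_factor=1 / 255):
    """Map an (R, G, B) pixel from [0, 255] to [0, 1], then standardize each channel."""
    return tuple((channel * rescale_factor - m) / s for channel, m, s in zip(pixel, mean, std))

white = rescale_and_normalize((255, 255, 255))
black = rescale_and_normalize((0, 0, 0))
print("white:", [round(v, 4) for v in white])
print("black:", [round(v, 4) for v in black])
```

In practice [CLIPImageProcessor] applies the same arithmetic to whole image tensors, so there is no need to do this by hand.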

CLIPConfig

[[autodoc]] CLIPConfig - from_text_vision_configs

CLIPTextConfig

[[autodoc]] CLIPTextConfig

CLIPVisionConfig

[[autodoc]] CLIPVisionConfig

CLIPTokenizer

[[autodoc]] CLIPTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary

CLIPTokenizerFast

[[autodoc]] CLIPTokenizerFast

CLIPImageProcessor

[[autodoc]] CLIPImageProcessor - preprocess

CLIPImageProcessorFast

[[autodoc]] CLIPImageProcessorFast - preprocess

CLIPFeatureExtractor

[[autodoc]] CLIPFeatureExtractor

CLIPProcessor

[[autodoc]] CLIPProcessor

CLIPModel

[[autodoc]] CLIPModel - forward - get_text_features - get_image_features

CLIPTextModel

[[autodoc]] CLIPTextModel - forward

CLIPTextModelWithProjection

[[autodoc]] CLIPTextModelWithProjection - forward

CLIPVisionModelWithProjection

[[autodoc]] CLIPVisionModelWithProjection - forward

CLIPVisionModel

[[autodoc]] CLIPVisionModel - forward

CLIPForImageClassification

[[autodoc]] CLIPForImageClassification - forward

TFCLIPModel

[[autodoc]] TFCLIPModel - call - get_text_features - get_image_features

TFCLIPTextModel

[[autodoc]] TFCLIPTextModel - call

TFCLIPVisionModel

[[autodoc]] TFCLIPVisionModel - call

FlaxCLIPModel

[[autodoc]] FlaxCLIPModel - call - get_text_features - get_image_features

FlaxCLIPTextModel

[[autodoc]] FlaxCLIPTextModel - call

FlaxCLIPTextModelWithProjection

[[autodoc]] FlaxCLIPTextModelWithProjection - call

FlaxCLIPVisionModel

[[autodoc]] FlaxCLIPVisionModel - call
