CLIP ViT-B/32

openai/clip-vit-base-patch32

published Mar 2022 · updated Feb 2024

CLIP ViT-B/32 is a zero-shot image classification model that learns visual concepts from natural language supervision, enabling it to classify images into arbitrary categories without task-specific training.

status

coming soon

API providers

downloads / mo

22.3M

specs

Task	Zero-Shot Image Classification
Architecture	ViT-B/32 image encoder + masked self-attention Transformer text encoder
Training Data	400 million (image, text) pairs from the internet

about this model

CLIP (Contrastive Language-Image Pre-Training) learns visual concepts from natural language supervision. Because the candidate class names are supplied as text at inference time, it can classify images into arbitrary categories without task-specific training data.

Architecture

The model uses a ViT-B/32 Vision Transformer as the image encoder and a masked self-attention Transformer as the text encoder. Both encoders are jointly trained via a contrastive loss to maximize the similarity of correct (image, text) pairs. The resulting representations allow the model to perform zero-shot classification by computing the similarity between an image and a set of candidate text descriptions.
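The contrastive objective described above can be sketched in plain Python. This is a minimal illustration, not the training code: the function names, the toy 2-dimensional embeddings, and the fixed temperature value are assumptions made for the example, and a real training run would operate on large batches of encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_embs is assumed to be the correct match for row i of
    text_embs. The temperature is a fixed illustrative value here.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each row should put its mass on the diagonal.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return (loss_i2t + loss_t2i) / 2


# Toy check: correctly paired embeddings should give a lower loss than swapped ones.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls matching image and text embeddings together and pushes mismatched pairs apart, which is what makes similarity-based zero-shot classification possible afterwards.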

Training Data

CLIP was trained on 400 million (image, text) pairs collected from publicly available internet sources, including crawled websites and existing datasets such as YFCC100M. This large-scale, diverse dataset supports strong generalization across many visual domains.

Performance

On ImageNet, the CLIP paper reports zero-shot accuracy matching the original ResNet-50 without using any of the 1.28 million labeled training examples. That headline result is for the strongest model in the CLIP family, so the smaller ViT-B/32 checkpoint should not be assumed to reach it. CLIP has been evaluated on over 30 computer vision benchmarks spanning OCR, action recognition, geo-localization, fine-grained classification, and more. Reported results on the FairFace dataset include:

·Gender classification on FairFace: >96% accuracy across all race groups, with Middle Eastern highest (98.4%) and White lowest (96.5%).
·Race classification on FairFace: ~93% average accuracy.
·Age classification on FairFace: ~63% average accuracy.

Limitations and Bias

CLIP struggles with fine-grained classification, object counting, and tasks requiring precise spatial reasoning. Performance varies significantly with class taxonomy design, and the model exhibits racial and gender biases. For example, the rates at which images of people are assigned crime-related or non-human animal labels differ across demographic groups. Evaluate these limitations carefully for the specific use case before any deployment.

best for

·Zero-shot classification of images into arbitrary categories
·Image-text similarity search and retrieval
·Serving as the image-text encoder in multimodal applications such as visual question answering
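For the similarity search and retrieval use case, the ranking step can be sketched as follows. This is a minimal sketch under stated assumptions: the `top_k` helper, the file names, and the 3-dimensional embeddings are hypothetical stand-ins, whereas real embeddings would come from the CLIP image and text encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_emb, index, k=2):
    """Rank indexed items by cosine similarity to the query embedding.

    index is a list of (item_id, embedding) pairs.
    Returns the k best (item_id, score) pairs, highest score first.
    """
    q = normalize(query_emb)
    scored = [
        (item_id, sum(a * b for a, b in zip(q, normalize(emb))))
        for item_id, emb in index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy image index; these vectors are made up for illustration.
index = [
    ("beach.jpg", [0.9, 0.1, 0.0]),
    ("forest.jpg", [0.1, 0.9, 0.1]),
    ("city.jpg", [0.0, 0.2, 0.9]),
]

# Stand-in for the text embedding of a query such as "a trail through trees".
query = [0.2, 0.8, 0.0]
results = top_k(query, index, k=2)
```

Because images and text share one embedding space, the same ranking works for text-to-image, image-to-text, and image-to-image search. A large collection would typically use an approximate nearest-neighbor index rather than this linear scan.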

FAQ

What is CLIP ViT-B/32 best used for?

It is best for zero-shot image classification and image-text similarity tasks without needing task-specific training data.

How does CLIP ViT-B/32 work?

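It uses contrastive pre-training to align image and text embeddings in a shared space. At inference time it classifies an image zero-shot by comparing the image embedding with the embeddings of candidate text descriptions and picking the closest match.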
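The comparison step can be illustrated with a short sketch. The embeddings below are toy values invented for the example, and the `logit_scale` of 100 is an illustrative assumption rather than a value taken from this page.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities, one probability per prompt."""
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Toy embeddings standing in for the text and image encoder outputs.
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
```

Changing the set of classes only requires changing the list of prompts, since no classifier head is trained.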

What are the input and output formats?

Input: an image and a list of text prompts. Output: similarity scores or class probabilities.
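To make the input and output shapes concrete, here is a hedged sketch of running the checkpoint locally with the Hugging Face Transformers library. It assumes `torch`, `transformers`, and `Pillow` are installed and that the weights can be downloaded. The `classify` wrapper and the image path are hypothetical, and this shows local use, not the hosted API.

```python
MODEL_ID = "openai/clip-vit-base-patch32"


def classify(image_path, prompts):
    """Return a {prompt: probability} mapping for the image at image_path."""
    # Imported inside the function so the heavy dependencies load only when it is called.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(MODEL_ID)
    processor = CLIPProcessor.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    # The processor tokenizes the prompts and resizes/normalizes the image.
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (num_images, num_prompts); softmax turns
    # the similarity scores for the single image into class probabilities.
    probs = outputs.logits_per_image.softmax(dim=1)[0]
    return dict(zip(prompts, probs.tolist()))
```

A call such as `classify("photo.jpg", ["a photo of a cat", "a photo of a dog"])` would return one probability per prompt, summing to 1. For repeated calls, loading the model and processor once outside the function avoids reloading the weights each time.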

What are the limitations of this model?

It struggles with fine-grained classification and object counting, and it exhibits biases in race, gender, and age classification.

How can I call this model via the gigarouter API?

The model is not yet live on gigarouter. Once it is, use the gigarouter OpenAI-compatible endpoint with an API key, sending image and text inputs as specified in the documentation.

not yet live

We're benchmarking and onboarding CLIP ViT-B/32 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models


clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336

siglip-so400m-patch14-384

1.8M dl/mo