CLIP ViT-B/32
openai/clip-vit-base-patch32
published Mar 2022 · updated Feb 2024
CLIP ViT-B/32 is a zero-shot image classification model that learns visual concepts from natural language supervision, enabling it to classify images into arbitrary categories without task-specific training.
specs
| Task | Zero-Shot Image Classification |
| Architecture | ViT-B/32 image encoder + masked self-attention Transformer text encoder |
| Training Data | 400 million (image, text) pairs from the internet |
about this model
Architecture
The model uses a ViT-B/32 Vision Transformer as the image encoder and a masked self-attention Transformer as the text encoder. Both encoders are jointly trained via a contrastive loss to maximize the similarity of correct (image, text) pairs. The resulting representations allow the model to perform zero-shot classification by computing the similarity between an image and a set of candidate text descriptions.
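The contrastive objective described above can be sketched in a few lines. This is a minimal illustration, assuming one common formulation (symmetric cross-entropy over a batch similarity matrix); the text above does not spell out the exact loss. The function names, the `temperature` value of 0.07, and the toy 2-D embeddings are all assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the correct entry."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch where pair i is (image i, text i).

    temperature is an assumed illustrative value, not one stated in this document.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should peak on the diagonal.
    loss_images = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on the diagonal.
    loss_texts = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_images + loss_texts) / 2

# Hypothetical embeddings, for illustration only.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairs are mismatched, which is the pressure that pulls correct (image, text) pairs together during training.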
Training Data
CLIP was trained on 400 million (image, text) pairs collected from publicly available internet sources, including crawled websites and existing datasets such as YFCC100M. This large-scale, diverse dataset supports strong generalization across many visual domains.
Performance
On ImageNet, zero-shot CLIP matches the accuracy of the original ResNet-50 without using any of the 1.28 million labeled training examples. The model has been evaluated on over 30 computer vision benchmarks spanning OCR, action recognition, geo-localization, fine-grained classification, and more. Results reported on the FairFace dataset include:
- Gender classification on FairFace: >96% accuracy across all race groups, with Middle Eastern highest (98.4%) and White lowest (96.5%).
- Racial classification on FairFace: ~93% average accuracy.
- Age classification on FairFace: ~63% average accuracy.
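Per-group figures like those above come from splitting an evaluation set by demographic group and scoring each split separately. The following is a minimal sketch of that bookkeeping, not the evaluation code behind the reported numbers; the function names, group names, and records are made up for illustration and are not FairFace data.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return {group: accuracy} from (group, predicted_label, true_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        if predicted == actual:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}

def accuracy_gap(per_group):
    """Difference between the highest and lowest per-group accuracy."""
    return max(per_group.values()) - min(per_group.values())

# Made-up records for illustration; these are NOT FairFace results.
records = [
    ("group_a", "female", "female"),
    ("group_a", "male", "male"),
    ("group_a", "male", "female"),
    ("group_a", "female", "female"),
    ("group_b", "male", "male"),
    ("group_b", "female", "female"),
]
per_group = accuracy_by_group(records)
print(per_group)
print(accuracy_gap(per_group))
```

Applied to the gender-classification figures quoted above, the gap between the highest and lowest group is 98.4 - 96.5 = 1.9 percentage points.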
Limitations and Bias
CLIP struggles with fine-grained classification, object counting, and tasks requiring precise spatial reasoning. Performance varies significantly with class taxonomy design, and the model exhibits racial and gender biases: for example, rates of classification into crime-related and non-human animal categories differ across demographic groups. Evaluate these limitations carefully before any deployment.
best for
- Zero-shot classification of images into arbitrary categories
- Image-text similarity search and retrieval
- Multimodal applications like visual question answering
FAQ
It is best for zero-shot image classification and image-text similarity tasks without needing task-specific training data.
It uses contrastive pre-training to align image and text embeddings, then classifies an image zero-shot by comparing its embedding with the embeddings of candidate text descriptions.
Input: an image and a list of text prompts. Output: similarity scores or class probabilities.
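As a rough illustration of how similarity scores become class probabilities, the sketch below compares one image embedding with one embedding per text prompt and applies a softmax. It assumes the embeddings have already been produced by the image and text encoders; the vectors, the prompts, the function names, and the `logit_scale` value are hypothetical and chosen only for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_emb, prompt_embs, logit_scale=100.0):
    """Return (similarity scores, class probabilities) for one image.

    logit_scale is an assumed illustrative value that sharpens the softmax.
    """
    sims = [cosine_similarity(image_emb, p) for p in prompt_embs]
    return sims, softmax([logit_scale * s for s in sims])

prompts = ["a photo of a cat", "a photo of a dog"]
# Hypothetical embeddings; a real system would obtain them from the two encoders.
image_embedding = [0.9, 0.1, 0.2]
prompt_embeddings = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]

sims, probs = zero_shot_probs(image_embedding, prompt_embeddings)
best = prompts[probs.index(max(probs))]
print(best)
```

Because the class list is just a list of prompts, changing the categories means changing the text, not retraining the model.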
It struggles with fine-grained classification and object counting, and it exhibits biases in race, gender, and age classification.
Use the gigarouter OpenAI-compatible endpoint with an API key, sending image and text inputs as specified in the documentation.
We're benchmarking and onboarding CLIP ViT-B/32 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.