CLIP ViT-B/16
openai/clip-vit-base-patch16
published Mar 2022 · updated Oct 2022
CLIP ViT-B/16 is a zero-shot image classification model that learns visual concepts from natural language supervision using contrastive learning.
specs
| Spec | Value |
| --- | --- |
| Task | Zero-Shot Image Classification / Image-Text Similarity |
| Architecture | ViT-B/16 image encoder + masked self-attention text encoder |
| Training Data | 400 million (image, text) pairs |
| Language | English |
about this model
CLIP (Contrastive Language-Image Pre-Training) pairs a ViT-B/16 image encoder with a masked self-attention text encoder. The two encoders are trained jointly with a contrastive objective that maximizes the similarity of matching (image, text) pairs and minimizes it for mismatched pairs in the same batch. As a result, the model can classify images into arbitrary categories described in natural language without any dataset-specific fine-tuning.
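The contrastive objective can be illustrated with a minimal pure-Python sketch. It is not the actual training code: the function names are made up for this example, the embeddings are assumed to come from the two encoders already, and the temperature is held at a fixed illustrative value of 0.07 rather than learned.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    The temperature is an illustrative fixed value (an assumption).
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: the correct text for image i sits at index i.
    loss_images = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    loss_texts = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_images + loss_texts) / 2

# Toy 2-d embeddings: perfectly aligned pairs versus swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss(aligned, aligned)
loss_swapped = contrastive_loss(aligned, swapped)
```

With these toy inputs the aligned pairs yield a loss close to zero, while the swapped pairs are penalized heavily, which is the behavior the objective is meant to reward.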
Key Capabilities
Trained on 400 million (image, text) pairs collected from the internet, CLIP learns transferable visual concepts that generalize to a wide range of downstream tasks. The model can be prompted with text descriptions (e.g., "a photo of a cat" or "a satellite image") to predict the most relevant label for a given image, achieving competitive zero-shot performance across over 30 computer vision benchmarks spanning OCR, action recognition, geo-localization, and fine-grained object classification.
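The prompting step described above can be sketched as follows, assuming the image and each text prompt have already been encoded. The 3-dimensional vectors are toy stand-ins for real encoder outputs, `zero_shot_classify` is a hypothetical helper name, and the similarity scale of 100 is an illustrative assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, prompt_embs, scale=100.0):
    """Return (best_prompt, probabilities) for one image embedding.

    prompt_embs maps each text prompt to its embedding.
    scale sharpens the softmax; its value here is an assumption.
    """
    prompts = list(prompt_embs)
    logits = [scale * cosine(image_emb, prompt_embs[p]) for p in prompts]
    # Stable softmax over the prompt logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = {p: e / total for p, e in zip(prompts, exps)}
    best = max(probs, key=probs.get)
    return best, probs

# Toy embeddings (assumed, not real model outputs).
prompt_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a satellite image": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(image_emb, prompt_embs)
print(label)
```

Because the label set is just a dictionary of prompts, changing the categories means changing the text, not retraining anything.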
Benchmark Results
On ImageNet, zero-shot CLIP matches the accuracy of the original ResNet-50 without using any of the 1.28 million labeled training examples. Evaluated on the FairFace dataset for demographic classification, CLIP scores:
- Gender classification: >96% accuracy across all races (highest 98.4% for Middle Eastern, lowest 96.5% for White)
- Racial classification: ~93% accuracy
- Age classification: ~63% accuracy
Known Limitations
CLIP struggles with fine-grained classification and object counting. Performance can vary significantly based on class taxonomy design, and bias and fairness disparities have been observed, particularly in the classification of people. The model was trained primarily on English-language data and is intended as a research tool; untested deployment in commercial or surveillance applications is discouraged.
best for
- Zero-shot classification of images into arbitrary categories without training
- Image search by natural language description
- Content moderation by matching images to predefined text labels
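As a rough illustration of the image-search use case, the sketch below ranks a small set of precomputed image embeddings against one text-query embedding. The file names, the toy vectors, and the `search` helper are all hypothetical; in practice the embeddings would come from the image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, image_index, top_k=2):
    """Return the top_k image names most similar to the query embedding."""
    scored = [(cosine(query_emb, emb), name) for name, emb in image_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Hypothetical index of precomputed image embeddings (toy values).
image_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.2, 0.1, 0.9],
}
# Stands in for the encoded text query, e.g. "a sunny beach".
query_emb = [0.85, 0.2, 0.05]

results = search(query_emb, image_index)
print(results)
```

The same ranking, combined with a similarity threshold, is one way to match images against a fixed list of text labels.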
FAQ
CLIP ViT-B/16 is best for zero-shot image classification and image-text similarity tasks, with no fine-tuning on labeled data required.
Send requests to the OpenAI-compatible endpoint using your API key. Provide an image URL and a list of text prompts to get similarity scores.
The model accepts images (as URLs or file uploads) and text prompts, and computes the similarity between the image and each prompt.
The model uses a Vision Transformer (ViT-B/16) as its image encoder and a Transformer with masked self-attention as its text encoder, trained jointly with a contrastive loss.
The model card states that any deployed use case is currently out of scope due to safety concerns. Use on gigarouter is recommended for research and evaluation.
We're benchmarking and onboarding CLIP ViT-B/16 as a hosted, OpenAI-compatible API.