CLIP ViT-B/16

openai/clip-vit-base-patch16

published Mar 2022 · updated Oct 2022

CLIP ViT-B/16 is a zero-shot image classification model that learns visual concepts from natural language supervision using contrastive learning.

status

coming soon

API providers

downloads / mo

1.6M

specs

Task	Zero-Shot Image Classification / Image-Text Similarity
Architecture	ViT-B/16 image encoder + masked self-attention text encoder
Training Data	400 million (image, text) pairs
Language	English

about this model

CLIP (Contrastive Language-Image Pre-Training) is a zero-shot image classification model that pairs a ViT-B/16 image encoder with a masked self-attention text encoder. The two encoders are trained with a contrastive loss that maximizes the similarity of matched (image, text) pairs. This lets the model classify images into arbitrary categories described in natural language without any dataset-specific fine-tuning.
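As a rough illustration of the contrastive objective described above, the sketch below computes a symmetric cross-entropy loss over a tiny batch of paired embeddings in plain Python. The function names, the toy vectors, and the fixed `logit_scale` of 100 are assumptions made for this example and are not taken from the model's training code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def clip_contrastive_loss(image_embs, text_embs, logit_scale=100.0):
    """Symmetric contrastive loss: pair i of images matches pair i of texts.

    logit_scale is an assumed fixed value for illustration only.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = scaled cosine similarity of image i and text j
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]
    # Image-to-text: each row should pick out its own column.
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should pick out its own row.
    loss_texts = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_images + loss_texts) / 2


# Toy batch: two images and two captions in a 2-dimensional embedding space.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]

aligned = clip_contrastive_loss(images, texts)         # captions match images
shuffled = clip_contrastive_loss(images, texts[::-1])  # captions swapped
```

In this toy batch the loss is close to zero when each caption sits next to its own image and large when the captions are swapped, which is the behavior the training objective rewards.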

Key Capabilities

Trained on 400 million (image, text) pairs collected from the internet, CLIP learns transferable visual concepts that generalize to a wide range of downstream tasks. The model can be prompted with text descriptions (e.g., "a photo of a cat" or "a satellite image") to predict the most relevant label for a given image. It achieves competitive zero-shot performance on more than 30 computer vision benchmarks spanning OCR, action recognition, geo-localization, and fine-grained object classification.
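The prompting step can be sketched as follows, assuming the image and each text prompt have already been encoded into embedding vectors. The three-dimensional vectors here are made-up stand-ins for real encoder outputs, and `zero_shot_probs` and the `logit_scale` value are hypothetical names and settings chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text similarities, one probability per prompt."""
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a photo of a cat", "a satellite image"]

# Made-up embeddings standing in for encoder outputs (assumption).
image_emb = [0.9, 0.1, 0.2]
text_embs = [
    [1.0, 0.0, 0.1],  # stand-in for the embedding of labels[0]
    [0.0, 1.0, 0.0],  # stand-in for the embedding of labels[1]
]

probs = zero_shot_probs(image_emb, text_embs)
best_label = labels[probs.index(max(probs))]
```

Because the label set is just a list of strings, changing the categories only means encoding a different list of prompts; no retraining is involved.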

Benchmark Results

On ImageNet, CLIP matches the accuracy of the original ResNet-50 in a zero-shot setting, without using any of the 1.28 million labeled training examples. In an evaluation of demographic classification on the FairFace dataset, the model achieved:

·Gender classification: >96% accuracy across all races (highest 98.4% for Middle Eastern, lowest 96.5% for White)
·Racial classification: ~93% accuracy
·Age classification: ~63% accuracy

Known Limitations

CLIP struggles with fine-grained classification and object counting. Performance can vary significantly with the design of the class taxonomy, and bias and fairness disparities have been observed, particularly when classifying people. The model was trained primarily on English-language data and is intended as a research tool; deploying it in commercial or surveillance applications without task-specific testing is discouraged.

best for

·Zero-shot classification of images into arbitrary categories without training
·Image search by natural language description
·Content moderation by matching images to predefined text labels
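
For the image search use case listed above, one possible approach is to embed every image once, normalize the vectors, and rank them against the embedding of a text query. The sketch below assumes the embeddings already exist; the file names, vectors, and the `build_index` and `search` helpers are hypothetical and only illustrate the ranking step.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(image_embs):
    """Normalize image embeddings once so search only needs dot products."""
    return {name: normalize(emb) for name, emb in image_embs.items()}


def search(query_emb, index, top_k=2):
    """Return the names of the top_k images most similar to the query."""
    q = normalize(query_emb)
    scored = [
        (sum(a * b for a, b in zip(q, emb)), name) for name, emb in index.items()
    ]
    scored.sort(reverse=True)  # highest cosine similarity first
    return [name for _, name in scored[:top_k]]


# Made-up image embeddings standing in for image encoder outputs (assumption).
index = build_index({
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
})

# Stand-in for the text encoder's embedding of a query such as "a sandy beach".
query_emb = [0.8, 0.3, 0.0]
results = search(query_emb, index)
```

Normalizing at index time keeps each query to a single pass of dot products, which is a common design choice when the image collection changes rarely.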

FAQ

What is CLIP ViT-B/16 best used for?

It is best for zero-shot image classification and image-text similarity tasks without requiring fine-tuning on labeled data.

How do I call this model via the gigarouter API?

The model is not yet live. Once it is, send requests to the OpenAI-compatible endpoint using your API key, providing an image URL and a list of text prompts to get similarity scores.

What input formats does CLIP ViT-B/16 accept?

It accepts images (as URLs or file uploads) and text prompts. The model computes similarity between the image and each text prompt.

What is the architecture of CLIP ViT-B/16?

It uses a Vision Transformer (ViT-B/16) as the image encoder and a Transformer with masked self-attention as the text encoder, trained jointly with a contrastive loss.

Is CLIP ViT-B/16 suitable for commercial deployment?

The model card states that any deployed use case, whether commercial or not, is currently out of scope due to safety concerns. Use on gigarouter is recommended for research and evaluation.

not yet live

We're benchmarking and onboarding CLIP ViT-B/16 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336