CLIP ViT-B/32 DataComp-1B

laion/CLIP-ViT-B-32-DataComp.XL-s13B-b90K

published Sep 2023 · updated Sep 2023

CLIP ViT-B/32 DataComp-1B is a zero-shot image model that performs image classification and image-text retrieval using a Vision Transformer backbone trained on the DataComp-1B dataset.

status
coming soon
API providers
0
downloads / mo
47.1K
license
mit

specs

Task: Zero-shot image classification, image-text retrieval
Architecture: ViT-B/32
Parameters: ~150M
License: MIT (OpenCLIP)

about this model

CLIP ViT-B-32-DataComp.XL-s13B-b90K is a CLIP model for zero-shot image classification and image-text retrieval: it classifies images into arbitrary, text-defined categories without task-specific training. It uses a Vision Transformer (ViT-B/32) image encoder and was trained with OpenCLIP on the DataComp-1B dataset, which comprises 1.4 billion image-text pairs sourced from Common Crawl.

Training Data and Procedure

The model was trained on the DataComp-1B dataset, part of the DataComp benchmark designed to study the impact of dataset curation on multimodal model performance. Training used the standard CLIP objective and was performed on the Stability AI cluster. The DataComp paper (arXiv:2304.14108) provides full details of the training procedure and dataset composition.
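As a rough illustration of what "the standard CLIP objective" usually refers to, the sketch below computes a symmetric contrastive loss over a small batch of paired embeddings. It is a minimal pure-Python sketch, not the training code used for this model: the function names are hypothetical, the embeddings are toy values, and the default `logit_scale` of 1/0.07 is an assumption based on the commonly cited CLIP initialization (in real training this scale is a learned parameter).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_contrastive_loss(image_embs, text_embs, logit_scale=1 / 0.07):
    """Symmetric contrastive loss; image i and text i are assumed to be a matching pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in txts]
        for img in imgs
    ]
    # Image-to-text direction: each row should pick out its own text.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should pick out its own image.
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch of two pairs (illustrative values only).
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
matched_loss = clip_contrastive_loss(images, matched_texts)
swapped_loss = clip_contrastive_loss(images, swapped_texts)
```

With these toy vectors, the correctly paired batch yields a much smaller loss than the batch whose captions are swapped, which is the behavior the objective rewards.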

Performance

The model achieves 72.7% zero-shot top-1 accuracy on ImageNet-1k, which suggests that the DataComp-1B curation approach is effective for training competitive zero-shot vision models. The broader evaluation covered 38 diverse downstream datasets and was run using the DataComp and LAION CLIP Benchmark repositories.

Capabilities

As a zero-shot image model, it can classify images into arbitrary categories defined by text prompts, retrieve images from text queries, and serve as a starting point for fine-tuning or linear-probe tasks. gigarouter is onboarding the model as a managed, OpenAI-compatible API, which is intended to enable direct integration without infrastructure setup; the hosted endpoint is not yet live.
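The classification step described above can be sketched in a few lines: each candidate label is written as a text prompt, the image and prompts are embedded, and the label whose embedding is most similar to the image embedding wins. The sketch below assumes the embeddings have already been produced by the model's image and text encoders (for example via OpenCLIP); the vectors, the prompts, the function names, and the `logit_scale` of 100 are illustrative assumptions, not values taken from this model.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs_by_label, logit_scale=100.0):
    """Return a {label: probability} dict from scaled cosine similarities."""
    img = l2_normalize(image_emb)
    labels = list(text_embs_by_label)
    logits = []
    for label in labels:
        txt = l2_normalize(text_embs_by_label[label])
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    return dict(zip(labels, softmax(logits)))

# Toy 3-dimensional embeddings standing in for real encoder outputs.
prompt_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, prompt_embeddings)
best_label = max(probs, key=probs.get)
```

Text-to-image retrieval follows the same pattern with the roles reversed: one text embedding is scored against many image embeddings, and the images are ranked by similarity.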

FAQ

What is the model's zero-shot accuracy on ImageNet?

It achieves 72.7% top-1 accuracy on ImageNet-1k.

What architecture does this model use?

It uses a ViT-B/32 Vision Transformer, which splits each input image into 32x32-pixel patches.
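To make the patch size concrete, the short sketch below counts how many tokens the image encoder would process. It assumes a 224x224 input resolution and a single class token, which are common defaults for ViT-B/32 but are not stated on this page; the function name is hypothetical.

```python
def vit_sequence_length(image_size, patch_size, has_class_token=True):
    """Number of tokens a ViT sees for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    return num_patches + (1 if has_class_token else 0)

# Assumed 224x224 input: 7 x 7 = 49 patches, plus one class token.
tokens = vit_sequence_length(224, 32)
```

Under the same assumptions, a 16x16 patch size would produce roughly four times as many tokens, which is why the /32 variants are comparatively cheap to run.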

What dataset was it trained on?

It was trained on the DataComp-1B dataset, which contains 1.4 billion image-text pairs.

How do I call this model via the gigarouter API?

The hosted API is not yet live. Once it is, you will use the gigarouter OpenAI-compatible endpoint with your API key, sending an image and text prompts for zero-shot classification or retrieval.

What is the license for this model?

The model uses the MIT license from OpenCLIP.

not yet live

We're benchmarking and onboarding CLIP ViT-B/32 DataComp-1B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
