CLIP ViT-B/32 DataComp-1B

laion/CLIP-ViT-B-32-DataComp.XL-s13B-b90K

published Sep 2023 · updated Sep 2023

CLIP ViT-B/32 DataComp-1B is a zero-shot image model that performs image classification and image-text retrieval using a Vision Transformer backbone trained on the DataComp-1B dataset.

status

coming soon

API providers

downloads / mo

47.1K

license

mit

specs

Task	Zero-shot image classification, image-text retrieval
Architecture	ViT-B/32
Parameters	~150M
License	MIT (OpenCLIP)

about this model

CLIP ViT-B-32-DataComp.XL-s13B-b90K is a CLIP model for zero-shot image classification and image-text retrieval: it can classify images into arbitrary categories without task-specific training. It uses a Vision Transformer (ViT-B/32) image encoder and was trained with OpenCLIP on the DataComp-1B dataset, which comprises 1.4 billion image-text pairs sourced from Common Crawl.

Training Data and Procedure

The model was trained on the DataComp-1B dataset, part of the DataComp benchmark designed to study the impact of dataset curation on multimodal model performance. Training used the standard CLIP objective and was performed on the Stability AI cluster. The DataComp paper (arXiv:2304.14108) provides full details of the training procedure and dataset composition.
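The "standard CLIP objective" mentioned above is a symmetric contrastive loss over matching image-text pairs. The following is a minimal pure-Python sketch of that idea, not the OpenCLIP training code: the function names are hypothetical, the embeddings are toy values, and the fixed temperature of 0.07 is an illustrative assumption (in CLIP training the temperature is a learned parameter).

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: the correct text for image i is on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch: correctly paired embeddings versus the same embeddings mispaired.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each image is most similar to its own caption and high when the pairs are mismatched, which is the signal that pulls matching images and texts together in the shared embedding space.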

Performance

The model achieves a 72.7% zero-shot top-1 accuracy on ImageNet-1k. Evaluation was conducted across 38 diverse downstream datasets using the DataComp and LAION CLIP Benchmark repositories. This result demonstrates the effectiveness of the DataComp-1B dataset curation approach for training competitive zero-shot vision models.

Capabilities

As a zero-shot image model, it can classify images into arbitrary categories defined by text prompts, retrieve images from text queries, and serve as a starting point for fine-tuning or linear probing. gigarouter is onboarding the model as a managed, OpenAI-compatible API, which is intended to allow direct integration without infrastructure setup.
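Classification by text prompt reduces to comparing one image embedding against one text embedding per candidate label. The sketch below illustrates only that comparison step, under stated assumptions: the 3-dimensional vectors are made-up stand-ins for the embeddings the model's image and text encoders would produce, `zero_shot_classify` is a hypothetical helper, and the logit scale of 100 is an illustrative value.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_emb, prompt_embs, scale=100.0):
    """Return (best_prompt, probabilities) for one image embedding.

    prompt_embs maps each text prompt to its embedding. The scale factor
    is an assumed logit scale applied before the softmax.
    """
    labels = list(prompt_embs)
    logits = [scale * cosine(image_emb, prompt_embs[label]) for label in labels]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # subtract max for stability
    total = sum(exps)
    probs = {label: e / total for label, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs

# Toy embeddings (assumptions for illustration, not real model outputs).
prompts = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
label, probs = zero_shot_classify([0.8, 0.2, 0.1], prompts)
print(label)
```

Because the categories are just text prompts, changing the label set means changing the dictionary keys and recomputing text embeddings; no retraining is involved.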

best for

·Zero-shot image classification on arbitrary categories
·Image-text retrieval and search
·Linear probing or fine-tuning for custom classification tasks
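For the retrieval use case, a text query embedding is ranked against a set of precomputed image embeddings. This is a minimal sketch assuming the embeddings already exist; the file names, vectors, and the `retrieve` helper are hypothetical and chosen only for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_emb, image_embs, k=2):
    """Return the ids of the k images most similar to the query embedding."""
    scored = [(cosine(query_emb, emb), image_id) for image_id, emb in image_embs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored[:k]]

# Toy index of image embeddings (made-up values, not model outputs).
index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.2, 0.1, 0.9],
}
top = retrieve([0.8, 0.3, 0.0], index, k=2)
print(top)
```

A brute-force scan like this is adequate for small collections; larger indexes would typically use an approximate nearest-neighbor structure instead.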

FAQ

What is the model's zero-shot accuracy on ImageNet?

It achieves 72.7% top-1 accuracy on ImageNet-1k.

What architecture does this model use?

It uses a ViT-B/32 Vision Transformer with a 32x32 patch size.
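As a quick illustration of what the patch size implies, the sketch below counts the tokens the image encoder processes. It assumes a 224x224 input resolution and a single class token, neither of which is stated on this page; `vit_token_count` is a hypothetical helper.

```python
def vit_token_count(image_size, patch_size, class_token=True):
    """Number of tokens a ViT sees for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + (1 if class_token else 0)

# Assumed 224x224 input: 7 x 7 = 49 patches, plus 1 class token.
tokens = vit_token_count(224, 32)
print(tokens)
```

The large patch size keeps the sequence short, which is why the /32 variants are cheaper to run than /16 or /14 variants at the same resolution.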

What dataset was it trained on?

It was trained on the DataComp-1B dataset, which contains 1.4 billion image-text pairs.

How do I call this model via the gigarouter API?

The model is not yet live. Once it is available, use the gigarouter OpenAI-compatible endpoint with your API key, sending an image and text prompts for zero-shot classification or retrieval.

What is the license for this model?

The model uses the MIT license from OpenCLIP.

not yet live

We're benchmarking and onboarding CLIP ViT-B/32 DataComp-1B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336