CLIP ViT-B/32 DataComp-1B
laion/CLIP-ViT-B-32-DataComp.XL-s13B-b90K
published Sep 2023 · updated Sep 2023
CLIP ViT-B/32 DataComp-1B is a zero-shot image model for image classification and image-text retrieval. It uses a ViT-B/32 Vision Transformer backbone trained on the DataComp-1B dataset.
specs
| Spec | Value |
| --- | --- |
| Task | Zero-shot image classification, image-text retrieval |
| Architecture | ViT-B/32 |
| Parameters | ~150M |
| License | MIT (OpenCLIP) |
about this model
CLIP ViT-B-32-DataComp.XL-s13B-b90K is a CLIP model for zero-shot image classification and image-text retrieval: it can classify images into arbitrary categories without task-specific training. It uses a Vision Transformer (ViT-B/32) image encoder and was trained with OpenCLIP on the DataComp-1B dataset, which comprises 1.4 billion image-text pairs sourced from Common Crawl.
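As a rough illustration of what the "/32" in ViT-B/32 means, the sketch below counts the tokens the image encoder would process for one image. It assumes a 224x224 input resolution and a single class token, neither of which is stated on this page; `vit_token_count` is a hypothetical helper, not an OpenCLIP function.

```python
def vit_token_count(image_size, patch_size, class_token=True):
    """Count the tokens a ViT sees for one square image.

    The image is cut into non-overlapping patch_size x patch_size
    patches; an optional class token is prepended to the sequence.
    """
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side + (1 if class_token else 0)


# Assumed 224x224 input with 32x32 patches: 7 * 7 patches + 1 class token.
tokens = vit_token_count(224, 32)
print(tokens)
```

Under those assumptions, a larger patch size means a shorter token sequence per image, which is one reason B/32 variants are cheaper to run than B/16 variants of the same width and depth.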
Training Data and Procedure
The model was trained on the DataComp-1B dataset, part of the DataComp benchmark designed to study the impact of dataset curation on multimodal model performance. Training used the standard CLIP objective and was performed on the Stability AI cluster. The DataComp paper (arXiv:2304.14108) provides full details of the training procedure and dataset composition.
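The CLIP objective mentioned above is commonly described as a symmetric contrastive loss over a batch of image-text pairs. The following is a minimal pure-Python sketch of that idea, not the OpenCLIP training code: the embeddings are toy values, and `logit_scale=100.0` is an illustrative temperature rather than a value taken from this model.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_loss(image_embs, text_embs, logit_scale=100.0):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    image_embs[i] and text_embs[i] are assumed to be L2-normalized
    embeddings of the i-th matching pair.
    """
    n = len(image_embs)
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_embs]
        for img in image_embs
    ]
    # Image -> text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: the same computation on the transposed matrix.
    cols = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(cols[i])[i] for i in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy batch of two pairs: correctly matched vs. deliberately swapped captions.
matched = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)
```

The loss is low when each image is most similar to its own caption and high when captions are mismatched, which is the signal that pulls matching pairs together during training.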
Performance
The model achieves 72.7% zero-shot top-1 accuracy on ImageNet-1k. It was evaluated on a suite of 38 downstream datasets using the DataComp and LAION CLIP Benchmark repositories. The ImageNet result suggests that the DataComp-1B curation approach can produce competitive zero-shot vision models.
Capabilities
As a zero-shot image model, it can classify images into arbitrary categories defined by text prompts, retrieve images from text queries, and serve as a starting point for fine-tuning or linear probing. The model is hosted on gigarouter as a managed, OpenAI-compatible API, enabling direct integration without infrastructure setup.
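Classification by text prompt can be sketched as follows, assuming the image and each prompt have already been embedded into the same vector space. The three-dimensional vectors, the prompts, and `logit_scale=100.0` are made-up stand-ins for illustration; in practice the embeddings would come from the model's image and text encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_emb, prompt_embs, logit_scale=100.0):
    """Return {prompt: probability} from scaled cosine similarities."""
    img = normalize(image_emb)
    logits = {
        prompt: logit_scale * sum(a * b for a, b in zip(img, normalize(emb)))
        for prompt, emb in prompt_embs.items()
    }
    # Softmax over the candidate prompts, shifted by the max for stability.
    m = max(logits.values())
    exps = {prompt: math.exp(l - m) for prompt, l in logits.items()}
    z = sum(exps.values())
    return {prompt: e / z for prompt, e in exps.items()}


# Hypothetical embeddings: one candidate class per text prompt.
prompt_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_emb, prompt_embs)
best = max(probs, key=probs.get)
print(best)
```

Because the categories are just text, changing the label set only requires embedding a different list of prompts; no retraining is involved.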
best for
- Zero-shot image classification on arbitrary categories
- Image-text retrieval and search
- Linear probing or fine-tuning for custom classification tasks
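The retrieval use case in the list above amounts to ranking stored image embeddings by similarity to a text query embedding. A minimal sketch, assuming embeddings are precomputed; the file names, vectors, and the `retrieve` helper are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, image_index, k=2):
    """Return the names of the k images most similar to the query."""
    scored = [(cosine(query_emb, emb), name) for name, emb in image_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical precomputed image embeddings and one text query embedding.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.2, 0.1, 0.9],
}
query_emb = [0.8, 0.3, 0.1]

top = retrieve(query_emb, image_index, k=2)
print(top)
```

For a large collection, the image embeddings would typically be computed once and stored, so that each search only needs to embed the query text.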
FAQ
CLIP ViT-B/32 DataComp-1B achieves 72.7% zero-shot top-1 accuracy on ImageNet-1k.
The architecture is a ViT-B/32 Vision Transformer, which splits each image into 32x32-pixel patches.
The model was trained on the DataComp-1B dataset, which contains 1.4 billion image-text pairs.
To use the model, call the gigarouter OpenAI-compatible endpoint with your API key, sending an image and candidate text prompts for zero-shot classification or retrieval.
The model is released under the MIT license (OpenCLIP).
We're benchmarking and onboarding CLIP ViT-B/32 DataComp-1B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.