CLIP ViT-B/32 (LAION-2B)
laion/CLIP-ViT-B-32-laion2B-s34B-b79K
published Sep 2022 · updated Jan 2025
CLIP ViT-B/32 (LAION-2B) is a zero-shot image model that learns visual concepts from natural-language supervision, so it can classify images and retrieve images or text without task-specific training.
specs
| Task | Zero-shot image classification and retrieval |
| Architecture | ViT-B/32 (Vision Transformer) |
| Training Data | LAION-2B (English subset of LAION-5B, 2B image-text pairs) |
| Zero-shot Top-1 Accuracy (ImageNet-1k) | 66.6% |
about this model
laion/CLIP-ViT-B-32-laion2B-s34B-b79K is a zero-shot image classification model based on the CLIP ViT-B/32 architecture. It was trained with OpenCLIP on LAION-2B, the 2-billion-sample English subset of LAION-5B, with training compute provided by stability.ai.
Capabilities
The model performs zero-shot image classification by matching an image against arbitrary text prompts, without task-specific fine-tuning. It also supports image-text retrieval and can serve as a foundation for downstream tasks such as linear probe classification or image generation conditioning.
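The matching step described above can be sketched in a few lines. This is a minimal illustration, not the model's actual implementation: the 3-dimensional embeddings are made-up placeholders (in practice they would come from the model's image and text encoders, for example via OpenCLIP), the prompts are hypothetical, and the `logit_scale` of 100 is an assumed temperature.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Turn image-vs-prompt cosine similarities into a probability per prompt.

    logit_scale is an assumed temperature, not a value taken from this page.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    # Numerically stable softmax over the candidate prompts.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate prompts and toy embeddings (placeholders only).
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image_emb = [0.9, 0.1, 0.2]
text_embs = [[1.0, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.2, 1.0]]

probs = zero_shot_probs(image_emb, text_embs)
best = prompts[probs.index(max(probs))]
print(best, [round(p, 4) for p in probs])
```

Because the class set is just a list of prompts, changing the task means changing the strings; no retraining is involved, which is what "zero-shot" refers to here.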
Benchmark Performance
On ImageNet-1k, the model achieves 66.6% zero-shot top-1 accuracy. It has also been evaluated on the VTAB+ benchmark suite (a combination of the Visual Task Adaptation Benchmark and additional robustness datasets) and, for retrieval, on COCO and Flickr. Detailed per-dataset results are available in the LAION CLIP Benchmark repository.
Training and Dataset
Romain Beaumont trained the model on LAION-2B, an uncurated, web-scale collection of English image-text pairs. The model is a research output, and users should keep the dataset’s uncurated nature in mind; a safety-filtered subset is available. The model is intended for research and constrained experimental use, not for untested commercial deployment.
Attribution
If using this model, cite the OpenAI CLIP paper, the OpenCLIP software, and the forthcoming LAION-5B paper (links provided in the full model card).
best for
- Zero-shot image classification without fine-tuning
- Image-text retrieval (e.g., search images by text descriptions)
- Linear probe classification for custom datasets
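As a rough sketch of the text-to-image retrieval use case listed above, the snippet below ranks a small set of precomputed image embeddings by cosine similarity to a text-query embedding. The file names, the 3-dimensional vectors, and the `search` helper are hypothetical placeholders; real embeddings would come from the model's encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, index, k=2):
    """Return the k image names whose embeddings best match the query."""
    scored = [(cosine(query_emb, emb), name) for name, emb in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical image index: name -> toy embedding (placeholder values).
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Placeholder embedding standing in for an encoded text query.
query_emb = [0.8, 0.3, 0.0]

results = search(query_emb, index, k=2)
print(results)
```

Image embeddings can be computed once and stored, so only the query needs to be encoded at search time; a larger collection would typically use a vector index instead of this linear scan.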
FAQ
The model accepts text prompts and images as input. For classification, provide candidate class names and an image; the model outputs a similarity score for each class name.
To call the model, use the gigarouter OpenAI-compatible endpoint with your API key. Send a request with model name "CLIP-ViT-B-32-laion2B-s34B-b79K" and appropriate inputs.
The model achieves 66.6% zero-shot top-1 accuracy on ImageNet-1k.
According to its model card, any deployed use case, commercial or not, is currently out of scope. The model is intended as a research output.
We're benchmarking and onboarding CLIP ViT-B/32 (LAION-2B) as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.