CLIP ViT-B/16 LAION-2B
laion/CLIP-ViT-B-16-laion2B-s34B-b88K
published Jan 2023 · updated Apr 2023
CLIP ViT-B/16 LAION-2B is a zero-shot image model for image classification and image-text retrieval. It learns visual concepts from natural language supervision.
specs
| Task | Zero-shot image classification, image and text retrieval |
| Architecture | ViT-B/16 (Vision Transformer with patch size 16) |
| Parameters | ~150M (ViT-B/16) |
| License | MIT (OpenCLIP) |
about this model
laion/CLIP-ViT-B-16-laion2B-s34B-b88K is a CLIP (contrastive language-image pre-training) model for zero-shot image classification. It pairs a ViT-B/16 vision encoder with a text encoder and was trained with OpenCLIP on the 2-billion-sample English subset of LAION-5B. The model is hosted as a managed API on gigarouter, providing OpenAI-compatible endpoints for zero-shot inference.
Key Capabilities
- Zero-shot image classification: assign labels to images without task-specific fine-tuning by matching image embeddings to text prompt embeddings.
- Image and text retrieval: rank images by text queries or vice versa using cosine similarity between embeddings.
- Downstream adaptability: can be fine-tuned for image classification, used as a frozen feature extractor for linear probing, or employed for image generation guidance and conditioning.
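The embedding-matching idea behind the first two capabilities can be sketched in a few lines. This is a minimal illustration, not the model's actual inference code: the 3-dimensional vectors are made-up stand-ins for real image and text embeddings, the function names (`zero_shot_classify`, `rank_images`) are hypothetical, and the `logit_scale` of 100 is an assumed temperature chosen for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity: dot product of the unit-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, label_embeddings, logit_scale=100.0):
    """Return a probability per text prompt for one image embedding.

    logit_scale is an assumed temperature, not a value taken from the model.
    """
    labels = list(label_embeddings)
    sims = [cosine_similarity(image_embedding, label_embeddings[l]) for l in labels]
    probs = softmax([logit_scale * s for s in sims])
    return dict(zip(labels, probs))

def rank_images(text_embedding, image_embeddings):
    """Sort images by cosine similarity to a text query, best match first."""
    scored = [(name, cosine_similarity(text_embedding, emb))
              for name, emb in image_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (hypothetical values); a real encoder emits far larger vectors.
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)

image_embeddings = {
    "cat.jpg": [0.8, 0.2, 0.1],
    "dog.jpg": [0.2, 0.8, 0.1],
    "car.jpg": [0.1, 0.0, 0.9],
}
ranking = rank_images(label_embeddings["a photo of a dog"], image_embeddings)
```

Classification and retrieval share the same similarity computation. Classification compares one image against many prompts, while retrieval compares one query against many items in the other modality.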
Training and Data
The model was trained by Mehdi Cherti on the JUWELS Booster supercomputer at Jülich Supercomputing Centre. Training data is the English subset of LAION-5B, an uncurated set of 2 billion image-text pairs crawled from the public internet. The dataset is intended for research, and because it is uncurated, users should be aware that it may contain harmful content.
Benchmark Results
The model achieves 70.2% zero-shot top-1 accuracy on ImageNet-1k. Evaluation used the LAION CLIP Benchmark suite: VTAB+ (the Visual Task Adaptation Benchmark with additional robustness datasets) for classification, and COCO and Flickr for retrieval. Extended benchmark results are available in the CLIP benchmark repository.
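For readers unfamiliar with the metric, top-1 accuracy is the fraction of images whose single highest-scoring label matches the ground truth. The sketch below uses made-up labels and a hypothetical `top1_accuracy` helper; it is not the benchmark suite's implementation.

```python
def top1_accuracy(predictions, ground_truth):
    """Fraction of examples where the top predicted label equals the true label."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Illustrative data only: 3 of 5 predictions match the true label.
predicted = ["cat", "dog", "car", "cat", "dog"]
actual = ["cat", "dog", "cat", "cat", "car"]

accuracy = top1_accuracy(predicted, actual)
print(f"top-1 accuracy: {accuracy:.1%}")
```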
Usage Notes
As per the original OpenAI CLIP model card, this model is a research output intended for research communities. Deployed or unconstrained production use is not recommended without thorough domain-specific testing. Use in surveillance or facial recognition is out of scope. The model is designed for English-language inputs only.
best for
- Zero-shot image classification without task-specific training
- Image and text retrieval (e.g., searching images by natural language descriptions)
- Fine-tuning for downstream image classification tasks
FAQ
CLIP ViT-B/16 LAION-2B is designed for zero-shot image classification and image-text retrieval, allowing you to classify or search images using natural language without task-specific training.
The model achieves 70.2% zero-shot top-1 accuracy on ImageNet-1k.
The model is released under the MIT license via OpenCLIP.
Use the gigarouter OpenAI-compatible endpoint with your API key, sending an image and text prompt for classification or retrieval.
The model was trained on the 2-billion-sample English subset of LAION-5B.
We're benchmarking and onboarding CLIP ViT-B/16 LAION-2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.