CLIP ViT-L/14 LAION-2B
laion/CLIP-ViT-L-14-laion2B-s32B-b82K
published Sep 2022 · updated Jan 2024
CLIP ViT-L/14 LAION-2B is a zero-shot image classification model that learns visual concepts from natural-language supervision, enabling image classification and retrieval without task-specific training.
specs
| Spec | Value |
| --- | --- |
| Task | Zero-shot image classification and image-text retrieval |
| Architecture | ViT-L/14 |
| Parameters | 428M |
| License | MIT |
about this model
laion/CLIP-ViT-L-14-laion2B-s32B-b82K is a zero-shot image classification model trained with a contrastive objective that aligns images with natural-language text. Because images and text share one embedding space, it can classify images into arbitrary categories without task-specific fine-tuning.
The model is a CLIP ViT-L/14 variant trained with OpenCLIP on LAION-2B, the 2-billion-sample English subset of LAION-5B. Training ran on 384 A100 GPUs for 160 virtual epochs, processing 32 billion samples in total. Early training in float16 AMP was unstable; switching to float32 precision (with tf32 matmuls) resolved the issue, and training continued with a global batch size of 86k. The model achieves 75.3% zero-shot top-1 accuracy on ImageNet-1k, placing it among the strongest open CLIP models for the ViT-L/14 architecture.
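The training figures above imply a few derived quantities. The sketch below is back-of-the-envelope arithmetic only: it treats "2 billion" and "86k" as round numbers, and the step count assumes the whole run used the 86k batch size, which the text does not claim for the early float16 phase.

```python
# Figures quoted in the paragraph above (rounded as stated there).
TOTAL_SAMPLES = 32_000_000_000   # 32 billion samples seen
VIRTUAL_EPOCHS = 160
DATASET_SIZE = 2_000_000_000     # "2 billion" pairs, treated as approximate
NUM_GPUS = 384
GLOBAL_BATCH = 86_000            # "86k", treated as approximate

# Each virtual epoch covers only a slice of the dataset.
samples_per_virtual_epoch = TOTAL_SAMPLES // VIRTUAL_EPOCHS

# Equivalent number of full passes over the 2B-pair dataset.
full_dataset_passes = TOTAL_SAMPLES / DATASET_SIZE

# Per-GPU share of the global batch.
per_gpu_batch = GLOBAL_BATCH / NUM_GPUS

# Rough estimate: assumes every step used the 86k global batch.
approx_steps = TOTAL_SAMPLES / GLOBAL_BATCH

print(f"samples per virtual epoch: {samples_per_virtual_epoch:,}")
print(f"full passes over the dataset: {full_dataset_passes:.0f}")
print(f"per-GPU batch size: ~{per_gpu_batch:.0f}")
print(f"optimizer steps at 86k: ~{approx_steps:,.0f}")
```

Under these assumptions, one virtual epoch is 200 million samples, so the 160 virtual epochs correspond to roughly 16 passes over the data.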
Key Capabilities
- Zero-shot image classification: classify images using arbitrary natural-language labels without retraining.
- Image and text retrieval: search for images via text queries or vice versa.
- Strong generalization across diverse visual domains, as evaluated on the VTAB+ benchmark suite (including robustness datasets) and COCO/Flickr for retrieval tasks.
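The capabilities above both reduce to comparing embeddings in a shared image-text space. The following is a minimal sketch of that comparison, not the model's actual inference code: the 3-dimensional vectors are made-up stand-ins for real CLIP embeddings, the function names are hypothetical, and the `logit_scale` of 100 is an assumption based on the temperature commonly used with CLIP models.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, label_embs, logit_scale=100.0):
    """Return a {label: probability} dict for one image embedding."""
    labels = list(label_embs)
    logits = [logit_scale * cosine_similarity(image_emb, label_embs[name])
              for name in labels]
    return dict(zip(labels, softmax(logits)))


def rank_images(text_emb, image_embs):
    """Return image names sorted from most to least similar to the text."""
    return sorted(image_embs,
                  key=lambda name: cosine_similarity(text_emb, image_embs[name]),
                  reverse=True)


# Toy embeddings (illustrative values only, not model outputs).
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embs = {
    "cat.jpg": [0.8, 0.3, 0.1],
    "dog.jpg": [0.2, 0.9, 0.1],
    "car.jpg": [0.1, 0.0, 0.95],
}

# Classification: pick the label whose text embedding is closest to the image.
probs = zero_shot_classify(image_embs["cat.jpg"], label_embs)
best_label = max(probs, key=probs.get)
print(best_label)

# Retrieval: rank images against a text query embedding.
ranking = rank_images(label_embs["a photo of a dog"], image_embs)
print(ranking)
```

Changing the candidate labels only requires new text embeddings, which is why no retraining is needed for new categories.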
Training and Data
- Training data: LAION-2B English subset (2 billion image-text pairs from public web crawls). The dataset is uncurated; research use is recommended.
- Hardware: 96 nodes × 4× A100 GPUs on the JUWELS Booster supercomputer (Jülich Supercomputing Centre), funded by the Gauss Centre for Supercomputing.
- Architecture: ViT-L/14 (224×224 resolution, 14×14 patch size) with a text transformer encoder.
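The resolution and patch size listed above determine how many tokens the image encoder processes per image. A small sketch of that arithmetic follows; the helper name is hypothetical, and the extra class token is an assumption based on the standard ViT design.

```python
def vit_token_count(image_size, patch_size, class_token=True):
    """Number of tokens a ViT sees for a square image.

    Assumes non-overlapping square patches and, optionally, one
    extra class token prepended to the patch sequence.
    """
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + (1 if class_token else 0)


patches_only = vit_token_count(224, 14, class_token=False)
with_class_token = vit_token_count(224, 14)
print(patches_only, with_class_token)
```

For 224×224 input and 14×14 patches, this gives a 16×16 grid of patches.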
This model is especially suited for applications requiring flexible, prompt-based classification or retrieval without the overhead of training custom classifiers. It is hosted as a managed, OpenAI-compatible API on gigarouter for easy integration.
best for
- Zero-shot image classification on arbitrary categories
- Image-text retrieval (search images by text or vice versa)
- Fine-tuning or linear-probe classification for downstream image tasks
FAQ
What is CLIP ViT-L/14 LAION-2B best for?
It is best for zero-shot image classification and image-text retrieval without task-specific training data.
What inputs does the API accept?
The API accepts image URLs or base64-encoded images, plus text prompts for classification or retrieval.
How do I call the model?
Use the OpenAI-compatible endpoint with your API key and specify the model name CLIP-ViT-L-14-laion2B-s32B-b82K.
What license is the model released under?
The model is released under the MIT license.
How was the model trained, and how accurate is it?
It was trained on the LAION-2B dataset using OpenCLIP and achieves 75.3% zero-shot top-1 accuracy on ImageNet-1k, comparable to the original OpenAI CLIP ViT-L/14.
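As a rough illustration of the input formats the FAQ mentions, the sketch below base64-encodes image bytes and assembles a request body. The request schema is not documented here, so the field names `image` and `labels` are purely hypothetical placeholders, as is the helper `build_request`; only the model name comes from the text above. No network call is made.

```python
import base64
import json

MODEL_NAME = "CLIP-ViT-L-14-laion2B-s32B-b82K"  # model name given in the FAQ


def image_to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_request(image_bytes, labels):
    """Assemble a JSON-serializable request body.

    The "image" and "labels" keys are hypothetical; check the
    provider's API reference for the real schema.
    """
    return {
        "model": MODEL_NAME,
        "image": image_to_data_url(image_bytes),  # hypothetical field
        "labels": list(labels),                   # hypothetical field
    }


# Placeholder bytes standing in for a real image file's contents.
fake_image = b"\x89PNG\r\n\x1a\n"
payload = build_request(fake_image, ["a photo of a cat", "a photo of a dog"])
body = json.dumps(payload)
print(body[:60])
```

An image URL could be passed in place of the data URL if the endpoint accepts it, as the FAQ suggests.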
We're benchmarking and onboarding CLIP ViT-L/14 LAION-2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.