CLIP ViT-bigG/14
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
published Jan 2023 · updated Jan 2025
CLIP ViT-bigG/14 is a zero-shot image model that maps images and text to a shared embedding space for classification and retrieval without task-specific training.
specs
| Spec | Value |
|---|---|
| Task | Zero-shot Image Classification & Retrieval |
| Architecture | ViT-bigG/14 Transformer |
| Parameters | 2.54B |
| License | MIT |
about this model
CLIP ViT-bigG/14 is a zero-shot image classification model that maps images and text into a shared embedding space, enabling classification and retrieval without task-specific fine-tuning. It was trained with OpenCLIP on LAION-2B, the English subset of the LAION-5B dataset (approximately 2 billion image-text pairs), and then fine-tuned on LAION-A, a deduplicated 900M-sample subset filtered by aesthetic score.
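To illustrate how a shared embedding space enables zero-shot classification, here is a minimal sketch that scores one image embedding against several text-prompt embeddings using cosine similarity and a softmax. The 3-dimensional vectors are made up for illustration (real embeddings are much larger), and the helper names and the `logit_scale` value of 100 are assumptions rather than details taken from this model.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_classify(image_emb, text_embs, logit_scale=100.0):
    """Return one probability per text prompt for a single image embedding.

    logit_scale is an assumed temperature; CLIP-style models learn this value.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Numerically stable softmax over the prompts.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings, invented for this example.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best, [round(p, 4) for p in probs])
```

The class list is just a list of prompts, so changing the categories means changing the text, with no retraining step.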
Model architecture
The model uses a Vision Transformer (ViT) at the “bigG” scale with a patch size of 14. It has 2,539.57 million parameters in total and a compute cost of 1,065.36 billion FLOPs.
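As a rough back-of-the-envelope check on these figures, the sketch below estimates how many patch tokens a patch size of 14 produces and how much memory the weights alone occupy. The 224×224 input resolution is an assumption (it is not stated here), and the memory figures cover raw weights only, not activations or framework overhead.

```python
PARAMS = 2_539.57e6   # parameter count stated above
PATCH_SIZE = 14       # patch size stated above
IMAGE_SIZE = 224      # assumption: input resolution is not given in the text

def patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side * per_side

def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

tokens = patch_tokens(IMAGE_SIZE, PATCH_SIZE)
fp32_gb = weight_memory_gb(PARAMS, 4)  # 4 bytes per float32 weight
fp16_gb = weight_memory_gb(PARAMS, 2)  # 2 bytes per float16 weight

print(f"{tokens} patch tokens")
print(f"{fp32_gb:.2f} GB in float32, {fp16_gb:.2f} GB in float16")
```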
Evaluation results
On ImageNet-1k, the model achieves 80.1% zero-shot top-1 accuracy. Its average zero-shot top-1 accuracy across 38 diverse datasets (VTAB+ benchmark suite) is 66.67%. Representative per-dataset results include:
| Dataset | Zero-shot top-1 accuracy |
|---|---|
| ImageNet-1k | 80.1% |
| ImageNet-R | 92.13% |
| ImageNet-A | 69.33% |
| ImageNet v2 | 73.59% |
| ObjectNet | 72.84% |
| CIFAR-100 | 87.52% |
| Stanford Cars | 94.60% |
| Oxford-IIIT Pet | 95.29% |
| Food-101 | 93.09% |
| EuroSAT | 69.19% |
| MSCOCO retrieval (image-to-text) | 59.38% |
| Flickr retrieval (image-to-text) | 86.23% |
Additional benchmarks cover visual reasoning, texture, satellite imagery, and medical domains (e.g., PatchCamelyon, Camelyon17, iWildCam). The model is intended for research use and supports zero-shot classification, image-text retrieval, and linear probe or fine-tuning for downstream tasks.
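For readers unfamiliar with the metric in the table above, this is a minimal sketch of how zero-shot top-1 accuracy can be computed from per-class scores. The scores and labels are toy values, not benchmark data, and `top1_accuracy` is a hypothetical helper name.

```python
def top1_accuracy(score_rows, true_labels):
    """Fraction of samples whose highest-scoring class matches the true label."""
    if not score_rows or len(score_rows) != len(true_labels):
        raise ValueError("need one non-empty score row per label")
    correct = 0
    for scores, truth in zip(score_rows, true_labels):
        # Index of the class with the largest score.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == truth:
            correct += 1
    return correct / len(true_labels)

# One row of class scores per image (toy values).
scores = [
    [0.7, 0.2, 0.1],  # predicts class 0
    [0.1, 0.8, 0.1],  # predicts class 1
    [0.3, 0.4, 0.3],  # predicts class 1
    [0.2, 0.2, 0.6],  # predicts class 2
]
labels = [0, 1, 2, 2]

accuracy = top1_accuracy(scores, labels)
print(f"{accuracy:.2%}")
```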
best for
- Zero-shot classification on arbitrary image categories
- Image-text retrieval and similarity search
- Fine-tuning or linear probing for downstream vision tasks
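For the retrieval and similarity-search use case, one common pattern is to embed the gallery once, normalize it, and rank items by dot product against a normalized query embedding. The sketch below assumes the embeddings are already computed. The 2-dimensional vectors are invented for illustration, and `build_index` and `search` are hypothetical helper names.

```python
import math

def normalize(vec):
    """Return a unit-length copy of vec."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def build_index(embeddings):
    """Normalize gallery embeddings once so each search needs only dot products."""
    return [normalize(v) for v in embeddings]

def search(index, query, k=2):
    """Return the k best (item_index, cosine_similarity) pairs, best first."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, item)), i)
        for i, item in enumerate(index)
    ]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:k]]

# Toy image embeddings and a toy text-query embedding.
gallery = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
query = [0.9, 0.3]

index = build_index(gallery)
hits = search(index, query, k=2)
print(hits)
```

A linear scan like this is fine for small galleries. Larger collections typically use an approximate nearest-neighbor index instead.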
FAQ
CLIP ViT-bigG/14 excels at zero-shot image classification and image-text retrieval, and can be fine-tuned for custom vision tasks.
The model has approximately 2.54 billion parameters.
The model is released under the MIT license.
To call the model, use the gigarouter OpenAI-compatible endpoint with your API key. Pass an image and a list of text prompts to get similarity scores.
The model accepts images (URL or base64) and text prompts. The output is one similarity score per text prompt.
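As a sketch of preparing these inputs, the snippet below base64-encodes image bytes and packages them with a list of prompts. The request schema is not documented here, so the field names (`model`, `image`, `texts`), the use of a data URL, and the model identifier are all assumptions for illustration. Check the endpoint's own documentation for the real format.

```python
import base64
import json

def to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def build_request_body(image, prompts,
                       model="laion/CLIP-ViT-bigG-14-laion2B-39B-b160k"):
    """Package an image (URL or data URL) and text prompts into a JSON-ready dict.

    The field names here are hypothetical placeholders, not a documented schema.
    """
    if not prompts:
        raise ValueError("at least one text prompt is required")
    return {"model": model, "image": image, "texts": list(prompts)}

# Only the 8-byte PNG signature, standing in for a real image file.
fake_image = b"\x89PNG\r\n\x1a\n"

body = build_request_body(to_data_url(fake_image), ["a cat", "a dog"])
payload = json.dumps(body)
print(payload)
```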
We're benchmarking and onboarding CLIP ViT-bigG/14 as a hosted, OpenAI-compatible API.