CLIP ViT-bigG/14

laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

published Jan 2023 · updated Jan 2025

CLIP ViT-bigG/14 is a zero-shot image model that maps images and text to a shared embedding space for classification and retrieval without task-specific training.

status: coming soon
API providers: 0
downloads / mo: 100.7K
license: MIT

specs

Task: Zero-shot Image Classification & Retrieval
Architecture: ViT-bigG/14 Transformer
Parameters: 2.54B
License: MIT

about this model

CLIP ViT-bigG/14 is a zero-shot image classification model that maps images and text into a shared embedding space, enabling classification and retrieval without task-specific fine-tuning. It was trained on the LAION-2B English subset of the LAION-5B dataset (approximately 2 billion image-text pairs) using OpenCLIP, with additional fine-tuning on a 900M-sample subset filtered by aesthetic score and deduplicated (LAION-A).
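The shared embedding space is what makes zero-shot classification possible: each candidate label is written as a text prompt, and the image is assigned to the prompt whose embedding lies closest to its own. The sketch below illustrates only that scoring step. The 3-dimensional embeddings are made up for illustration, and `logit_scale=100.0` is an assumed temperature (CLIP-style models learn this value); in practice both embeddings would come from the model's image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, text_embeddings, logit_scale=100.0):
    """Return one probability per candidate prompt for a single image.

    logit_scale is an assumed temperature, not a value taken from this page.
    """
    image = l2_normalize(image_embedding)
    sims = []
    for text in text_embeddings:
        unit_text = l2_normalize(text)
        sims.append(sum(a * b for a, b in zip(image, unit_text)))
    return softmax([logit_scale * s for s in sims])

# Toy embeddings, invented for illustration; real embeddings are much larger.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, text_embeddings)
best = max(range(len(labels)), key=lambda i: probs[i])
print(labels[best])
```

No label-specific training happens here: changing the set of classes only means changing the list of prompts.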

Model architecture

The image encoder is a Vision Transformer (ViT) at the “bigG” scale with a patch size of 14. The model totals 2,539.57 million parameters (about 2.54B) and 1,065.36 billion FLOPs.
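As a rough worked example of what these numbers imply, the arithmetic below derives the number of image patches and the storage needed for the weights alone. The 224-pixel input resolution is an assumption, since this section does not state it. The memory figures cover only the weights, not activations or framework overhead.

```python
PATCH_SIZE = 14            # from the model name: ViT-bigG/14
PARAMS = 2_539.57e6        # parameter count stated above
IMAGE_SIZE = 224           # ASSUMPTION: input resolution is not stated in this section

# A ViT splits the image into a grid of non-overlapping square patches.
patches_per_side = IMAGE_SIZE // PATCH_SIZE
num_patches = patches_per_side ** 2

def weight_gigabytes(num_params, bytes_per_param):
    """Approximate storage for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

fp32_gb = weight_gigabytes(PARAMS, 4)   # 4 bytes per 32-bit float
fp16_gb = weight_gigabytes(PARAMS, 2)   # 2 bytes per 16-bit float

print(num_patches)                          # 256 under the 224-pixel assumption
print(round(fp32_gb, 2), round(fp16_gb, 2)) # 10.16 5.08
```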

Evaluation results

On ImageNet-1k, the model achieves 80.1% zero-shot top-1 accuracy. Its average zero-shot top-1 accuracy across 38 diverse datasets (VTAB+ benchmark suite) is 66.67%. Representative per-dataset results include:

Dataset                             Zero-shot top-1 accuracy
ImageNet-1k                         80.1%
ImageNet-R                          92.13%
ImageNet-A                          69.33%
ImageNet v2                         73.59%
ObjectNet                           72.84%
CIFAR-100                           87.52%
Stanford Cars                       94.60%
Oxford-IIIT Pet                     95.29%
Food-101                            93.09%
EuroSAT                             69.19%
MSCOCO retrieval (image-to-text)    59.38%
Flickr retrieval (image-to-text)    86.23%
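For readers unfamiliar with the metrics, here is a minimal sketch of how scores of this kind are typically computed. Top-1 accuracy is the share of images whose highest-scoring label is correct. Retrieval benchmarks are commonly scored with recall at k, but this page does not say which cutoff the retrieval rows use, so `k` below is an assumption. All data in the example is made up.

```python
def top1_accuracy(predicted, actual):
    """Fraction of examples whose highest-scoring label matches the ground truth."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

def recall_at_k(ranked_candidates, relevant, k):
    """Fraction of queries with at least one relevant item among the top k results."""
    hits = 0
    for ranking, truth in zip(ranked_candidates, relevant):
        if any(item in truth for item in ranking[:k]):
            hits += 1
    return hits / len(ranked_candidates)

# Made-up predictions for four images: three of four are correct.
predicted = ["cat", "dog", "car", "dog"]
actual = ["cat", "dog", "car", "cat"]
accuracy = top1_accuracy(predicted, actual)
print(f"{accuracy:.2%}")  # 75.00%

# Made-up image-to-text retrieval: caption ids ranked per image query.
ranked = [[3, 1, 2], [0, 2, 1]]
relevant = [{1}, {1}]
recall = recall_at_k(ranked, relevant, k=2)
print(recall)  # 0.5
```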

Additional benchmarks cover visual reasoning, texture, satellite imagery, and medical domains (e.g., PatchCamelyon, Camelyon17, iWildCam). The model is intended for research use and supports zero-shot classification, image-text retrieval, and linear probe or fine-tuning for downstream tasks.

FAQ

What is CLIP ViT-bigG/14 best used for?

It excels at zero-shot image classification and image-text retrieval, and can be fine-tuned for custom vision tasks.

How many parameters does this model have?

It has approximately 2.54 billion parameters.

What is the license for this model?

The model is released under the MIT license.

How do I call this model via the gigarouter API?

The model is not yet live on gigarouter. Once it is available, use the gigarouter OpenAI-compatible endpoint with your API key, passing an image and a list of text prompts to get similarity scores.

What input formats does the model accept?

It accepts images (URL or base64) and text prompts. The output is a similarity score for each text prompt.
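Since the hosted API is not live yet, its request schema is not documented here. The sketch below only shows how a local image could be base64-encoded and bundled with text prompts. The field names `image` and `texts` and the helper names `to_data_url` and `build_request` are hypothetical placeholders, not the gigarouter API. Only the model identifier comes from this page.

```python
import base64
import json

def to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def build_request(image, prompts):
    """Assemble a JSON body. The field names are HYPOTHETICAL placeholders."""
    return json.dumps({
        "model": "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k",
        "image": image,          # hypothetical field: an image URL or data URL
        "texts": list(prompts),  # hypothetical field: candidate text prompts
    })

# Only the 8-byte PNG signature, standing in for the contents of a real file.
image_bytes = b"\x89PNG\r\n\x1a\n"
data_url = to_data_url(image_bytes)
body = build_request(data_url, ["a photo of a cat", "a photo of a dog"])
print(body)
```

An image URL can be passed to `build_request` directly in place of the data URL. Check the published API reference for the real field names once the model is live.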

not yet live

We're benchmarking and onboarding CLIP ViT-bigG/14 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
