CLIP ViT-bigG/14

laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

published Jan 2023 · updated Jan 2025

CLIP ViT-bigG/14 is a zero-shot image classification and retrieval model that maps images and text into a shared embedding space, so it can classify and retrieve images without task-specific training.

status

coming soon

API providers

downloads / mo

100.7K

license

mit

specs

Task	Zero-shot Image Classification & Retrieval
Architecture	ViT-bigG/14 Transformer
Parameters	2.54B
License	MIT

about this model

CLIP ViT-bigG/14 is a zero-shot image classification model that maps images and text into a shared embedding space, enabling classification and retrieval without task-specific fine-tuning. It was trained with OpenCLIP on LAION-2B, the English subset of the LAION-5B dataset (approximately 2 billion image-text pairs), and then fine-tuned on LAION-A, a 900M-sample subset that was filtered by aesthetic score and deduplicated.
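The shared embedding space is what makes zero-shot classification possible: each candidate label is written as a text prompt, and the prompt whose embedding lies closest to the image embedding wins. The sketch below illustrates that scoring step only. The 3-dimensional vectors are toy stand-ins for real model outputs, and the `logit_scale` of 100 is an assumed temperature, not a value taken from this page.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts."""
    img = normalize(image_emb)
    logits = []
    for emb in text_embs:
        txt = normalize(emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (assumption): real embeddings come from the model's encoders.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.2, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)
```

Because the labels are ordinary text, changing the category set only means changing the prompt list; no retraining is involved.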

Model architecture

The model uses a Vision Transformer (ViT) at the “bigG” scale with a patch size of 14, totaling 2,539.57 million parameters (about 2.54B) and 1,065.36 billion FLOPs.
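To make these numbers concrete, the sketch below works out how many patches a patch size of 14 produces and roughly how much storage the weights need. The 224-pixel input resolution is an assumption (it is not stated on this page), and the storage figures cover weights only, in decimal gigabytes.

```python
PARAMS = 2_539.57e6   # parameter count quoted above
PATCH_SIZE = 14       # pixels per patch side, from the model name
IMAGE_SIZE = 224      # assumed input resolution, not stated on this page

def patch_count(image_size, patch_size):
    """Number of non-overlapping square patches the image is split into."""
    per_side = image_size // patch_size
    return per_side * per_side

def weight_gigabytes(params, bytes_per_param):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return params * bytes_per_param / 1e9

print(patch_count(IMAGE_SIZE, PATCH_SIZE))       # 16 x 16 patches
print(round(weight_gigabytes(PARAMS, 2), 2))     # 16-bit weights
print(round(weight_gigabytes(PARAMS, 4), 2))     # 32-bit weights
```

Under these assumptions the image becomes 256 patches, and the weights take about 5.08 GB at 16-bit precision or 10.16 GB at 32-bit precision.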

Evaluation results

On ImageNet-1k, the model achieves 80.1% zero-shot top-1 accuracy. Its average zero-shot top-1 accuracy across 38 diverse datasets (VTAB+ benchmark suite) is 66.67%. Representative per-dataset results include:

Dataset	Zero-shot result
ImageNet-1k	80.1%
ImageNet-R	92.13%
ImageNet-A	69.33%
ImageNet v2	73.59%
ObjectNet	72.84%
CIFAR-100	87.52%
Stanford Cars	94.60%
Oxford-IIIT Pet	95.29%
Food-101	93.09%
EuroSAT	69.19%
MSCOCO retrieval (image-to-text)	59.38%
Flickr retrieval (image-to-text)	86.23%
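For readers unfamiliar with the metric, top-1 accuracy is the share of images whose highest-scoring class is the correct one. The following is a minimal sketch of that calculation; the score matrix and labels are hypothetical and unrelated to the benchmark results above.

```python
def top1_accuracy(score_rows, true_labels):
    """Fraction of samples whose highest-scoring class matches the true label."""
    correct = 0
    for scores, truth in zip(score_rows, true_labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == truth:
            correct += 1
    return correct / len(true_labels)

# Hypothetical similarity scores: 4 images (rows) x 3 classes (columns).
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
truth = [0, 1, 2, 1]  # the last image is misclassified as class 0

accuracy = top1_accuracy(scores, truth)
print(f"{accuracy:.2%}")
```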

Additional benchmarks cover visual reasoning, texture, satellite imagery, medical imaging (e.g., PatchCamelyon, Camelyon17), and wildlife camera-trap images (iWildCam). The model is intended for research use and supports zero-shot classification, image-text retrieval, and linear probing or fine-tuning for downstream tasks.

best for

· Zero-shot classification on arbitrary image categories
· Image-text retrieval and similarity search
· Fine-tuning or linear probing for downstream vision tasks
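Image-text retrieval and similarity search reduce to ranking stored embeddings by cosine similarity to a query embedding. Here is a minimal sketch of that ranking, assuming the image embeddings were computed ahead of time; the file names and 3-dimensional vectors are hypothetical placeholders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, index, k=2):
    """Return the k item names most similar to the query, best first."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical precomputed image embeddings keyed by file name.
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]  # stands in for the text embedding of a beach prompt

results = search(query, index)
print(results)
```

A full sort is fine for a handful of items; larger collections would typically use an approximate nearest-neighbor index instead.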

FAQ

What is CLIP ViT-bigG/14 best used for?

It excels at zero-shot image classification and image-text retrieval, and can be fine-tuned for custom vision tasks.

How many parameters does this model have?

It has approximately 2.54 billion parameters.

What is the license for this model?

The model is released under the MIT license.

How do I call this model via the gigarouter API?

The hosted API is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key, and pass an image and a list of text prompts to get a similarity score for each prompt.

What input formats does the model accept?

It accepts images (URL or base64) and text prompts. The output is a similarity score for each text prompt.
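As an illustration of the base64 input path, the sketch below encodes image bytes and bundles them with text prompts. The request schema is not documented on this page, so the field names `image` and `texts`, and the use of a data URL to carry the base64 string, are assumptions for illustration only.

```python
import base64
import json

def image_to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def build_payload(image, prompts):
    """Bundle one image reference and candidate prompts (hypothetical schema)."""
    return {
        "model": "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k",
        "image": image,           # hypothetical field name
        "texts": list(prompts),   # hypothetical field name
    }

# The 8-byte PNG signature stands in for the contents of a real image file.
fake_png = b"\x89PNG\r\n\x1a\n"
payload = build_payload(image_to_data_url(fake_png), ["a cat", "a dog"])
print(json.dumps(payload, indent=2))
```

An image URL could be passed in place of the data URL in the same position; the actual request shape should be taken from the API documentation once the model is live.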

not yet live

We're benchmarking and onboarding CLIP ViT-bigG/14 as a hosted, OpenAI-compatible API.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336