CLIP ViT-B/32 (LAION-2B)

laion/CLIP-ViT-B-32-laion2B-s34B-b79K

published Sep 2022 · updated Jan 2025

CLIP ViT-B/32 (LAION-2B) is a zero-shot image model that classifies images and retrieves images or text, using visual concepts it learned from natural-language supervision.

est. price

~$0.047 / 1k images · estimated, set at launch

license

mit

specs

Task	Zero-shot image classification and retrieval
Architecture	ViT-B/32 (Vision Transformer)
Training Data	LAION-2B English subset (2B image-text pairs)
Top-1 Accuracy (ImageNet-1k)	66.6%

about this model

laion/CLIP-ViT-B-32-laion2B-s34B-b79K is a zero-shot image classification model based on the CLIP ViT-B/32 architecture. It is trained on the 2-billion-sample English subset of LAION-5B (LAION-2B) using OpenCLIP, with training compute provided by stability.ai.

Capabilities

The model performs zero-shot image classification by matching an image against arbitrary text prompts, without task-specific fine-tuning. It also supports image-text retrieval and can serve as a foundation for downstream tasks such as linear probe classification or image generation conditioning.
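The matching step described above can be sketched in a few lines. This is a minimal illustration, not the model's actual implementation: the three-dimensional embeddings are made-up toy values (the real model produces much larger vectors), and the `logit_scale` of 100 is an assumption borrowed from common CLIP usage examples.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text similarities (logit_scale is an assumed value)."""
    logits = [logit_scale * cosine_similarity(image_emb, t) for t in text_embs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings standing in for encoder outputs (hypothetical values).
labels = ["cat", "dog", "car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
predicted = labels[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Because the candidate labels are just text, swapping in a different label set requires no retraining, only new text embeddings.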

Benchmark Performance

On ImageNet-1k, the model achieves 66.6% zero-shot top-1 accuracy. It was also evaluated for classification on the VTAB+ benchmark suite (a combination of the Visual Task Adaptation Benchmark and additional robustness datasets) and for retrieval on COCO and Flickr. Detailed per-dataset results are available in the LAION CLIP Benchmark repository.
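For readers unfamiliar with the metric, top-1 accuracy is the fraction of images whose highest-scoring class matches the ground-truth label. The sketch below shows the computation on three made-up score rows; the function name and the data are illustrative and are not taken from the LAION benchmark code.

```python
def top1_accuracy(score_rows, true_labels):
    """Fraction of rows whose highest-scoring class index equals the true label."""
    if len(score_rows) != len(true_labels):
        raise ValueError("score_rows and true_labels must have the same length")
    if not score_rows:
        raise ValueError("at least one example is required")
    correct = 0
    for scores, truth in zip(score_rows, true_labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == truth:
            correct += 1
    return correct / len(true_labels)

# Hypothetical per-class scores for three images and their true class indices.
scores = [
    [0.7, 0.2, 0.1],  # predicted class 0, true class 0 (correct)
    [0.1, 0.6, 0.3],  # predicted class 1, true class 1 (correct)
    [0.5, 0.3, 0.2],  # predicted class 0, true class 2 (wrong)
]
truth = [0, 1, 2]

accuracy = top1_accuracy(scores, truth)
print(f"top-1 accuracy: {accuracy:.3f}")
```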

Training and Dataset

The model was trained by Romain Beaumont on the LAION-2B dataset, an uncurated web-scale collection of English image-text pairs. The model is a research output, and users should keep the dataset’s uncurated nature in mind; a safety-filtered subset of the dataset is available. The model is intended for research and constrained experimental use, not for untested commercial deployment.

Attribution

If using this model, cite the OpenAI CLIP paper, the OpenCLIP software, and the forthcoming LAION-5B paper (links provided in the full model card).

best for

· Zero-shot image classification without fine-tuning
· Image-text retrieval (e.g., search images by text descriptions)
· Linear probe classification for custom datasets
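
As an illustration of the retrieval use case, the sketch below ranks a small set of images against a text query by cosine similarity. It assumes the embeddings have already been computed by the model; the file names and three-dimensional vectors are hypothetical toy values, and `rank_images` is an illustrative helper rather than part of any library.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def rank_images(query_emb, image_embs, top_k=3):
    """Return the top_k (name, similarity) pairs, most similar first."""
    query = normalize(query_emb)
    scored = []
    for name, emb in image_embs.items():
        unit = normalize(emb)
        similarity = sum(a * b for a, b in zip(query, unit))
        scored.append((similarity, name))
    scored.sort(reverse=True)
    return [(name, similarity) for similarity, name in scored[:top_k]]

# Hypothetical precomputed image embeddings (toy values).
image_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.2, 0.1, 0.9],
}
# Hypothetical embedding of a text query such as "a sunny beach".
query_embedding = [0.8, 0.2, 0.0]

results = rank_images(query_embedding, image_index, top_k=2)
for name, similarity in results:
    print(f"{name}: {similarity:.3f}")
```

The same function works for text retrieval from an image query if the index holds text embeddings instead, since both encoders map into a shared space.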

FAQ

What is the input format for this model?

It accepts text prompts and images as input. For classification, provide candidate class names and an image; the model outputs similarity scores.
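
For local experimentation, the same inputs can be passed through the OpenCLIP library. The sketch below is one possible way to do this and assumes `open_clip_torch`, `torch`, and `Pillow` are installed and the weights can be downloaded from the Hugging Face Hub. The prompt template "a photo of a {}" and the scale factor of 100 are common conventions assumed here, and `build_prompts` and `classify` are illustrative names. This is not the hosted API.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Turn bare class names into text prompts (the template is an assumption)."""
    return [template.format(name) for name in class_names]

def classify(image_path, class_names):
    """Return a {class_name: probability} dict for one image."""
    # Heavy dependencies are imported here so build_prompts works without them.
    import torch
    import open_clip
    from PIL import Image

    repo = "hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
    model, _, preprocess = open_clip.create_model_and_transforms(repo)
    tokenizer = open_clip.get_tokenizer(repo)
    model.eval()

    image = preprocess(Image.open(image_path)).unsqueeze(0)  # add batch dimension
    text = tokenizer(build_prompts(class_names))

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Normalize so the dot product is a cosine similarity.
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    return dict(zip(class_names, probs[0].tolist()))

# Example call (requires a local image file and a model download):
# print(classify("photo.jpg", ["cat", "dog", "car"]))
```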

How do I call this model via the API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with your API key. Send a request with the model name "CLIP-ViT-B-32-laion2B-s34B-b79K" along with the image and candidate text inputs.

What is the model's accuracy on ImageNet?

It achieves 66.6% zero-shot top-1 accuracy on ImageNet-1k.

Can I use this model for commercial applications?

According to its model card, any deployed use case, commercial or not, is currently out of scope. It is intended as a research output.

not yet live

We're benchmarking and onboarding CLIP ViT-B/32 (LAION-2B) as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

clip-vit-large-patch14-336

siglip-so400m-patch14-384

1.8M dl/mo