CLIP ViT-L/14 LAION-2B

laion/CLIP-ViT-L-14-laion2B-s32B-b82K

published Sep 2022 · updated Jan 2024

CLIP ViT-L/14 LAION-2B is a zero-shot image model trained with natural-language supervision, enabling image classification and retrieval without task-specific training.

est. price

~$0.094 / 1k images (estimated, set at launch)
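
For rough budgeting, the estimated price can be scaled to a workload size. This is a minimal sketch that assumes the listed ~$0.094 per 1,000 images applies linearly; the function name is hypothetical, and the price is an estimate that may change at launch.

```python
# Estimated price from this listing: ~$0.094 per 1,000 images.
# Assumption: linear pricing with no minimums or volume tiers.
PRICE_PER_1K_IMAGES_USD = 0.094


def estimate_cost_usd(num_images: int) -> float:
    """Return the estimated cost in USD for processing num_images images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return num_images / 1000 * PRICE_PER_1K_IMAGES_USD


for n in (1_000, 250_000, 1_000_000):
    print(f"{n:>9,} images -> ~${estimate_cost_usd(n):,.2f}")
```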

API providers

downloads / mo

373.8K

license

MIT

specs

Task	Zero-shot image classification and image-text retrieval
Architecture	ViT-L/14 with 428M parameters
Parameters	428M
License	MIT

about this model

laion/CLIP-ViT-L-14-laion2B-s32B-b82K is a zero-shot image classification model trained with a contrastive objective that aligns images with natural-language text. Because labels are supplied as text at inference time, it can classify images into arbitrary categories without task-specific fine-tuning.

The model is a CLIP ViT-L/14 variant trained with OpenCLIP on the 2-billion-sample English subset of LAION-5B. Training ran on 384 A100 GPUs for 160 virtual epochs, for a total of 32 billion samples seen. Early training in float16 automatic mixed precision (AMP) was unstable; switching to float32 precision (with tf32 matmuls) resolved the issue, and training continued with a global batch size of 86k. The model achieves 75.3% zero-shot top-1 accuracy on ImageNet-1k, placing it among the strongest open CLIP models for the ViT-L/14 architecture.
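
The training figures above imply a few derived quantities. The sketch below is plain arithmetic on the stated numbers; it treats "86k" as exactly 86,000 and the dataset as exactly 2 billion pairs, both of which are rounding assumptions.

```python
# Figures taken from the description above.
total_samples = 32_000_000_000   # 32 billion samples seen
virtual_epochs = 160
num_gpus = 384
global_batch = 86_000            # "86k" as stated; assumed to be a rounded value
dataset_size = 2_000_000_000     # "2-billion-sample" subset; assumed exact here

# Samples drawn per virtual epoch.
samples_per_virtual_epoch = total_samples // virtual_epochs

# Approximate per-GPU batch size after the switch to float32.
per_gpu_batch = global_batch / num_gpus

# Roughly how many times the training run covers the dataset.
dataset_passes = total_samples / dataset_size

print(f"samples per virtual epoch: {samples_per_virtual_epoch:,}")
print(f"per-GPU batch size: ~{round(per_gpu_batch)}")
print(f"approximate passes over the data: {dataset_passes:.0f}")
```

A virtual epoch here is far smaller than the dataset, so each one covers only a fraction of the 2 billion pairs.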

Key Capabilities

Zero-shot image classification: classify images using arbitrary natural-language labels without retraining.
Image and text retrieval: search for images via text queries or vice versa.
Strong generalization across diverse visual domains, as evaluated on the VTAB+ benchmark suite (including robustness datasets) and COCO/Flickr for retrieval tasks.
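
To illustrate how zero-shot classification works once embeddings exist, here is a minimal sketch in pure Python. It assumes the image and each candidate label have already been encoded by the model; the 3-dimensional vectors, the prompt strings, and the `logit_scale` of 100 are illustrative assumptions, not values from this model.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, label_embs, logit_scale=100.0):
    """Return a {label: probability} dict from cosine similarity to each label.

    logit_scale is an assumed temperature; a real model supplies its own.
    """
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(text_emb)))
        for text_emb in label_embs.values()
    ]
    probs = softmax([logit_scale * s for s in sims])
    return dict(zip(label_embs.keys(), probs))


# Hypothetical toy embeddings standing in for real encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.1, 1.0, 0.0],
    "a photo of a car": [0.0, 0.2, 1.0],
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best = max(probs, key=probs.get)
print(best)
```

Changing the label set only requires encoding new text prompts; the image embedding and the scoring step stay the same, which is what makes the classification "zero-shot".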

Training and Data

Training data: LAION-2B English subset (2 billion image-text pairs from public web crawls). The dataset is uncurated; research use is recommended.
Hardware: 96 nodes × 4 A100 GPUs (384 GPUs total) on the JUWELS Booster supercomputer (Jülich Supercomputing Centre), with computing time funded by the Gauss Centre for Supercomputing.
Architecture: ViT-L/14 (224×224 resolution, 14×14 patch size) with a text transformer encoder.
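
The resolution and patch size determine how many tokens the vision encoder processes per image. The sketch below works this out; the extra class token is an assumption based on the standard ViT layout rather than something stated on this page.

```python
image_size = 224   # input resolution (pixels per side)
patch_size = 14    # patch edge length (pixels)

# Non-overlapping patches along one side, and in total.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

# Assumption: one class token is prepended, as in the standard ViT design.
sequence_length = num_patches + 1

print(f"{patches_per_side} x {patches_per_side} grid = {num_patches} patches")
print(f"vision sequence length (with class token): {sequence_length}")
```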

This model is especially suited to applications that need flexible, prompt-based classification or retrieval without the overhead of training custom classifiers. It is being onboarded as a managed, OpenAI-compatible API on gigarouter for easy integration.

best for

· Zero-shot image classification on arbitrary categories
· Image-text retrieval (search images by text or vice versa)
· Downstream image tasks via fine-tuning or linear-probe classification
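
As a sketch of the retrieval use case, text-to-image search reduces to ranking stored image embeddings by cosine similarity to a text query embedding. The file names and 3-dimensional vectors below are hypothetical placeholders for real encoder outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_emb, index, k=2):
    """Return the k (item_id, score) pairs most similar to the query."""
    scored = [(cosine(query_emb, emb), item_id) for item_id, emb in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(item_id, score) for score, item_id in scored[:k]]


# Hypothetical precomputed image embeddings keyed by file name.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text query.
text_query_embedding = [0.8, 0.3, 0.0]

results = top_k(text_query_embedding, image_index, k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

Image-to-text search works the same way with the roles swapped: index the text embeddings and query with an image embedding. For large collections, an approximate nearest-neighbor index would typically replace the linear scan.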

FAQ

What is this model best used for?

It is best for zero-shot image classification and image-text retrieval without task-specific training data.

What input formats does the API accept?

The API accepts images as URLs or base64-encoded data, along with text prompts for classification or retrieval.
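
A local image can be base64-encoded with the Python standard library. This is a minimal sketch: the helper names are hypothetical, and whether the API expects a raw base64 string or a `data:` URL is an assumption to check against the provider's documentation.

```python
import base64


def to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Wrap base64 data in a data URL (assumed format; the API may differ)."""
    return f"data:{mime_type};base64,{to_base64(image_bytes)}"


# Placeholder bytes (the PNG file signature) instead of a real image file.
# For a real file: image_bytes = open("photo.png", "rb").read()
sample_bytes = b"\x89PNG\r\n\x1a\n"

encoded = to_base64(sample_bytes)
data_url = to_data_url(sample_bytes)
print(data_url)
```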

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key and specify the model name CLIP-ViT-L-14-laion2B-s32B-b82K.
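
The sketch below builds, but does not send, such a request. Only the model name and the bearer-token style of OpenAI-compatible APIs come from this page; the base URL, the `/embeddings` path, and the payload shape are hypothetical placeholders, since the hosted endpoint is not documented here.

```python
import json
import urllib.request

# Hypothetical values: replace with the provider's documented base URL and path.
BASE_URL = "https://api.example.invalid/v1"
ENDPOINT = "/embeddings"  # assumption: an OpenAI-style embeddings route
MODEL = "CLIP-ViT-L-14-laion2B-s32B-b82K"


def build_request(api_key: str, inputs: list) -> urllib.request.Request:
    """Build a POST request with a bearer token and a JSON body (not sent)."""
    payload = {"model": MODEL, "input": inputs}  # assumed payload shape
    return urllib.request.Request(
        BASE_URL + ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request("YOUR_API_KEY", ["a photo of a cat"])
print(request.get_method(), request.full_url)
# To send it once the endpoint is available: urllib.request.urlopen(request)
```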

What is the model's license?

The model is released under the MIT license.

How does this model compare to the original OpenAI CLIP ViT-L/14?

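It uses the same ViT-L/14 architecture but was trained with OpenCLIP on the openly available LAION-2B dataset rather than on OpenAI's private training data. It achieves 75.3% zero-shot top-1 accuracy on ImageNet-1k, comparable to the original.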

not yet live

We're benchmarking and onboarding CLIP ViT-L/14 LAION-2B as a hosted, OpenAI-compatible API.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336