
CLIP ViT-L/14 LAION-2B

laion/CLIP-ViT-L-14-laion2B-s32B-b82K

published Sep 2022 · updated Jan 2024

CLIP ViT-L/14 LAION-2B is a zero-shot image model that learns visual concepts from natural-language supervision, enabling image classification and retrieval without task-specific training.

est. price: ~$0.094 / 1k images (estimated, set at launch)
API providers: 0
downloads / mo: 373.8K
license: MIT

specs

Task: Zero-shot image classification and image-text retrieval
Architecture: ViT-L/14 with 428M parameters
Parameters: 428M
License: MIT

about this model

laion/CLIP-ViT-L-14-laion2B-s32B-b82K is a zero-shot image classification model. It was trained with a contrastive objective that aligns images and natural-language text in a shared embedding space, so it can classify images into arbitrary categories without task-specific fine-tuning.

The model is a CLIP ViT-L/14 variant trained on the 2-billion-sample English subset of LAION-5B using OpenCLIP. Training was conducted on 384 A100 GPUs over 160 virtual epochs, processing a total of 32 billion samples. Early training encountered instability in float16 AMP; switching to float32 precision (with tf32 matmuls) resolved the issue and training continued with a global batch size of 86k. The model achieves a 75.3% zero-shot top-1 accuracy on ImageNet-1k, placing it among the strongest open CLIP models for the ViT-L/14 architecture.
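The training figures above can be cross-checked with some quick arithmetic. The sketch below derives the implied samples per virtual epoch, the per-GPU batch size, and a rough optimizer step count. It assumes "86k" means roughly 86,000 and that the batch size was constant for the whole run, which is a simplification, so the step count is only an estimate.

```python
# Figures quoted in the paragraph above.
total_samples = 32_000_000_000   # 32 billion samples seen
virtual_epochs = 160
num_gpus = 384
global_batch = 86_000            # assumption: "86k" taken as roughly 86,000

# Samples drawn in each virtual epoch.
samples_per_epoch = total_samples // virtual_epochs

# Share of the global batch handled by each GPU.
per_gpu_batch = global_batch / num_gpus

# Assumption: constant batch size for the entire run (a simplification).
approx_steps = total_samples / global_batch

print(f"samples per virtual epoch: {samples_per_epoch:,}")
print(f"per-GPU batch size: about {per_gpu_batch:.0f}")
print(f"optimizer steps: about {approx_steps:,.0f}")
```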

Key Capabilities

  • Zero-shot image classification: classify images using arbitrary natural-language labels without retraining.
  • Image and text retrieval: search for images via text queries or vice versa.
  • Strong generalization across diverse visual domains, as evaluated on the VTAB+ benchmark suite (including robustness datasets) and COCO/Flickr for retrieval tasks.
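To make the zero-shot mechanism concrete, here is a minimal sketch of how classification typically works once image and label embeddings exist: compare them by cosine similarity, then apply a softmax. The three-dimensional vectors are toy stand-ins for real encoder outputs, the function names are hypothetical, and the `logit_scale` of 100 is an assumed value rather than one read from this checkpoint.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def zero_shot_classify(image_emb, label_embs, logit_scale=100.0):
    """Return {label: probability} via softmax over scaled cosine similarities.

    logit_scale is an assumed temperature, not a value taken from the model.
    """
    logits = {label: logit_scale * cosine(image_emb, emb)
              for label, emb in label_embs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}

# Toy embeddings standing in for the image and text encoder outputs.
image_emb = [0.9, 0.1, 0.2]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.1],
    "a photo of a car": [0.1, 0.1, 1.0],
}

probs = zero_shot_classify(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Retrieval follows the same idea: embed one text query, score it against many image embeddings, and sort by similarity instead of applying a softmax over labels.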

Training and Data

  • Training data: LAION-2B English subset (2 billion image-text pairs from public web crawls). The dataset is uncurated; research use is recommended.
  • Hardware: 96 nodes × 4× A100 GPUs on the JUWELS Booster supercomputer (Jülich Supercomputing Centre), funded by the Gauss Centre for Supercomputing.
  • Architecture: ViT-L/14 (224×224 resolution, 14×14 patch size) with a text transformer encoder.
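The resolution and patch size listed above determine how many tokens the vision encoder processes per image. A short sketch of that arithmetic, assuming the standard ViT layout with one class token prepended to the patch sequence:

```python
image_size = 224   # pixels per side, from the architecture line above
patch_size = 14    # pixels per patch side

# Number of non-overlapping patches along each side of the image.
patches_per_side = image_size // patch_size

# Total patches in the square grid.
num_patches = patches_per_side ** 2

# Assumption: one extra class token, as in the standard ViT design.
sequence_length = num_patches + 1

print(patches_per_side, num_patches, sequence_length)
```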

This model is especially suited for applications requiring flexible, prompt-based classification or retrieval without the overhead of training custom classifiers. gigarouter plans to host it as a managed, OpenAI-compatible API; it is not yet live.


FAQ

What is this model best used for?

It is best for zero-shot image classification and image-text retrieval without task-specific training data.

What input formats does the API accept?

The API accepts image URLs or base64-encoded images and text prompts for classification or retrieval.
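For the base64 option, a minimal sketch of encoding raw image bytes with the Python standard library. The helper names are hypothetical, and whether the API expects a bare base64 string or a `data:` URL is an assumption to verify against the gigarouter documentation once the model is live.

```python
import base64

def to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")

def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Wrap the base64 string in a data URL (one common way to inline images)."""
    return f"data:{mime_type};base64,{to_base64(image_bytes)}"

# The 8-byte PNG signature stands in for a real image file here.
# For a file on disk, read it with: open(path, "rb").read()
sample = b"\x89PNG\r\n\x1a\n"
encoded = to_base64(sample)
data_url = to_data_url(sample)
print(data_url)
```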

How do I call this model via the gigarouter API?

Once the model is live, use the OpenAI-compatible endpoint with your API key and specify the model name CLIP-ViT-L-14-laion2B-s32B-b82K.

What is the model's license?

The model is released under the MIT license.

How does this model compare to the original OpenAI CLIP ViT-L/14?

It uses the same ViT-L/14 architecture but was trained on the open LAION-2B dataset using OpenCLIP. It achieves 75.3% zero-shot top-1 accuracy on ImageNet-1k, which is comparable to the original.

not yet live

We're benchmarking and onboarding CLIP ViT-L/14 LAION-2B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
