CLIP ViT-L-14 DataComp

laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K

published Apr 2023 · updated May 2023

CLIP ViT-L-14 DataComp is a zero-shot image-text model. It classifies and retrieves images by matching image embeddings against text embeddings, using a Vision Transformer (ViT-L/14) image encoder trained on the DataComp-1B dataset.

status

coming soon

API providers

downloads / mo

59K

license

MIT

specs

Task	Zero-shot image classification, image-text retrieval
Architecture	ViT-L/14
Training Data	DataComp-1B (1.4 billion image-text pairs)
ImageNet Accuracy	79.2% zero-shot top-1

about this model

CLIP ViT-L/14 (DataComp-1B) is a zero-shot image classification and retrieval model that embeds images and text into a shared space, enabling open-vocabulary recognition without task-specific fine-tuning. It uses a Vision Transformer (ViT-L/14) architecture and was trained on 1.4 billion image-text pairs from the DataComp-1B dataset, which was curated as part of the DataComp benchmark for dataset design.
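
A minimal sketch of how zero-shot classification works once images and label prompts live in the same embedding space. It does not call the model: the 3-dimensional vectors are toy stand-ins for real encoder outputs, the helper names are hypothetical, and the logit scale of 100 is an assumption based on the value CLIP-style models commonly use.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and each label prompt."""
    img = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (assumption): real ones come from the image and text encoders.
labels = ["a photo of a cat", "a photo of a dog"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_emb = [0.9, 0.1, 0.0]

probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)  # a photo of a cat
```

Changing the label list changes the classifier, which is why no task-specific fine-tuning is needed to add or remove categories.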

The model achieves a zero-shot top-1 accuracy of 79.2% on ImageNet-1k, outperforming OpenAI's CLIP ViT-L/14 (75.5%) by 3.7 percentage points under the same training procedure and compute budget. The result was obtained with the standardized training and evaluation pipeline defined by the DataComp benchmark, which evaluates models on 38 diverse downstream datasets spanning classification, retrieval, and other vision-language tasks.

Once available as a hosted API on gigarouter, the model can be used for zero-shot image classification, image-text retrieval, and as a foundation for linear probing or fine-tuning. It is a research-oriented model: deployed use cases require thorough domain-specific testing because performance varies across class taxonomies. Surveillance and facial recognition applications are explicitly out of scope.

best for

· Zero-shot image classification with custom categories
· Image-text retrieval for search and recommendation
· Building multimodal search engines
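
The retrieval and search use cases above come down to ranking stored image embeddings by their similarity to a text query embedding. The sketch below illustrates that ranking step only. The file names, the toy 3-dimensional vectors, and the `top_k` helper are hypothetical; a real index would hold embeddings produced by the model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_emb, index, k=2):
    """Return the k (item_id, score) pairs most similar to the query, best first."""
    scored = [(item_id, cosine(query_emb, emb)) for item_id, emb in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy image index (assumption): item id -> precomputed image embedding.
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
query_emb = [1.0, 0.0, 0.1]  # stand-in for the embedding of a text query

results = top_k(query_emb, index, k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

A linear scan like this is fine for small collections; larger search engines typically swap in an approximate nearest-neighbor index while keeping the same similarity measure.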

FAQ

What is CLIP ViT-L-14 DataComp?

It is a CLIP model with a ViT-L/14 architecture trained on the DataComp-1B dataset, enabling zero-shot image classification and text-image matching.

How does it compare to OpenAI's CLIP ViT-L/14?

It outperforms OpenAI's CLIP ViT-L/14 by 3.7 percentage points on ImageNet zero-shot accuracy (79.2% vs 75.5%) under the same training procedure and compute budget.

What training data was used?

The model was trained on DataComp-1B, a dataset of 1.4 billion image-text pairs curated from Common Crawl.

How can I use this model via the gigarouter API?

Once the model is live, you will be able to call it through an OpenAI-compatible endpoint. Provide your API key and send image and text inputs as required by the endpoint documentation.

What input formats does it accept?

The model accepts images (as URLs or base64-encoded) and text prompts for zero-shot classification or retrieval tasks.
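
For the base64 option, a minimal sketch of encoding raw image bytes with Python's standard library. Wrapping the result in a `data:` URL is an assumption about what the endpoint expects, and `to_data_url` is a hypothetical helper; check the endpoint documentation for the exact field names and payload shape.

```python
import base64

def to_data_url(image_bytes, mime_type="image/png"):
    """Base64-encode raw image bytes and wrap them in a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# In practice image_bytes would come from open(path, "rb").read().
example = to_data_url(b"abc")
print(example)  # data:image/png;base64,YWJj
```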

not yet live

We're benchmarking and onboarding CLIP ViT-L-14 DataComp as a hosted, OpenAI-compatible API.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336