
CLIP ViT-L-14 DataComp

laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K

published Apr 2023 · updated May 2023

CLIP ViT-L-14 DataComp is a zero-shot image model that performs image classification and retrieval by matching images and text using a Vision Transformer backbone trained on the DataComp-1B dataset.

status: coming soon
API providers: 0
downloads / mo: 59K
license: mit

specs

Task: Zero-shot image classification, image-text retrieval
Architecture: ViT-L/14
Training Data: DataComp-1B (1.4 billion image-text pairs)
ImageNet Accuracy: 79.2% zero-shot top-1

about this model

CLIP ViT-L/14 (DataComp-1B) is a zero-shot image classification and retrieval model that embeds images and text into a shared space, enabling open-vocabulary recognition without task-specific fine-tuning. It uses a Vision Transformer (ViT-L/14) architecture and was trained on 1.4 billion image-text pairs from the DataComp-1B dataset, which was curated as part of the DataComp benchmark for dataset design.
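To make the shared-space idea concrete, here is a minimal sketch of zero-shot classification from precomputed embeddings: normalize both sides, score each label by cosine similarity, and apply a softmax. The 3-dimensional vectors are made up for illustration, and the `logit_scale` of 100.0 is an assumed value, not one read from this checkpoint. In practice the embeddings would come from the model's image and text encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_emb, text_embs, logit_scale=100.0):
    """Return softmax probabilities over candidate labels for one image.

    image_emb:   image embedding as a list of floats
    text_embs:   dict mapping label -> text embedding of a prompt
                 such as "a photo of a {label}"
    logit_scale: multiplier applied to cosine similarities (assumed value)
    """
    img = normalize(image_emb)
    logits = {}
    for label, emb in text_embs.items():
        txt = normalize(emb)
        logits[label] = logit_scale * sum(a * b for a, b in zip(img, txt))

    # Numerically stable softmax over the label logits.
    top = max(logits.values())
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy embeddings, invented for this example only.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "cat": [1.0, 0.0, 0.1],
    "dog": [0.1, 1.0, 0.0],
    "car": [0.0, 0.2, 1.0],
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
print(best_label, probs)
```

Because the label set is just a dictionary of text embeddings, changing the taxonomy means re-embedding prompts rather than retraining, which is what "open-vocabulary" refers to here.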

The model achieves a zero-shot top-1 accuracy of 79.2% on ImageNet-1k, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points under the same training procedure and compute budget. Because the procedure and budget are held fixed, the comparison isolates the effect of the training data. The result comes from the standardized training and evaluation pipeline defined by the DataComp benchmark, which evaluates models on 38 diverse downstream test sets spanning classification and retrieval.
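As a small illustration of the metric, top-1 accuracy is the fraction of examples whose highest-scoring label matches the ground truth. The sketch below uses an invented four-item toy set; the only real numbers are the 79.2% and 3.7-point figures quoted above, from which the implied baseline is derived.

```python
def top1_accuracy(predictions, labels):
    """Fraction of examples where the single top prediction equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy data, invented for illustration: 3 of 4 predictions are correct.
preds = ["cat", "dog", "car", "cat"]
truth = ["cat", "dog", "cat", "cat"]
acc = top1_accuracy(preds, truth)

# Baseline implied by the figures quoted in the text (79.2% minus 3.7 points).
baseline = round(79.2 - 3.7, 1)
print(acc, baseline)
```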

Once hosted on gigarouter, the model can be used for zero-shot image classification, image-text retrieval, and as a foundation for linear probing or fine-tuning. It is a research-oriented model: accuracy varies with the class taxonomy used, so any deployed use case requires thorough domain-specific testing first. Surveillance and facial recognition applications are explicitly out of scope.


FAQ

What is CLIP ViT-L-14 DataComp?

It is a CLIP model with a ViT-L/14 architecture trained on the DataComp-1B dataset, enabling zero-shot image classification and text-image matching.

How does it compare to OpenAI's CLIP ViT-L/14?

It outperforms OpenAI's CLIP ViT-L/14 by 3.7 percentage points on ImageNet zero-shot accuracy (79.2% vs 75.5%) under the same training procedure and compute budget.

What training data was used?

The model was trained on DataComp-1B, a dataset of 1.4 billion image-text pairs curated from Common Crawl.

How can I use this model via the gigarouter API?

The model is not yet live. Once it is, you will be able to call it through an OpenAI-compatible endpoint: provide your API key and send image and text inputs in the format the endpoint documentation specifies.

What input formats does it accept?

The model accepts images (as URLs or base64-encoded) and text prompts for zero-shot classification or retrieval tasks.
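As a rough sketch of the base64 option, the snippet below encodes raw image bytes as a data URL and packs it into a JSON body alongside candidate labels. The field names `image` and `candidate_labels` and the overall payload shape are assumptions for illustration, not the documented gigarouter schema, and no request is sent. Check the endpoint documentation once the model is live.

```python
import base64
import json


def image_to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_request_body(image, candidate_labels):
    """Build a JSON request body.

    The keys "image" and "candidate_labels" are hypothetical; the real
    schema is whatever the endpoint documentation defines.
    """
    return json.dumps(
        {
            "model": "laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K",
            "image": image,  # either an https URL or a data URL
            "candidate_labels": list(candidate_labels),
        }
    )


# Stand-in bytes (just the PNG file signature) instead of a real image file.
fake_png = b"\x89PNG\r\n\x1a\n"
data_url = image_to_data_url(fake_png)
body = build_request_body(data_url, ["cat", "dog"])
print(body)
```

For a real image, read the file in binary mode and pass the resulting bytes to `image_to_data_url` with the matching MIME type.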

not yet live

We're benchmarking and onboarding CLIP ViT-L-14 DataComp as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
