CLIP ViT-L-14 DataComp
laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K
published Apr 2023 · updated May 2023
CLIP ViT-L-14 DataComp is a zero-shot image-text model that classifies and retrieves images by matching them against text, using a ViT-L/14 Vision Transformer backbone trained on the DataComp-1B dataset.
specs
| Spec | Value |
| --- | --- |
| Task | Zero-shot image classification, image-text retrieval |
| Architecture | ViT-L/14 |
| Training Data | DataComp-1B (1.4 billion image-text pairs) |
| ImageNet Accuracy | 79.2% zero-shot top-1 |
about this model
CLIP ViT-L/14 (DataComp-1B) is a zero-shot image classification and retrieval model that embeds images and text into a shared space, enabling open-vocabulary recognition without task-specific fine-tuning. It uses a Vision Transformer (ViT-L/14) architecture and was trained on 1.4 billion image-text pairs from the DataComp-1B dataset, which was curated as part of the DataComp benchmark for dataset design.
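The shared embedding space is what makes open-vocabulary classification possible: each candidate label is written as a text prompt, embedded, and compared with the image embedding. The following is a minimal sketch of that scoring step only, not the model's actual inference code. The toy 3-dimensional vectors stand in for real encoder outputs, and the `logit_scale` of 100 is an assumption based on the temperature commonly used by CLIP-style models.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_embedding, text_embeddings, logit_scale=100.0):
    """Turn image-to-prompt cosine similarities into label probabilities.

    logit_scale is an assumed temperature, not a value taken from this model.
    """
    image = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy embeddings (hypothetical); a real encoder returns much larger vectors.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_probs(image_embedding, text_embeddings)
best = labels[max(range(len(labels)), key=probs.__getitem__)]
print(best)
```

Because the labels are plain text prompts, changing the category set only means changing the `labels` list and re-embedding it; no retraining is involved.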
The model achieves a zero-shot top-1 accuracy of 79.2% on ImageNet-1k, outperforming OpenAI’s CLIP ViT-L/14 by 3.7 percentage points under the same training procedure and compute budget. This result comes from the standardized training and evaluation pipeline defined by the DataComp benchmark, which tests on 38 diverse downstream datasets spanning classification, retrieval, and other vision-language tasks.
Once hosted as an API on gigarouter, the model can be used for zero-shot image classification, image-text retrieval, and as a foundation for linear probing or fine-tuning. It is a research-oriented model; deployed use cases require thorough domain-specific testing because performance varies with the class taxonomy used. Surveillance and facial recognition applications are explicitly out of scope.
best for
- Zero-shot image classification with custom categories
- Image-text retrieval for search and recommendation
- Building multimodal search engines
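For the retrieval and search use cases, a common pattern is to embed every image once, then rank the stored embeddings against the embedding of a text query. The sketch below illustrates that ranking step under simple assumptions: the file names and 3-dimensional vectors are made up, all embeddings are non-zero, and a brute-force scan stands in for the approximate nearest-neighbor index a production system would more likely use.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_embedding, index, k=2):
    """Return the k (item_id, score) pairs most similar to the query."""
    scored = [
        (item_id, cosine(query_embedding, embedding))
        for item_id, embedding in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical precomputed image embeddings keyed by file name.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.2, 0.1, 0.9],
}

# Hypothetical embedding of a text query such as "sunny coastline".
query = [0.8, 0.3, 0.1]

results = top_k(query, image_index, k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

The same function works in the other direction (an image query against a text index), since both modalities live in the same space.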
FAQ
CLIP ViT-L-14 DataComp is a CLIP model with a ViT-L/14 architecture trained on the DataComp-1B dataset, enabling zero-shot image classification and text-image matching.
The model outperforms OpenAI's CLIP ViT-L/14 by 3.7 percentage points on ImageNet zero-shot accuracy (79.2% vs 75.5%) under the same training procedure and compute budget.
The model was trained on DataComp-1B, a dataset of 1.4 billion image-text pairs curated from Common Crawl.
You can call the model using an OpenAI-compatible endpoint. Provide your API key and send image and text inputs as required by the endpoint documentation.
The model accepts images (as URLs or base64-encoded) and text prompts for zero-shot classification or retrieval tasks.
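When an image is sent inline rather than by URL, it first has to be base64-encoded. The sketch below shows that encoding step and wraps the result in a JSON body. The request schema here is purely hypothetical: the field names `model`, `image`, and `labels` are illustrative assumptions, not the documented API, so consult the endpoint documentation for the real shape.

```python
import base64
import json


def image_to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_request_body(image_ref, candidate_labels):
    """Serialize a request body.

    The field names below are hypothetical placeholders, not a
    documented schema.
    """
    return json.dumps(
        {
            "model": "laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K",
            "image": image_ref,  # an image URL or a base64 data URL
            "labels": candidate_labels,
        }
    )


# The 8-byte PNG file signature stands in for a real image file's contents.
png_signature = b"\x89PNG\r\n\x1a\n"
data_url = image_to_data_url(png_signature)
body = build_request_body(data_url, ["cat", "dog"])
print(body)
```

In practice the bytes would come from reading an image file in binary mode, and the body would be sent with your API key in the request headers.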
We're benchmarking and onboarding CLIP ViT-L-14 DataComp as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.