CLIP ViT-L/14
openai/clip-vit-large-patch14
published Mar 2022 · updated Sep 2023
CLIP ViT-L/14 is a zero-shot image classification model that learns visual concepts from natural language supervision and can classify images without task-specific training.
specs
| Spec | Value |
| --- | --- |
| Task | Zero-shot image classification and image-text similarity |
| Architecture | ViT-L/14 Vision Transformer image encoder + masked self-attention Transformer text encoder |
| Parameters | ~428M |
| License | MIT |
about this model
openai/clip-vit-large-patch14 is a zero-shot image classification model that uses contrastive language-image pre-training to classify images without task-specific fine-tuning. It encodes images and natural language text into a shared embedding space, enabling it to predict the most relevant text caption for a given image by selecting from user-provided candidate labels.
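As a rough illustration of how selecting from candidate labels works once both inputs live in the shared embedding space, the sketch below scores one image embedding against several label embeddings with cosine similarity and turns the scores into probabilities with a softmax. The three-dimensional vectors are made-up toy values standing in for real encoder outputs, the "a photo of a ..." prompts are illustrative, and the `logit_scale` of 100 is an assumption, not a value stated on this page.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, logit_scale=100.0):
    """Return a {label: probability} dict for one image.

    logit_scale is an assumed temperature multiplier, not taken from this page.
    """
    image = l2_normalize(image_embedding)
    logits = []
    for text_embedding in label_embeddings.values():
        text = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)
    return dict(zip(label_embeddings.keys(), softmax(logits)))


# Toy embeddings (hypothetical); a real pipeline would get these from the encoders.
label_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

Changing the candidate labels changes the classification task without any retraining, which is the sense in which the model is zero-shot.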
Architecture and Training
The model uses a Vision Transformer (ViT-L/14) as the image encoder and a masked self-attention Transformer as the text encoder. Both encoders were trained jointly on 400 million (image, text) pairs collected from the internet, using a contrastive loss that maximizes the similarity of matching pairs and minimizes the similarity of mismatched pairs within each batch. The ViT-L/14 variant was released in January 2022, with a higher-resolution 336px version following in April 2022.
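The contrastive objective can be sketched as a symmetric cross-entropy over a batch similarity matrix, which is the formulation described in the original CLIP paper. The code below is a minimal, pure-Python sketch on toy vectors, not the training code. The function names are hypothetical, and the default `logit_scale` of 1/0.07 is an assumption based on the paper's stated temperature initialization.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the index of the correct class."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def clip_style_loss(image_embeddings, text_embeddings, logit_scale=1 / 0.07):
    """Symmetric contrastive loss; pair i of images matches pair i of texts."""
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    n = len(images)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]
    # Image-to-text: each row should put its mass on the diagonal entry.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should put its mass on the diagonal entry.
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy batch: text k is close to image k in the matched case.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]
shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]

matched_loss = clip_style_loss(images, matched_texts)
shuffled_loss = clip_style_loss(images, shuffled_texts)
print(matched_loss < shuffled_loss)
```

The loss is low when each image is most similar to its own caption and high when captions are assigned to the wrong images.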
Performance and Benchmarks
CLIP zero-shot performance matches the accuracy of a fully supervised ResNet-50 on ImageNet without using any of the 1.28 million labeled training examples. The model was evaluated on over 30 diverse computer vision benchmarks spanning OCR (e.g., MNIST, SVHN, IIIT5K), action recognition (UCF101, Kinetics700), geo-localization (Country211), fine-grained classification (Food101, Stanford Cars, FGVC Aircraft), and natural image datasets (CIFAR-10/100, Caltech101, Flowers102). Detailed results per dataset are provided in the original paper.
Fairness evaluations on the FairFace dataset found that CLIP achieved greater than 96% accuracy across all race groups for gender classification, with an average of about 93% for racial classification and about 63% for age classification. These numbers describe model behavior on one benchmark; interpret them in light of how the dataset was labeled and of the uses the model card marks as out of scope.
Key Strengths
- Zero-shot transfer – no dataset-specific training or labels required; natural language defines the classification task.
- Broad applicability – performs non-trivially across OCR, action recognition, geo-localization, fine-grained recognition, and general object classification.
- Competitive baseline – matches or approaches fully supervised performance on many benchmarks (e.g., ImageNet zero-shot).
Once onboarded as a managed API on gigarouter, the model will be served through an OpenAI-compatible interface. Users submit an image and a set of candidate text labels and receive similarity scores or predicted probabilities, without managing infrastructure or model weights.
best for
- Zero-shot image classification without task-specific training
- Image-text similarity search and retrieval
- General-purpose visual understanding across diverse domains
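For the similarity search and retrieval use case listed above, a common pattern is to embed a gallery of images once, embed the text query at search time, and rank images by cosine similarity. The following is a minimal sketch under that assumption. The file names and vectors are made up, and `rank_images` is a hypothetical helper rather than part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(query_embedding, image_index, top_k=2):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in image_index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical precomputed image embeddings keyed by file name.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.2],
    "forest.jpg": [0.1, 0.8, 0.3],
    "city.jpg": [0.2, 0.2, 0.9],
}
# Hypothetical embedding of a text query such as "a sandy beach".
query_embedding = [0.85, 0.15, 0.1]

results = rank_images(query_embedding, image_index)
for name, score in results:
    print(name, round(score, 3))
```

For large galleries, the linear scan would typically be replaced by an approximate nearest-neighbor index, but the ranking criterion stays the same.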
FAQ
CLIP ViT-L/14 is best for zero-shot image classification and image-text similarity tasks, where you can describe categories in natural language without training a custom model.
ViT-L/14 has ~428M parameters, making it one of the largest CLIP variants, offering higher accuracy at the cost of more compute compared to smaller models like ViT-B/32.
The model is released under the MIT license.
The model expects an image (e.g., JPEG or PNG) and a list of text prompts. The image is preprocessed to 224x224 pixels (336x336 for the ViT-L/14@336px variant).
Use the gigarouter OpenAI-compatible endpoint with your API key, sending the image and text prompts in the request body.
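The 224x224 and 336x336 input sizes mentioned above, combined with the 14-pixel patch size in the model name, determine how many tokens the image encoder processes. The short calculation below works this out. `patch_grid` is a hypothetical helper, and the extra token assumes a single class token is prepended to the patch sequence, as is standard for Vision Transformers.

```python
def patch_grid(image_size, patch_size=14):
    """Return (patches per side, total patches, sequence length with class token)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    patches = side * side
    # Assumption: one class token is prepended to the patch sequence.
    return side, patches, patches + 1


for size in (224, 336):
    side, patches, tokens = patch_grid(size)
    print(f"{size}x{size}: {side}x{side} grid, {patches} patches, {tokens} tokens")
```

The 336px variant processes 2.25 times as many patches as the 224px one (576 versus 256), which is where its additional compute cost comes from.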
We're benchmarking and onboarding CLIP ViT-L/14 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.