CLIP ViT-L/14

openai/clip-vit-large-patch14

published Mar 2022 · updated Sep 2023

CLIP ViT-L/14 is a zero-shot image classification model that learns visual concepts from natural language supervision and can classify images without task-specific training.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

12.4M

specs

Task	Zero-shot image classification and image-text similarity
Architecture	ViT-L/14 Vision Transformer image encoder + masked self-attention Transformer text encoder
Parameters	~428M (ViT-L/14)
License	MIT

about this model

openai/clip-vit-large-patch14 is a zero-shot image classification model that uses contrastive language-image pre-training to classify images without task-specific fine-tuning. It encodes images and natural language text into a shared embedding space, enabling it to predict the most relevant text caption for a given image by selecting from user-provided candidate labels.
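
The label-selection step described above can be sketched in a few lines. This is a minimal illustration, not the model's actual inference code: the 3-dimensional embeddings are made-up toy values standing in for real encoder outputs, the function names are hypothetical, and the `logit_scale` of 100 is an assumed temperature chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, label_embeddings, logit_scale=100.0):
    """Return {label: probability} from scaled cosine similarities.

    logit_scale is an assumed temperature, not a value read from the checkpoint.
    """
    labels = list(label_embeddings)
    logits = [
        logit_scale * cosine_similarity(image_embedding, label_embeddings[label])
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))

# Toy embeddings (hypothetical); real ones come from the image and text encoders.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.1, 1.0, 0.0],
    "a photo of a car": [0.0, 0.2, 1.0],
}

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

Because the probabilities are a softmax over the supplied labels only, they are relative to the candidate set: adding or removing a label changes every score.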

Architecture and Training

The model uses a Vision Transformer (ViT-L/14) as the image encoder and a masked self-attention Transformer as the text encoder. Both encoders were trained jointly on 400 million (image, text) pairs collected from the internet, using a contrastive loss that raises the similarity of matching pairs and lowers the similarity of mismatched pairs within each batch. The ViT-L/14 variant was released in January 2022, with a higher-resolution 336px version following in April 2022.
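
The contrastive objective can be illustrated with a small worked example. The sketch below assumes the symmetric cross-entropy form described in the CLIP paper, applied to a precomputed matrix of scaled similarities; the function names and the toy 3x3 matrices are hypothetical, and real training operates on much larger batches of encoder outputs.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]

def clip_contrastive_loss(logits):
    """Symmetric cross-entropy over an N x N similarity matrix.

    logits[i][j] is the scaled similarity of image i and text j;
    the matching pairs sit on the diagonal.
    """
    n = len(logits)
    # Image-to-text direction: each row should peak at its own index.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each column should peak at its own index.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batches (hypothetical values).
aligned = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
shuffled = [[0.0, 10.0, 0.0], [0.0, 0.0, 10.0], [10.0, 0.0, 0.0]]
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_aligned = clip_contrastive_loss(aligned)
loss_shuffled = clip_contrastive_loss(shuffled)
loss_uniform = clip_contrastive_loss(uniform)
print(loss_aligned, loss_shuffled, loss_uniform)
```

The loss is near zero when each image scores highest against its own caption, equals log N when the encoders cannot distinguish pairs at all, and grows further when the highest scores land on the wrong captions.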

Performance and Benchmarks

CLIP zero-shot performance matches the accuracy of a fully supervised ResNet-50 on ImageNet without using any of the 1.28 million labeled training examples. The model was evaluated on over 30 diverse computer vision benchmarks spanning OCR (e.g., MNIST, SVHN, IIIT5K), action recognition (UCF101, Kinetics700), geo-localization (Country211), fine-grained classification (Food101, Stanford Cars, FGVC Aircraft), and natural image datasets (CIFAR-10/100, Caltech101, Flowers102). Detailed results per dataset are provided in the original paper.

Fairness evaluations on the FairFace dataset found that CLIP achieved greater than 96% accuracy across all race groups for gender classification, with averages of about 93% for racial classification and about 63% for age classification. These numbers describe model behavior on one labeled dataset and should be read with its labeling choices and the model's stated out-of-scope uses in mind.

Key Strengths

·Zero-shot transfer – no dataset-specific training or labels required; natural language defines the classification task.
·Broad applicability – performs non-trivially across OCR, action recognition, geo-localization, fine-grained recognition, and general object classification.
·Competitive baseline – matches or approaches fully supervised performance on many benchmarks (e.g., ImageNet zero-shot).
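
Since natural language defines the task, class names are usually wrapped in a short sentence before encoding. The sketch below uses the "a photo of a {}." template reported in the CLIP paper; the helper name `build_prompts` and the example class names are hypothetical.

```python
def build_prompts(class_names, template="a photo of a {}."):
    """Turn bare class names into sentence-style prompts for the text encoder."""
    return [template.format(name) for name in class_names]

class_names = ["golden retriever", "tabby cat", "school bus"]
prompts = build_prompts(class_names)

# A domain hint in the template can help on fine-grained datasets.
food_prompts = build_prompts(
    ["ramen", "tiramisu"], template="a photo of {}, a type of food."
)

# Keep the mapping back to the original class name for reporting results.
prompt_to_class = dict(zip(prompts, class_names))
print(prompts)
```

Changing the task then means changing only the list of class names or the template, with no retraining.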

Once onboarded as a managed API on gigarouter, the model will be available for inference through an OpenAI-compatible interface. Users submit images and a set of candidate text labels and receive similarity scores or predicted probabilities, without managing infrastructure or model weights.

best for

·Zero-shot image classification without task-specific training
·Image-text similarity search and retrieval
·General-purpose visual understanding across diverse domains
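
For the similarity search and retrieval use case, a text query is embedded once and compared against a gallery of precomputed image embeddings. The sketch below is a brute-force ranking by cosine similarity; the file names and 3-dimensional vectors are hypothetical toy values, and a production system would likely use a vector index instead of a linear scan.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def top_k(query_embedding, gallery, k=2):
    """Return the k gallery items most similar to the query, best first."""
    query = normalize(query_embedding)
    scored = []
    for item_id, embedding in gallery.items():
        candidate = normalize(embedding)
        score = sum(a * b for a, b in zip(query, candidate))
        scored.append((score, item_id))
    scored.sort(reverse=True)
    return [(item_id, score) for score, item_id in scored[:k]]

# Toy embeddings (hypothetical); real ones come from the image and text encoders.
gallery = {
    "beach.jpg": [0.1, 1.0, 0.0],
    "city.jpg": [1.0, 0.1, 0.2],
    "forest.jpg": [0.0, 0.3, 1.0],
}
text_query_embedding = [0.2, 0.9, 0.1]

results = top_k(text_query_embedding, gallery, k=2)
print(results)
```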

FAQ

What is CLIP ViT-L/14 best used for?

It is best for zero-shot image classification and image-text similarity tasks, where you can describe categories in natural language without training a custom model.

How does CLIP ViT-L/14 compare in size to other CLIP variants?

ViT-L/14 has ~428M parameters, which makes it one of the largest CLIP variants. It offers higher accuracy than smaller models such as ViT-B/32 at the cost of more compute.

What is the license for CLIP ViT-L/14?

The model is released under the MIT license.

What input format does the model expect?

It expects an image (e.g., JPEG or PNG) and a list of text prompts. The image is preprocessed to 224x224 pixels (or 336x336 for the @336px variant).
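
The resolution matters because the "14" in ViT-L/14 is the patch size in pixels, so the input size determines how many tokens the image encoder processes. The sketch below works through that arithmetic and shows per-channel normalization; the mean and standard deviation values are the ones commonly published with CLIP preprocessing configs and should be checked against the checkpoint's own preprocessor settings. The function names are hypothetical.

```python
# Assumed normalization constants; verify against the checkpoint's preprocessor config.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def vit_sequence_length(image_size, patch_size=14):
    """Number of tokens the image encoder sees: one per patch plus a class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1

def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to normalized per-channel values."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )

tokens_224 = vit_sequence_length(224)  # 16 x 16 patches + 1 class token
tokens_336 = vit_sequence_length(336)  # 24 x 24 patches + 1 class token
white = normalize_pixel((255, 255, 255))
print(tokens_224, tokens_336, white)
```

The 336px variant processes a little more than twice as many tokens per image (577 versus 257), which is the main source of its extra compute cost.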

How can I call CLIP ViT-L/14 via the gigarouter API?

The model is not yet live on gigarouter. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key and send the image and text prompts in the request body.

not yet live

We're benchmarking and onboarding CLIP ViT-L/14 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336

siglip-so400m-patch14-384

1.8M dl/mo