CLIP ViT-L/14 336px
openai/clip-vit-large-patch14-336
published Apr 2022 · updated Oct 2022
CLIP ViT-L/14 336px is a zero-shot image classification model that uses contrastive learning on 400M image-text pairs to match images with arbitrary text descriptions.
specs
| Field | Value |
| --- | --- |
| Task | Zero-Shot Image Classification |
| Architecture | ViT-L/14 with 336x336 input resolution |
| Parameters | 428M |
| License | MIT |
| Image Resolution | 336x336 |
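To make the architecture row concrete, here is a small worked calculation of how many tokens the vision encoder sees per image. It assumes the usual ViT layout of non-overlapping square patches plus one class token; the function name is illustrative only.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens a ViT processes for a square image: patches + class token."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1  # +1 for the class token


# 336 / 14 = 24 patches per side, so 24 * 24 + 1 tokens.
print(vit_sequence_length(336, 14))  # 577
# The same patch size at 224x224 input gives 16 * 16 + 1 tokens.
print(vit_sequence_length(224, 14))  # 257
```

The higher input resolution more than doubles the sequence length, which is the main cost of the 336px variant relative to a 224px input.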
about this model
OpenAI CLIP ViT-L/14@336px is a zero-shot image classification model that maps images and text into a shared embedding space, enabling classification on arbitrary visual categories without task-specific training. The model uses a Vision Transformer (ViT-L/14) architecture with 428 million parameters, processing images at 336x336 pixel resolution and text with a context length of 77 tokens. It was trained on 400 million (image, text) pairs collected from the internet, as described in the CLIP paper (arXiv:2103.00020).
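As a rough illustration of what "processing images at 336x336" involves, the sketch below computes the resize and center-crop geometry and the per-channel normalization commonly used with CLIP. It assumes the standard CLIP recipe (scale the shorter side to 336, center-crop, normalize with CLIP's published mean and standard deviation); the helper names are hypothetical, and the rounding may differ slightly from any particular library's implementation.

```python
# Per-channel RGB statistics used by OpenAI's CLIP preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def resize_and_crop_box(width, height, target=336):
    """Return the resized (w, h) and the center-crop box (left, top, right, bottom)."""
    scale = target / min(width, height)  # scale so the shorter side equals target
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to the normalized values the model expects."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


size, box = resize_and_crop_box(1280, 720)
print(size, box)  # (597, 336) (130, 0, 466, 336)
```

In practice an image library would perform the actual resampling; this only shows which region of the original image survives the crop.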
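On the ImageNet zero-shot classification benchmark, ViT-L/14@336px achieves 76.2% top-1 accuracy, outperforming the standard ViT-L/14 model (75.3%) and the larger ResNet101x3 (73.6%). Internally, the vision encoder has a hidden width of 1024 and the text encoder a hidden width of 768; both outputs are projected into the shared 768-dimensional embedding space, where image and text embeddings can be compared directly.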
Key capabilities include:
- Zero-shot classification across any visual categories defined by natural language prompts
- Image-text similarity scoring for retrieval and ranking tasks
- Greater robustness to distribution shift and out-of-distribution inputs than supervised models of similar size
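The first two capabilities reduce to the same computation: compare one image embedding against several text embeddings and turn the similarities into scores. The following is a minimal sketch of that step in plain Python, using toy 3-dimensional vectors in place of real model embeddings. The `logit_scale` of 100 is an assumption that approximates CLIP's learned temperature, and the function names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity: dot product of the unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled image-text cosine similarities (logit_scale is assumed)."""
    logits = [logit_scale * cosine_similarity(image_emb, t) for t in text_embs]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vectors standing in for embeddings returned by the model.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.2, 0.9, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
print(labels[probs.index(max(probs))])  # a photo of a dog
```

Because the labels are just text, changing the category set only means embedding a different list of prompts; nothing is retrained.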
The model is hosted by Gigarouter as a managed, OpenAI-compatible API with no installation required. It has received 306 likes on Hugging Face and accumulated over 216 million total downloads as of the latest data. The CLIP code is released under the MIT license.
best for
- Classifying images into custom categories without any training data
- Image retrieval and search by natural language text queries
- Zero-shot object recognition and labeling
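For the retrieval use case, a common pattern is to embed all images once, then rank them against each text query by cosine similarity. The sketch below shows that ranking step only, with toy vectors and hypothetical file names; it is not a description of how any particular service implements search.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_images(query_emb, image_index, k=2):
    """Return the k best (score, image_id) pairs, highest similarity first."""
    scored = ((cosine(query_emb, emb), image_id) for image_id, emb in image_index.items())
    return heapq.nlargest(k, scored)


# Hypothetical precomputed image embeddings (toy 3-dimensional vectors).
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]  # stands in for the embedding of a text query

for score, image_id in top_k_images(query, index):
    print(f"{image_id}: {score:.3f}")
```

For large collections the linear scan would typically be replaced by an approximate nearest-neighbor index, but the scoring function stays the same.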
FAQ
What is CLIP ViT-L/14 336px best for?
It is best for zero-shot classification: you can classify images into any set of categories by providing text prompts, with no training examples required.
How many parameters does it have?
It has 428 million parameters.
What image size does it expect?
It expects images resized to 336x336 pixels.
What license is it released under?
It is released under the MIT license.
How do I use it?
Send a POST request to the Gigarouter OpenAI-compatible endpoint with your API key, providing an image and a list of candidate text labels. The API returns a similarity score for each label.
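The request described above might be assembled along these lines. This is a sketch under stated assumptions, not the documented API: the URL is a placeholder, the JSON field names (`image`, `candidate_labels`) are hypothetical, and bearer-token authentication is assumed only because it is the usual convention for OpenAI-compatible services. The code builds the request without sending it.

```python
import base64
import json
import urllib.request

# Placeholder URL: the real endpoint and request schema are not given in this page.
API_URL = "https://api.example.com/v1/zero-shot-classification"


def build_request(api_key, image_bytes, labels,
                  model="openai/clip-vit-large-patch14-336"):
    """Build (but do not send) a POST request with an image and candidate labels."""
    payload = {
        "model": model,
        # Hypothetical field names; check the provider's documentation.
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "candidate_labels": list(labels),
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request("YOUR_API_KEY", b"raw image bytes", ["cat", "dog"])
print(request.get_method(), request.full_url)
```

Sending it would be a call to `urllib.request.urlopen(request)` once the real endpoint and schema are known.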
We're benchmarking and onboarding CLIP ViT-L/14 336px as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.