CLIP ViT-L/14 336px

openai/clip-vit-large-patch14-336

published Apr 2022 · updated Oct 2022

CLIP ViT-L/14 336px is a zero-shot image classification model trained with contrastive learning on 400M image-text pairs to match images with arbitrary text descriptions.

status

coming soon

downloads / mo

3.4M

specs

Task	Zero-Shot Image Classification
Architecture	ViT-L/14 with 336x336 input resolution
Parameters	428M
License	MIT
Image Resolution	336x336

about this model

OpenAI CLIP ViT-L/14@336px is a zero-shot image classification model that maps images and text into a shared embedding space, enabling classification on arbitrary visual categories without task-specific training. The model uses a Vision Transformer (ViT-L/14) architecture with 428 million parameters, processing images at 336x336 pixel resolution and text with a context length of 77 tokens. It was trained on 400 million (image, text) pairs collected from the internet, as described in the CLIP paper (arXiv:2103.00020).
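To make the resolution and patch size concrete, here is a small sketch of how many tokens the vision transformer processes per image. It assumes the usual ViT layout of one token per non-overlapping 14x14 patch plus a single class token; the 224-pixel comparison refers to the standard ViT-L/14 checkpoint, and the function name is illustrative only.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Tokens seen by the vision transformer: one per patch plus a class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1  # +1 for the class token


# 336 / 14 = 24 patches per side
print(vit_sequence_length(336, 14))  # 24 * 24 + 1 = 577
# 224 / 14 = 16 patches per side (standard ViT-L/14 resolution)
print(vit_sequence_length(224, 14))  # 16 * 16 + 1 = 257
```

The higher resolution therefore more than doubles the number of image tokens, which is the main cost of the 336px variant.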

On the ImageNet zero-shot classification benchmark, ViT-L/14@336px achieves 76.2% top-1 accuracy, ahead of the standard ViT-L/14 model (75.3%) and the largest ResNet-based CLIP model, RN50x64 (73.6%). The vision encoder has a hidden width of 1024 and the text encoder a hidden width of 768; both outputs are projected into the shared 768-dimensional embedding space.

Key capabilities include:

·Zero-shot classification across any visual categories defined by natural language prompts
·Image-text similarity scoring for retrieval and ranking tasks
·Greater robustness to distribution shift and better out-of-distribution generalization than supervised models of similar size
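The first two capabilities come down to the same computation: compare an image embedding with one text embedding per candidate prompt. The sketch below illustrates that scoring step with toy 3-dimensional vectors in place of real model outputs. The logit scale of 100 is an assumption based on the temperature range used by CLIP, and the helper names are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts."""
    image_emb = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        text_emb = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(image_emb, text_emb))
        logits.append(logit_scale * cosine)
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (assumed values, not real model outputs).
image = [0.9, 0.1, 0.0]
texts = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}
probs = zero_shot_probs(image, list(texts.values()))
best = max(zip(texts, probs), key=lambda pair: pair[1])[0]
print(best)
```

For retrieval, the same cosine similarities can be used directly to rank images against a text query, without the softmax.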

Gigarouter is onboarding the model as a managed, OpenAI-compatible API that requires no installation. On Hugging Face it has received 306 likes and more than 216 million total downloads as of the latest available data. The CLIP code is released under the MIT license.

best for

·Classifying images into custom categories without any training data
·Image retrieval and search by natural language text queries
·Zero-shot object recognition and labeling

FAQ

What is this model best for?

It is best for zero-shot classification, where you can classify images into any set of categories by providing text prompts, without needing any training examples.
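As an illustration, the sketch below runs the checkpoint locally with the Hugging Face transformers library. It assumes torch, Pillow, and transformers are installed; the "a photo of a {}" template is a common prompt convention rather than a requirement, and the helper names are illustrative.

```python
MODEL_ID = "openai/clip-vit-large-patch14-336"


def build_prompts(labels, template="a photo of a {}"):
    """Wrap each bare label in a natural-language prompt."""
    return [template.format(label) for label in labels]


def classify(image_path, labels):
    """Return a {label: probability} dict for one image."""
    # Heavy dependencies are imported lazily so the prompt helper stays lightweight.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(MODEL_ID)
    processor = CLIPProcessor.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=build_prompts(labels),
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_images, num_prompts).
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return dict(zip(labels, probs.tolist()))
```

A call such as classify("photo.jpg", ["cat", "dog"]) (the file name is a placeholder) downloads the weights on first use and returns one probability per label.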

How many parameters does the model have?

It has 428 million parameters.

What input image size does the model expect?

It expects images resized to 336x336 pixels.
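In practice, preprocessing for CLIP typically resizes the shorter side to 336 pixels, center-crops a 336x336 square, and normalizes each channel. The sketch below works through that geometry under those assumptions. Exact rounding can differ between image libraries, and the function names are illustrative.

```python
# Per-channel normalization constants used by CLIP preprocessing (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def resize_and_crop_box(width, height, target=336):
    """Return the resized (w, h) and the center-crop box (left, top, right, bottom)."""
    scale = target / min(width, height)
    # max() guards against floating-point rounding shrinking the short side.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


def normalize_pixel(rgb):
    """Map 0-255 RGB values to the normalized range the model expects."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


# A 1280x720 frame is resized to 597x336, then cropped to the central square.
print(resize_and_crop_box(1280, 720))  # ((597, 336), (130, 0, 466, 336))
```

Because of the center crop, content near the edges of very wide or very tall images is discarded, which is worth keeping in mind for panoramic inputs.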

What license is the model released under?

It is released under the MIT license.

How can I use this model via the Gigarouter API?

Once the model is live, send a POST request to the Gigarouter OpenAI-compatible endpoint with your API key, an image, and a list of candidate text labels. The API returns a similarity score for each label.
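Since the endpoint is not yet live, the following is only a sketch of what such a request might look like. The URL, the JSON field names, and the base64 image encoding are all hypothetical placeholders, not a documented API; the code builds the request without sending it.

```python
import base64
import json
import urllib.request

# HYPOTHETICAL placeholder URL: the real endpoint has not been published.
API_URL = "https://api.example.invalid/v1/zero-shot-image-classification"


def build_request(image_bytes, labels, api_key):
    """Build (but do not send) a POST request with an image and candidate labels."""
    payload = {
        # Field names below are assumptions for illustration only.
        "model": "openai/clip-vit-large-patch14-336",
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "candidate_labels": list(labels),
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(b"fake-image-bytes", ["cat", "dog"], api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Once the real endpoint and schema are documented, the request could be sent with urllib.request.urlopen(req) and the returned scores read from the JSON response.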

not yet live

We're benchmarking and onboarding CLIP ViT-L/14 336px as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

siglip-so400m-patch14-384

1.8M dl/mo