CLIP ViT-H/14

laion/CLIP-ViT-H-14-laion2B-s32B-b79K

published Sep 2022 · updated Jan 2025

CLIP ViT-H/14 is a zero-shot image model for classification and text-image retrieval, trained with OpenCLIP on 2 billion English image-text pairs from LAION-5B.

est. price

~$0.235

/ 1k images · estimated, set at launch
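As a rough illustration of what the listed estimate implies at different volumes, the arithmetic can be sketched as below. The only figure taken from this page is the ~$0.235 per 1k images estimate; the function name `estimate_cost` is hypothetical, and actual billing may differ once the model launches.

```python
# Listed estimate from this page: about $0.235 per 1,000 images.
# This is an estimate set at launch, not a confirmed price.
EST_PRICE_PER_1K_IMAGES = 0.235


def estimate_cost(num_images, price_per_1k=EST_PRICE_PER_1K_IMAGES):
    """Return the estimated cost in dollars for processing num_images images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return num_images / 1000 * price_per_1k


for n in (1_000, 50_000, 1_000_000):
    print(f"{n:>9,} images: ~${estimate_cost(n):,.2f}")
```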

downloads / mo

416K

license

mit

specs

Task	Zero-shot image classification, image and text retrieval
Architecture	CLIP ViT-H/14
Training Data	LAION-2B English subset of LAION-5B
Zero-shot ImageNet-1k Top-1	78.0%
Framework	OpenCLIP

about this model

CLIP ViT-H/14 (LAION-2B) is a zero-shot image classification model that maps images and text into a shared embedding space, enabling classification and retrieval without task-specific fine-tuning. It was trained on the 2-billion-sample English subset of LAION-5B using OpenCLIP, with compute provided by Stability AI.

Capabilities

The model supports zero-shot image classification, image-text retrieval, linear probing, and captioning evaluation. It can classify images into arbitrary categories defined by natural language prompts, making it suitable for open-vocabulary tasks.
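The scoring step behind prompt-based classification can be sketched in a few lines. This is a minimal illustration, not the OpenCLIP implementation: the 4-dimensional vectors are made-up placeholders standing in for real image and text embeddings, `zero_shot_probs` is a hypothetical helper, and the logit scale of 100 is an assumption modeled on the value commonly reported for CLIP-style models.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Placeholder vectors (assumption): real embeddings would come from the model.
text_embs = [[0.9, 0.1, 0.0, 0.1], [0.1, 0.9, 0.1, 0.0], [0.0, 0.1, 0.9, 0.2]]
image_emb = [0.8, 0.2, 0.1, 0.1]

probs = zero_shot_probs(image_emb, text_embs)
predicted = prompts[max(range(len(probs)), key=probs.__getitem__)]
print(predicted)
```

Changing the category set only requires new prompt embeddings; no retraining is involved, which is what makes the approach open-vocabulary.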

Training Data

Training used the LAION-2B English subset of LAION-5B, an uncurated dataset of image–alt-text pairs crawled from the public internet. The dataset metadata is licensed under CC-BY 4.0; images remain under their original copyright. Because the dataset is uncurated, it may contain harmful content, and the model is intended for research use only.

Benchmark Performance

The model achieves 78.0% zero-shot top-1 accuracy on ImageNet-1k. Evaluation on the VTAB+ benchmark suite (which includes VTAB and robustness datasets) and on COCO/Flickr for retrieval is documented in the OpenCLIP benchmark repository. Per-dataset results across 38 datasets are available in the OpenCLIP results CSV.
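For readers unfamiliar with the metric, top-1 accuracy is the fraction of images whose highest-scoring class matches the ground-truth label. The sketch below shows the computation on made-up scores; it is not the CLIP_benchmark code, and `top1_accuracy` is a hypothetical helper name.

```python
def top1_accuracy(score_rows, labels):
    """Fraction of samples whose highest-scoring class equals the true label."""
    if len(score_rows) != len(labels) or not labels:
        raise ValueError("score_rows and labels must be non-empty and equal in length")
    correct = 0
    for scores, label in zip(score_rows, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Toy per-class scores for four images over three classes (illustrative only).
scores = [
    [0.7, 0.2, 0.1],  # predicted class 0
    [0.1, 0.6, 0.3],  # predicted class 1
    [0.5, 0.3, 0.2],  # predicted class 0
    [0.2, 0.2, 0.6],  # predicted class 2
]
labels = [0, 1, 2, 2]

accuracy = top1_accuracy(scores, labels)  # 3 of the 4 predictions match
print(f"top-1 accuracy: {accuracy:.1%}")
```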

Supported Evaluation Tasks

Zero-shot classification
Zero-shot retrieval (image-to-text and text-to-image)
Linear probing
Captioning evaluation

For detailed per-dataset scores, refer to the OpenCLIP repository and the CLIP_benchmark suite.
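Zero-shot retrieval amounts to ranking one modality's embeddings by similarity to a query from the other, and it is often summarized as recall@k. The sketch below assumes cosine similarity and uses placeholder 3-dimensional vectors; `retrieve` and `recall_at_k` are hypothetical helpers, not functions from OpenCLIP or CLIP_benchmark.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb, gallery_embs, k=1):
    """Return indices of the k gallery items most similar to the query."""
    ranked = sorted(
        range(len(gallery_embs)),
        key=lambda i: cosine(query_emb, gallery_embs[i]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(query_embs, gallery_embs, targets, k):
    """Fraction of queries whose target index appears among the top-k results."""
    hits = sum(
        1 for query, target in zip(query_embs, targets)
        if target in retrieve(query, gallery_embs, k)
    )
    return hits / len(targets)


# Placeholder embeddings (assumption): caption i is meant to match image i.
image_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
caption_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.1, 0.5]]
targets = [0, 1, 2]

print(recall_at_k(caption_embs, image_embs, targets, k=1))
print(recall_at_k(caption_embs, image_embs, targets, k=2))
```

Swapping the roles of `caption_embs` and `image_embs` gives the image-to-text direction with the same code.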

best for

· Zero-shot classification on custom image categories without fine-tuning
· Image-text similarity search and retrieval in large datasets
· Linear probe fine-tuning for downstream image classification tasks
· Guiding text-to-image generation as a conditioning model

FAQ

What is the primary use of this model?

It is designed for zero-shot image classification and image-text retrieval, enabling predictions on unlabelled data without task-specific training.

How accurate is it on ImageNet?

It achieves 78.0% top-1 zero-shot accuracy on ImageNet-1k.

What data was it trained on?

The model was trained on the 2-billion-sample English subset of LAION-5B.

How can I use this model via the gigarouter API?

Once the model is live, send requests to the gigarouter OpenAI-compatible endpoint with your API key; the model accepts images and text for embedding or scoring.

Can this model be fine-tuned for downstream tasks?

Yes, it supports fine-tuning for image classification and linear probe tasks, as described in the model card.
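A linear probe keeps the encoder frozen and trains only a linear classifier on its output embeddings. The sketch below fits a binary logistic-regression head with plain gradient descent on made-up 2-dimensional features; the data, learning rate, and epoch count are illustrative assumptions, and `train_linear_probe` is a hypothetical helper rather than the procedure from the model card.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a binary logistic-regression head on fixed (frozen) feature vectors."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid probability of class 1
            err = p - y                     # gradient of the log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Full-batch gradient step using the mean gradient.
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the linear score is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Placeholder features (assumption): stand-ins for frozen image embeddings.
features = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_linear_probe(features, labels)
predictions = [predict(weights, bias, x) for x in features]
print(predictions)
```

Because only the small linear head is trained, the embeddings can be computed once and cached before probing.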

not yet live

We're benchmarking and onboarding CLIP ViT-H/14 as a hosted, OpenAI-compatible API.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336