CLIP ViT-H/14
laion/CLIP-ViT-H-14-laion2B-s32B-b79K
published Sep 2022 · updated Jan 2025
CLIP ViT-H/14 is a zero-shot image classification and image-text retrieval model, trained with OpenCLIP on the 2-billion-pair English subset of LAION-5B.
specs
| Spec | Value |
| --- | --- |
| Task | Zero-shot image classification, image and text retrieval |
| Architecture | CLIP ViT-H/14 |
| Training Data | LAION-2B English subset of LAION-5B |
| Zero-shot ImageNet-1k Top-1 | 78.0% |
| Framework | OpenCLIP |
about this model
CLIP ViT-H/14 (LAION-2B) is a zero-shot image classification model that maps images and text into a shared embedding space, enabling classification and retrieval without task-specific fine-tuning. It was trained on the 2-billion-sample English subset of LAION-5B using OpenCLIP, with compute provided by Stability AI.
Capabilities
The model supports zero-shot image classification, image-text retrieval, linear probing, and captioning evaluation. It can classify images into arbitrary categories defined by natural language prompts, making it suitable for open-vocabulary tasks.
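The classification step can be illustrated without loading the model. The sketch below assumes the image and each text prompt have already been encoded into the shared embedding space; the three-dimensional vectors are toy values, not real model outputs, and the `logit_scale` of 100 is an assumed temperature commonly used with CLIP-style models rather than a value taken from this model card. The predicted label is the prompt whose embedding has the highest cosine similarity with the image embedding.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Pick the label whose prompt embedding is closest to the image embedding.

    logit_scale is an assumed temperature, not a value from the model card.
    """
    image = l2_normalize(image_emb)
    similarities = [
        sum(a * b for a, b in zip(image, l2_normalize(text)))
        for text in text_embs
    ]
    probs = softmax([logit_scale * s for s in similarities])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy embeddings (hypothetical values standing in for encoder outputs).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

prediction, probs = zero_shot_classify(image_emb, text_embs, labels)
print(prediction)
```

Because the categories are just text prompts, changing the label set only requires encoding new prompts; no retraining is involved.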
Training Data
Training used the 2-billion-sample English subset of LAION-5B, an uncurated dataset of image–alt-text pairs crawled from the public internet. The dataset metadata is licensed under CC-BY 4.0; images remain under their original copyright. Because the data is uncurated, it may contain harmful content, and the model is intended for research use only.
Benchmark Performance
The model achieves 78.0% zero-shot top-1 accuracy on ImageNet-1k. Evaluation on the VTAB+ benchmark suite (which includes VTAB and robustness datasets) and on COCO/Flickr for retrieval is documented in the OpenCLIP benchmark repository. Per-dataset results across 38 datasets are available in the OpenCLIP results CSV.
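For readers unfamiliar with the metric, top-1 accuracy is the fraction of images whose highest-scoring class matches the ground-truth label. The following is a minimal sketch of that calculation; the score rows and labels are made-up toy data, not ImageNet results, and `top1_accuracy` is a hypothetical helper name.

```python
def top1_accuracy(score_rows, true_indices):
    """Fraction of samples whose highest-scoring class equals the true class."""
    correct = 0
    for scores, truth in zip(score_rows, true_indices):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == truth:
            correct += 1
    return correct / len(true_indices)


# Toy per-class scores for four images over three classes (hypothetical data).
scores = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.40, 0.35, 0.25],
    [0.10, 0.20, 0.70],
]
truth = [0, 1, 1, 2]

accuracy = top1_accuracy(scores, truth)
print(f"top-1 accuracy: {accuracy:.2%}")
```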
Supported Evaluation Tasks
- Zero-shot classification
- Zero-shot retrieval (image-to-text and text-to-image)
- Linear probing
- Captioning evaluation
For detailed per-dataset scores, refer to the OpenCLIP repository and the CLIP_benchmark suite.
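Retrieval is usually scored with Recall@k: the share of queries whose correct match appears among the top k ranked candidates. The sketch below assumes a square similarity matrix in which row i is an image, column j is a text, and the matching pair sits on the diagonal. That one-caption-per-image layout is a simplification (datasets such as COCO pair each image with several captions), and the similarity values are toy numbers.

```python
def recall_at_k(similarity, k):
    """Share of queries (rows) whose matching item (same index) ranks in the top k."""
    hits = 0
    for query_index, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap rows and columns so texts become the queries."""
    return [list(column) for column in zip(*matrix)]


# Toy image-by-text similarity matrix (hypothetical values).
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.4, 0.8],
    [0.2, 0.7, 0.6],
]

image_to_text_r1 = recall_at_k(sim, 1)
image_to_text_r2 = recall_at_k(sim, 2)
text_to_image_r1 = recall_at_k(transpose(sim), 1)
print(image_to_text_r1, image_to_text_r2, text_to_image_r1)
```

Transposing the matrix is all that is needed to switch between image-to-text and text-to-image retrieval, since both directions rank the same pairwise similarities.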
best for
- Zero-shot classification on custom image categories without fine-tuning
- Image-text similarity search and retrieval in large datasets
- Linear probe fine-tuning for downstream image classification tasks
- Guiding text-to-image generation as a conditioning model
FAQ
CLIP ViT-H/14 is designed for zero-shot image classification and image-text retrieval, enabling predictions on unlabelled data without task-specific training.
CLIP ViT-H/14 achieves 78.0% zero-shot top-1 accuracy on ImageNet-1k.
The model was trained on the 2-billion-sample English image-text subset of LAION-5B.
Send requests to the gigarouter OpenAI-compatible endpoint with your API key; the model accepts images and text for embedding or scoring.
The model can be fine-tuned for image classification and used for linear probing, as described in the model card.
We're benchmarking and onboarding CLIP ViT-H/14 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.