CLIP ViT-Base Patch32
Xenova/clip-vit-base-patch32
published May 2023 · updated Jul 2025
CLIP ViT-Base Patch32 is a zero-shot image classification model that uses a vision transformer and text encoder to match images to textual descriptions without task-specific training.
specs
| Task | Zero-shot Image Classification |
| Architecture | ViT-B/32 (Vision Transformer with 32x32 patch size) + Text Transformer |
| Training Data | 400 million (image, text) pairs from the internet |
| Release Date | January 2021 |
| License | Not specified (research output) |
about this model
Xenova/clip-vit-base-patch32 is a zero-shot image classification model that maps images and text into a shared embedding space, enabling classification without task-specific training data. It uses a ViT-B/32 Vision Transformer (patch size 32x32) as the image encoder and a masked self-attention Transformer as the text encoder, trained via contrastive loss on 400 million (image, text) pairs collected from the internet.
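The shared embedding space described above can be illustrated with a minimal sketch of the scoring step: normalize the image and label embeddings, take cosine similarities, scale them, and apply a softmax over the candidate labels. The toy 3-dimensional vectors, the function names, and the `logit_scale` value of 100 are assumptions for illustration; real embeddings come from the model's image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_probs(image_emb, label_embs, logit_scale=100.0):
    """Return one probability per candidate label for a single image.

    logit_scale is an assumed temperature; the trained model uses a learned value.
    """
    img = l2_normalize(image_emb)
    logits = []
    for emb in label_embs:
        txt = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Numerically stable softmax over the candidate labels.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings standing in for encoder outputs (hypothetical values).
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}

probs = zero_shot_probs(image_emb, list(label_embs.values()))
best_label = max(zip(label_embs, probs), key=lambda pair: pair[1])[0]
print(best_label, probs)
```

Because the labels are ordinary text, changing the category set only means embedding a different list of strings; nothing is retrained.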
The CLIP paper (arXiv:2103.00020) reports that CLIP matches the accuracy of the original ResNet-50 on ImageNet in a zero-shot setting, without using any of the 1.28 million labeled training examples that ResNet-50 was trained on. That headline result describes the CLIP model family; consult the paper's per-model results for figures specific to ViT-B/32. It highlights the model’s ability to generalize across diverse visual concepts from natural language descriptions alone.
As a specialist model hosted by Gigarouter, it is served as an OpenAI-compatible API with no installation or environment setup required. The ONNX weights used here are converted from the original OpenAI CLIP ViT-B/32 release (January 2021) and are compatible with Transformers.js workflows. The model has accumulated over 62 million downloads on Hugging Face.
Gigarouter benchmarks and hosts this model for production zero-shot image tasks, providing consistent latency and throughput without the need to manage infrastructure or conversion pipelines.
best for
- Classifying images into any set of custom categories without retraining
- Retrieving images from a database using natural language queries
- Building visual search or recommendation systems with flexible label sets
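The retrieval use case in the list above can be sketched as ranking stored image embeddings by cosine similarity to a text query embedding. The file names, the 3-dimensional vectors, and the `search` helper are hypothetical; in practice the index would hold embeddings produced by the image encoder and the query vector would come from the text encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, index, top_k=2):
    """Return the top_k (image_id, score) pairs, best match first."""
    scored = [(image_id, cosine(query_emb, emb)) for image_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical precomputed image embeddings keyed by file name.
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Stand-in for the text embedding of a query such as "a sunny beach".
query_emb = [1.0, 0.2, 0.0]

results = search(query_emb, index)
print(results)
```

Image embeddings can be computed once and stored, so each new query costs one text embedding plus a similarity scan.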
FAQ
The API accepts an image URL or base64-encoded image and a list of candidate text labels; it returns scores for each label.
According to the CLIP paper, CLIP matches the zero-shot ImageNet accuracy of the original ResNet-50 without using any of that dataset's 1.28 million labeled training examples.
The original CLIP ViT-B/32 has approximately 151 million parameters (roughly 88M in the vision encoder and 63M in the text encoder). This ONNX version is packaged for web inference.
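As a rough worked example, the parameter count translates into weight size as parameters times bytes per parameter. The sketch below uses the approximate 151 million total; which precisions are actually published or served is not stated here, so the dtype list is an assumption, and file-format overhead is ignored.

```python
PARAMS = 151_000_000  # approximate total parameter count

# Bytes per parameter for some common precisions (assumed options).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_megabytes(params, dtype):
    """Approximate weight size in decimal megabytes for a given precision."""
    return params * BYTES_PER_PARAM[dtype] / 1_000_000

sizes = {dtype: weight_megabytes(PARAMS, dtype) for dtype in BYTES_PER_PARAM}
for dtype, megabytes in sizes.items():
    print(f"{dtype}: ~{megabytes:.0f} MB")
```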
Use the OpenAI-compatible endpoint with your API key, sending a POST request with the image and candidate labels.
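A minimal sketch of such a request follows. The endpoint URL and the JSON field names (`model`, `image`, `candidate_labels`) are assumptions, since the exact schema is not documented here; check the API reference before relying on them. The sketch only builds the request so that it runs offline.

```python
import base64
import json
import urllib.request

API_URL = "https://api.example.com/v1/classify"  # hypothetical endpoint
MODEL_ID = "Xenova/clip-vit-base-patch32"

def build_payload(labels, image_url=None, image_bytes=None):
    """Build a request body from either an image URL or raw image bytes.

    Field names are illustrative assumptions, not a documented schema.
    """
    if (image_url is None) == (image_bytes is None):
        raise ValueError("provide exactly one of image_url or image_bytes")
    if image_url is not None:
        image = image_url
    else:
        image = base64.b64encode(image_bytes).decode("ascii")
    return {"model": MODEL_ID, "image": image, "candidate_labels": list(labels)}

def build_request(payload, api_key):
    """Wrap the payload in a POST request with a bearer-token header."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

payload = build_payload(["cat", "dog"], image_url="https://example.com/cat.jpg")
request = build_request(payload, api_key="YOUR_API_KEY")

# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
print(request.get_method(), request.full_url)
```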
The original model was released as a research output with no explicit license; consult the model card for restrictions.
We're benchmarking and onboarding CLIP ViT-Base Patch32 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.