CLIP ViT-Base Patch32

Xenova/clip-vit-base-patch32

published May 2023 · updated Jul 2025

CLIP ViT-Base Patch32 is a zero-shot image classification model that uses a vision transformer and text encoder to match images to textual descriptions without task-specific training.

status

coming soon

API providers

downloads / mo

154.7K

specs

Task	Zero-shot Image Classification
Architecture	ViT-B/32 (Vision Transformer with 32x32 patch size) + Text Transformer
Training Data	400 million (image, text) pairs from the internet
Release Date	January 2021
License	Not specified (research output)
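
The patch size in the Architecture row determines how many tokens the image encoder processes. A minimal sketch of that arithmetic follows, assuming a 224x224 input resolution (not stated on this page) and one extra class token. `vit_token_count` is a hypothetical helper name used only for illustration.

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Tokens seen by a ViT: one per non-overlapping patch, plus one class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


# Assumption: 224x224 input; this page does not state the resolution.
tokens_patch32 = vit_token_count(224, 32)  # 7 * 7 patches + 1 class token
tokens_patch14 = vit_token_count(224, 14)  # 16 * 16 patches + 1 class token
print(tokens_patch32, tokens_patch14)
```

Under these assumptions a patch-32 encoder handles far fewer tokens per image than a patch-14 encoder, which is one reason the smaller variants are cheaper to run.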

about this model

Xenova/clip-vit-base-patch32 is a zero-shot image classification model that maps images and text into a shared embedding space, enabling classification without task-specific training data. It uses a ViT-B/32 Vision Transformer (patch size 32x32) as the image encoder and a masked self-attention Transformer as the text encoder, trained via contrastive loss on 400 million (image, text) pairs collected from the internet.
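
The shared embedding space can be illustrated with a small sketch: each candidate label is scored by cosine similarity to the image embedding, and a softmax turns the scaled similarities into probabilities. The 3-dimensional vectors below are made-up toy values, not real model outputs, and the scale factor of 100 is an assumption rather than a value taken from this page.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_scores(image_emb, label_embs, logit_scale=100.0):
    """Probability per label from scaled cosine similarity (logit_scale is assumed)."""
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) for t in label_embs]
    return softmax([logit_scale * s for s in sims])


# Toy embeddings standing in for encoder outputs (illustrative values only).
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
labels = list(label_embs)
scores = zero_shot_scores(image_emb, [label_embs[name] for name in labels])
best = labels[scores.index(max(scores))]
print(best)
```

Because the label set is just a list of text embeddings, swapping in new categories requires no retraining, only new text inputs.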

The CLIP paper (arXiv:2103.00020) reports that CLIP can match the accuracy of the original ResNet-50 on ImageNet zero-shot, without using any of the 1.28 million labeled training examples. That headline result comes from the largest model in the paper; the smaller ViT-B/32 variant scores lower. The benchmark still highlights the approach's ability to generalize across diverse visual concepts from natural language descriptions alone.

Once live on gigarouter, it will be served as an OpenAI-compatible API with no installation or environment setup required. The weights are an ONNX conversion of the original OpenAI CLIP ViT-B/32 release (January 2021) and are compatible with Transformers.js workflows. The model has accumulated over 62 million downloads on Hugging Face.

Gigarouter is benchmarking this model for production zero-shot image tasks, with the aim of consistent latency and throughput without the need to manage infrastructure or conversion pipelines.

best for

·Classifying images into any set of custom categories without retraining
·Retrieving images from a database using natural language queries
·Building visual search or recommendation systems with flexible label sets
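
The retrieval use case above can be sketched as ranking stored image embeddings against a text query embedding by cosine similarity. The file names and vectors are invented for illustration, and `search` is a hypothetical helper, not part of any API described on this page.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_emb, index, top_k=2):
    """Return the top_k (name, similarity) pairs, most similar first."""
    scored = [(name, cosine(query_emb, emb)) for name, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy image index: in practice each vector would come from the image encoder.
index = {
    "beach.jpg": [0.9, 0.1, 0.2],
    "city.jpg": [0.1, 0.9, 0.3],
    "forest.jpg": [0.2, 0.2, 0.9],
}

# Toy embedding standing in for an encoded text query.
query_emb = [0.8, 0.2, 0.1]
results = search(query_emb, index)
print(results)
```

Image embeddings can be computed once and stored, so each new query only needs one text encoding plus a similarity scan.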

FAQ

What is the input format for the API?

The API accepts an image URL or base64-encoded image and a list of candidate text labels; it returns scores for each label.
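
A minimal sketch of preparing such an input and reading the result. The field names `image`, `candidate_labels`, and `scores` are assumptions chosen for illustration; this page does not document the actual request or response schema.

```python
import base64
import json


def build_payload(image_bytes: bytes, labels: list[str]) -> str:
    """Serialize a base64-encoded image and candidate labels (hypothetical schema)."""
    if not labels:
        raise ValueError("at least one candidate label is required")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image": encoded, "candidate_labels": labels})


def top_label(response_json: str) -> str:
    """Pick the highest-scoring label from a hypothetical {"scores": {...}} response."""
    scores = json.loads(response_json)["scores"]
    return max(scores, key=scores.get)


# Placeholder bytes standing in for a real image file's contents.
payload = build_payload(b"\x89PNG placeholder", ["cat", "dog"])

# A made-up response, shaped only to exercise the parsing helper.
example_response = json.dumps({"scores": {"cat": 0.92, "dog": 0.08}})
print(top_label(example_response))
```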

How does this model compare to ResNet-50 on ImageNet zero-shot?

The CLIP paper reports that its best model matches the accuracy of the original ResNet-50 on ImageNet zero-shot, without using any training examples from that dataset. ViT-B/32 is a smaller variant and scores lower.

What is the model size in parameters?

The original CLIP ViT-B/32 has approximately 151 million parameters (86M vision, 65M text). This ONNX version keeps the same architecture and is packaged for web inference.
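
As a rough back-of-the-envelope check, the parameter split above can be turned into an approximate weight size. The bytes-per-parameter values are assumptions about storage precision, and real ONNX files will differ because of graph metadata and any quantization applied.

```python
VISION_PARAMS = 86_000_000  # approximate, from the answer above
TEXT_PARAMS = 65_000_000    # approximate, from the answer above


def approx_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight size in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bytes_per_param / 1_000_000


total_params = VISION_PARAMS + TEXT_PARAMS
fp32_mb = approx_size_mb(total_params, 4)  # assumes 32-bit floats
int8_mb = approx_size_mb(total_params, 1)  # assumes 8-bit quantized weights
print(total_params, fp32_mb, int8_mb)
```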

How can I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key, sending a POST request with the image and candidate labels.
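
A hedged sketch of what such a POST request might look like, using only the Python standard library. The URL is a placeholder, the payload fields are hypothetical, and the Bearer-token header is an assumption based on the OpenAI convention; none of these are confirmed by this page. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical placeholder; the real endpoint is not given on this page.
API_URL = "https://api.example.com/v1/classify"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request with Bearer authentication."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed OpenAI-style auth header
        "Content-Type": "application/json",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


request = build_request(
    "YOUR_API_KEY",
    # Hypothetical field names, shown only to give the request a body.
    {"image": "https://example.com/cat.jpg", "candidate_labels": ["cat", "dog"]},
)
print(request.get_method(), request.full_url)

# Sending would be urllib.request.urlopen(request); it is omitted here
# because the endpoint above is a placeholder.
```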

Is the model free to use for commercial applications?

The original model was released as a research output with no explicit license; consult the model card for restrictions.

not yet live

We're benchmarking and onboarding CLIP ViT-Base Patch32 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336