ViT B 16 SigLIP
timm/ViT-B-16-SigLIP
published Oct 2023 · updated Oct 2023
ViT B 16 SigLIP is a zero-shot image classification model: it scores an image by comparing its embedding to the embeddings of candidate text prompts. It was pre-trained on image-text pairs with a sigmoid loss.
specs
| Task | Zero-Shot Image Classification (Contrastive Image-Text) |
| Architecture | ViT-B/16 (Vision Transformer, patch size 16) |
| Dataset | WebLI |
| License | Apache 2.0 |
about this model
ViT-B-16-SigLIP is a zero-shot image classification model that uses a sigmoid loss for language-image pre-training (SigLIP), enabling contrastive matching between images and arbitrary text labels without fine-tuning.
Model details
Trained on the WebLI dataset and converted to PyTorch from the original JAX checkpoints, the model is designed for zero-shot classification and image-text retrieval. Its sigmoid loss operates on individual image-text pairs, so it does not need the global, batch-wide normalization of pairwise similarities that softmax-based contrastive losses require. This decouples the loss from the batch size: the SigLIP paper reports that it performs well at smaller batch sizes, that a batch size of 32k is sufficient, and that gains diminish quickly as the batch size is pushed toward one million.
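The pairwise structure of the loss can be sketched in plain Python. This is a minimal illustration, not the training code: the function names are hypothetical, the embeddings are toy vectors, and the temperature of 10 and bias of -10 are assumed here to match the initial values described in the SigLIP paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sketch of a pairwise sigmoid loss over a batch of image-text pairs.

    Pair (i, i) is a positive (label +1); every other pair (i, j) is a
    negative (label -1). Each pair contributes its own binary term, so no
    softmax over the whole batch is needed.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, image in enumerate(images):
        for j, text in enumerate(texts):
            similarity = sum(a * b for a, b in zip(image, text))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    # Average over the number of images in the batch.
    return -total / len(images)


# Toy batch: aligned pairs should give a lower loss than swapped pairs.
aligned_loss = sigmoid_pairwise_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = sigmoid_pairwise_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

Because every term depends on a single image-text pair, the double loop could be split across devices without first gathering a batch-wide normalization constant, which is the property the paragraph above describes.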
Key strengths and results
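- The SigLIP paper (ICCV 2023 Oral) reports 84.5% ImageNet zero-shot accuracy for a SigLiT model, which combines the sigmoid loss with Locked-image Tuning and was trained in two days on only four TPUv4 chips. That figure describes the SigLiT setup rather than this specific ViT-B/16 checkpoint.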
- Licensed under Apache 2.0, suitable for commercial and research use.
- Widely adopted: over 101,000 downloads on Hugging Face as of early 2025.
Once hosted as an API on gigarouter, the model can be used for zero-shot classification: provide an image and a list of candidate labels to obtain a sigmoid-normalized probability for each label.
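To show what "sigmoid-normalized" means in practice, the sketch below contrasts per-label sigmoid scores with softmax scores. The labels and logits are made up for illustration, and the helper names are hypothetical; real logits would come from the model's image-text similarity.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_probabilities(logits):
    """Score each label independently; the results need not sum to 1."""
    return [sigmoid(z) for z in logits]


def softmax_probabilities(logits):
    """Score labels jointly; the results always sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate labels and logits, for illustration only.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
logits = [2.0, -1.0, -4.0]

sig = sigmoid_probabilities(logits)
soft = softmax_probabilities(logits)
for label, p_sig, p_soft in zip(labels, sig, soft):
    print(f"{label}: sigmoid={p_sig:.3f} softmax={p_soft:.3f}")
```

With sigmoid scoring, each label gets its own yes/no probability, so several labels can score high at once, or none at all if nothing in the list matches the image.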
best for
- Zero-shot classification of images with custom text labels
- Image-text similarity scoring for retrieval
- Building multimodal search applications
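For the retrieval and search use cases, a common pattern is to rank stored image embeddings by cosine similarity to a text query embedding. The sketch below assumes the embeddings have already been computed; the file names, helper names, and 3-dimensional vectors are placeholders, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for a text query embedding and an image index.
text_query = [0.9, 0.1, 0.0]
image_index = {
    "beach.jpg": [0.8, 0.2, 0.1],
    "city.jpg": [0.1, 0.9, 0.2],
    "forest.jpg": [0.0, 0.2, 0.9],
}

ranking = rank_by_similarity(text_query, image_index)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The same function works in the other direction (an image embedding as the query against a set of text embeddings), since both modalities share one embedding space.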
FAQ
The model expects images resized to 224x224 pixels.
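Together with the patch size of 16 listed in the specs, the 224x224 input size determines how many patches the Vision Transformer sees. A small sketch of that arithmetic follows; the helper name is hypothetical, and it counts image patches only, ignoring any extra tokens an implementation may add.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, num_patches = patch_grid(224, 16)
print(per_side, num_patches)
```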
The model is released under the Apache 2.0 license.
SigLIP uses a sigmoid loss instead of softmax contrastive loss, enabling training without global pairwise normalization and allowing larger batch sizes.
Once the model is live, you can call gigarouter's OpenAI-compatible endpoint with your API key, providing an image and text prompts for zero-shot classification.
The model was trained on the WebLI dataset (Web Language-Image).
We're benchmarking and onboarding ViT B 16 SigLIP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.