ViT-SO400M-14 SigLIP
timm/ViT-SO400M-14-SigLIP
published Oct 2023 · updated Oct 2023
ViT-SO400M-14 SigLIP is a zero-shot image classification model trained with a sigmoid loss for language-image pre-training. It scores images against arbitrary text labels in a shared image-text embedding space.
specs
| Task | Zero-Shot Image Classification |
| Architecture | ViT-SO400M-14 (Vision Transformer with patch size 14) |
| Parameters | 400M |
| License | Unknown (not specified in card) |
about this model
Model Architecture and Training
This model implements the SigLIP approach, which replaces the softmax-normalized contrastive loss with a pairwise sigmoid loss. The loss treats each image-text pair independently, so it needs no global view of the pairwise similarities for normalization and trains effectively at both small and large batch sizes. The model was trained on the WebLI dataset. Its weights were converted from the original JAX checkpoints in Big Vision to PyTorch and are usable in both OpenCLIP and timm.
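The pairwise sigmoid loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Big Vision implementation: the helper names (`log_sigmoid`, `dot`, `sigmoid_loss`) and the toy embeddings are hypothetical, and the temperature `t=10.0` and bias `b=-10.0` are the initial values given in the SigLIP paper (both are learned during training, so the released checkpoint will hold different values).

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid_loss(image_embs, text_embs, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    Pair (i, j) is labeled +1 when i == j (matching image and text)
    and -1 otherwise. Each pair contributes its own binary term, so no
    softmax normalization across the batch is needed.
    """
    n = len(image_embs)
    total = 0.0
    for i, x in enumerate(image_embs):
        for j, y in enumerate(text_embs):
            z = 1.0 if i == j else -1.0
            total += log_sigmoid(z * (t * dot(x, y) + b))
    return -total / n

# Toy batch (assumed data): two unit-length embeddings per modality.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]   # text i matches image i
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]   # text i matches the other image

loss_aligned = sigmoid_loss(images, texts_aligned)
loss_swapped = sigmoid_loss(images, texts_swapped)
print(loss_aligned < loss_swapped)
```

With the toy data, the correctly aligned batch yields the lower loss, which is the behavior the training objective rewards.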
Key Strengths
- Efficient training: The SigLIP paper reports that, combined with Locked-image Tuning, a SigLiT model reaches 84.5% ImageNet zero-shot accuracy after two days of training on only four TPUv4 chips.
- Flexible batch scaling: The sigmoid loss decouples the loss computation from the batch size, enabling training with batch sizes up to one million, though the paper finds a batch size of about 32k sufficient and the benefits diminish beyond it.
- Improved small-batch performance: The sigmoid loss performs better than softmax-based approaches at smaller batch sizes.
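One way to see why the sigmoid loss is decoupled from the batch size: every image-text pair contributes an independent term, so the similarity matrix can be processed block by block and the partial sums simply added. The sketch below illustrates that property under stated assumptions; `pair_term`, `full_sum`, `chunked_sum`, and the toy similarity matrix are hypothetical names and data, and the `t` and `b` defaults are the paper's initial values rather than those of the trained checkpoint.

```python
import math

def pair_term(sim, is_match, t=10.0, b=-10.0):
    """Negative log-sigmoid for one image-text pair; depends on no other pair."""
    z = 1.0 if is_match else -1.0
    x = z * (t * sim + b)
    # Stable softplus(-x), which equals -log(sigmoid(x)).
    return math.log1p(math.exp(-abs(x))) + max(-x, 0.0)

def full_sum(sims):
    """Sum the pair terms over the whole similarity matrix at once."""
    n = len(sims)
    return sum(pair_term(sims[i][j], i == j) for i in range(n) for j in range(n))

def chunked_sum(sims, chunk):
    """Sum the same pair terms one (chunk x chunk) block at a time."""
    n = len(sims)
    total = 0.0
    for r0 in range(0, n, chunk):
        for c0 in range(0, n, chunk):
            for i in range(r0, min(r0 + chunk, n)):
                for j in range(c0, min(c0 + chunk, n)):
                    total += pair_term(sims[i][j], i == j)
    return total

# Toy cosine similarities (assumed data): rows are images, columns are texts.
sims = [
    [0.9, 0.1, 0.0, 0.2],
    [0.2, 0.8, 0.1, 0.0],
    [0.0, 0.3, 0.7, 0.1],
    [0.1, 0.0, 0.2, 0.6],
]
print(abs(full_sum(sims) - chunked_sum(sims, 2)) < 1e-9)
```

A softmax-based contrastive loss cannot be split this way without extra bookkeeping, because each term's normalizer depends on every other pair in the batch.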
Benchmark Results
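The SigLIP paper reports 84.5% zero-shot accuracy on ImageNet for a SigLiT model trained with Locked-image Tuning. That figure is not specific to this checkpoint, but it indicates how well the approach handles open-vocabulary classification without task-specific training data.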
Usage Through gigarouter API
gigarouter offers ViT-SO400M-14-SigLIP as a hosted API for zero-shot classification. The model accepts images and arbitrary text labels and returns a probability score for each label. No local installation or model loading is required.
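To make the returned scores concrete, here is a minimal sketch of how SigLIP-style label probabilities are derived from embeddings: the cosine similarity between the image and each label is scaled, shifted by a bias, and passed through a sigmoid. It does not call the gigarouter API or load the model; `l2_normalize`, `label_probabilities`, the three-dimensional toy embeddings, and the `logit_scale` and `logit_bias` defaults are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def label_probabilities(image_emb, label_embs, logit_scale=10.0, logit_bias=-10.0):
    """Score one image against each label independently.

    logit_scale and logit_bias are placeholder values; a trained model
    supplies its own learned scale and bias.
    """
    img = l2_normalize(image_emb)
    probs = {}
    for label, emb in label_embs.items():
        txt = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logit = logit_scale * cosine + logit_bias
        probs[label] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid, not softmax
    return probs

# Hypothetical embeddings standing in for real encoder outputs.
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a dog": [1.0, 0.0, 0.0],
    "a cat": [0.0, 1.0, 0.0],
    "a beignet": [0.0, 0.0, 1.0],
}
probs = label_probabilities(image_emb, label_embs)
best_label = max(probs, key=probs.get)
print(best_label)
```

Because each label passes through its own sigmoid, the scores are independent and need not sum to 1, unlike softmax-based CLIP probabilities.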
Citation
Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343.
best for
- Zero-shot classification of images using custom text labels
- Extracting image embeddings for downstream tasks
FAQ
SigLIP uses a pairwise sigmoid loss that does not require global normalization, allowing flexible batch sizes and better performance at smaller batches.
It was trained on WebLI.
You can use it via the gigarouter OpenAI-compatible endpoint with an API key, or locally with OpenCLIP or timm as shown in the model card.
The model expects 224x224 pixel images.
Yes, it supports both image and text encoding for contrastive tasks, though the timm version provides only image embeddings.
ViT-SO400M-14 SigLIP is being benchmarked and onboarded as a hosted, OpenAI-compatible API.