ViT-B-16 SigLIP2 256
timm/ViT-B-16-SigLIP2-256
published Feb 2025 · updated Feb 2025
ViT-B-16 SigLIP2 256 is a contrastive image-text model for zero-shot image classification and image-text retrieval.
specs
| Field | Value |
|---|---|
| Task | Zero-Shot Image Classification |
| Architecture | ViT-B-16 |
| Parameters | 86M |
| License | Unknown |
about this model
ViT-B-16-SigLIP2-256 is a zero-shot image classification and image-text retrieval model that builds on the SigLIP 2 architecture, trained on the WebLI dataset with a contrastive sigmoid loss. It uses a Gemma tokenizer (256k vocabulary) and has 86M parameters. This model is hosted as a managed API on gigarouter, providing endpoints that return classification or similarity scores for image-text pairs without requiring local infrastructure.
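To make the contrastive sigmoid loss concrete, here is a minimal pure-Python sketch of the pairwise formulation as commonly described for SigLIP: each image-text pair in a batch is treated as an independent binary classification, with matching pairs labeled +1 and all others -1. The function names, the toy embeddings, and the temperature and bias values are illustrative assumptions, not the training code used for this model.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)) for both signs of x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_loss(img, txt, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    img[i] and txt[i] form a matching pair; every other combination
    is treated as a negative. t (temperature) and b (bias) are learned
    during training; the defaults here are illustrative assumptions.
    """
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * c for a, c in zip(img[i], txt[j]))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (t * dot + b))
    return -total / n


# Toy batch: two orthogonal unit vectors, already normalized.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = sigmoid_loss(images, aligned_texts)
swapped_loss = sigmoid_loss(images, swapped_texts)
```

In this toy batch the correctly aligned pairs produce a much lower loss than the swapped pairs, which is the behavior the loss is meant to reward.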
Key benchmark results from the SigLIP 2 paper and Big Vision repository:
| Task | Metric | Score |
|---|---|---|
| ImageNet zero-shot classification | Top-1 accuracy | 79.1% |
| COCO text-to-image retrieval | Recall@1 | 53.2 |
| COCO image-to-text retrieval | Recall@1 | 69.7 |
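
For readers unfamiliar with the retrieval metric in the table, Recall@1 is the percentage of queries whose top-ranked candidate is a correct match. The sketch below computes it from a small similarity matrix. It assumes a simplified one-to-one pairing where the correct candidate for query `q` sits at index `q`; COCO itself pairs each image with several captions, so this is not the exact evaluation protocol behind the reported numbers. The function name and the matrix values are hypothetical.

```python
def recall_at_1(sim):
    """Return Recall@1 as a percentage.

    sim[q][c] is the similarity between query q and candidate c.
    Assumes (for illustration) that candidate q is the only correct
    match for query q.
    """
    hits = 0
    for q, row in enumerate(sim):
        # Index of the highest-scoring candidate for this query.
        best = max(range(len(row)), key=row.__getitem__)
        if best == q:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy text-to-image matrix: rows are captions, columns are images.
text_to_image = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.4],  # caption 2 wrongly ranks image 0 first
]

# Image-to-text retrieval uses the transposed matrix.
image_to_text = [list(col) for col in zip(*text_to_image)]

t2i_recall = recall_at_1(text_to_image)
i2t_recall = recall_at_1(image_to_text)
```

Because the two directions rank over different axes of the same matrix, their Recall@1 values can differ, as they do in the table above.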
Capabilities and improvements over SigLIP 1
SigLIP 2 models outperform their SigLIP counterparts at all model scales in zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). The unified training recipe incorporates captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. These additions yield significant gains on localization and dense prediction tasks. Furthermore, the diverse data mixture includes de-biasing techniques, resulting in better multilingual understanding and improved fairness.
The ViT-B-16 variant operates at a 256-pixel input resolution and is the smallest of four released sizes (86M, 303M, 400M, 1B parameters), offering a cost-effective balance of performance and inference speed.
best for
- Zero-shot image classification without task-specific training
- Image-text retrieval (text-to-image and image-to-text)
- Multilingual vision-language understanding
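
Zero-shot classification, the first use case listed, reduces to comparing one image embedding against one text embedding per candidate label. The following pure-Python sketch shows that scoring step only. The embeddings are toy placeholders standing in for encoder outputs, and `zero_shot_scores`, `scale`, and `bias` are hypothetical names: a sigmoid-trained model typically applies a learned scale and bias before the sigmoid, but the values below are assumptions chosen for illustration.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_scores(image_feat, text_feats, scale=10.0, bias=0.0):
    """Score one image against several label embeddings.

    Returns one independent sigmoid probability per label; unlike a
    softmax, the scores are not required to sum to 1.
    """
    img = l2_normalize(image_feat)
    scores = []
    for feat in text_feats:
        txt = l2_normalize(feat)
        cosine = sum(a * b for a, b in zip(img, txt))
        logit = scale * cosine + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Hypothetical labels and placeholder embeddings (not real model outputs).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image_embedding = [1.0, 0.0]
label_embeddings = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]

scores = zero_shot_scores(image_embedding, label_embeddings)
predicted = labels[max(range(len(scores)), key=scores.__getitem__)]
```

With real embeddings the same step applies unchanged: encode the image once, encode each label prompt, and pick the label with the highest score.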
FAQ
The model reaches 79.1% top-1 accuracy on ImageNet zero-shot classification.
The model uses a Gemma tokenizer with a vocabulary size of 256k.
SigLIP 2 outperforms SigLIP at all model scales in zero-shot classification, retrieval, and VLM transfer.
The model expects preprocessed images and tokenized text, and it outputs normalized image and text features for similarity scoring.
Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name ViT-B-16-SigLIP2-256.
We're benchmarking and onboarding ViT-B-16 SigLIP2 256 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.