ViT-B-16 SigLIP2 256

timm/ViT-B-16-SigLIP2-256

published Feb 2025 · updated Feb 2025

ViT-B-16 SigLIP2 256 is a zero-shot image classification model that uses contrastive image-text matching for classification and retrieval.

status: coming soon
API providers: 0
downloads / mo: 156.4K
license: apache-2.0

specs

Task: Zero-Shot Image Classification
Architecture: ViT-B-16
Parameters: 86M
License: apache-2.0

about this model

ViT-B-16-SigLIP2-256 is a zero-shot image classification and image-text retrieval model based on the SigLIP 2 architecture and trained on the WebLI dataset with a contrastive sigmoid loss. It uses the Gemma tokenizer (256k vocabulary) and has 86M parameters. gigarouter is onboarding this model as a managed API; once live, its endpoints will return classification or similarity scores for image-text pairs without requiring local infrastructure.
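To make the "contrastive sigmoid loss" concrete, here is a minimal pure-Python sketch of the pairwise sigmoid objective as described in the original SigLIP paper: every image-text combination in a batch is treated as an independent binary decision. It is an illustration only, not the training code; SigLIP 2 adds further loss terms that are not shown, and the `scale` and `bias` values in the usage note are made up (in the real model they are learned).

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)), computed as -softplus(-x)."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pairwise_sigmoid_loss(image_feats, text_feats, scale, bias):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    image_feats[i] and text_feats[i] form a matching pair (label +1);
    every other combination is treated as a negative (label -1).
    The summed binary log-loss is divided by the batch size.
    """
    n = len(image_feats)
    total = 0.0
    for i, img in enumerate(image_feats):
        for j, txt in enumerate(text_feats):
            label = 1.0 if i == j else -1.0
            logit = scale * dot(img, txt) + bias
            total -= log_sigmoid(label * logit)
    return total / n


# Toy batch of two pairs with hand-written unit vectors (illustrative only).
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = pairwise_sigmoid_loss(images, texts_aligned, scale=10.0, bias=-5.0)
loss_swapped = pairwise_sigmoid_loss(images, texts_swapped, scale=10.0, bias=-5.0)
```

With these toy inputs the loss is small when each image lines up with its own caption and large when the captions are swapped, which is the behavior the objective rewards during training.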

Key benchmark results from the SigLIP 2 paper and Big Vision repository:

Task                                 Metric           Score
ImageNet zero-shot classification    Top-1 accuracy   79.1%
COCO text-to-image retrieval         Recall@1         53.2%
COCO image-to-text retrieval         Recall@1         69.7%

Capabilities and improvements over SigLIP 1

SigLIP 2 models outperform their SigLIP counterparts at all model scales in zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). The unified training recipe incorporates captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. These additions yield significant gains on localization and dense prediction tasks. Furthermore, the diverse data mixture includes de-biasing techniques, resulting in better multilingual understanding and improved fairness.

The ViT-B-16 variant operates at a 256-pixel input resolution and is the smallest of four released sizes (86M, 303M, 400M, 1B parameters), offering a cost-effective balance of performance and inference speed.

FAQ

What is the model's ImageNet zero-shot accuracy?

79.1%.

What tokenizer does this model use?

It uses a Gemma tokenizer with a vocabulary size of 256k.

How does SigLIP 2 compare to the original SigLIP?

SigLIP 2 outperforms SigLIP at all model scales in zero-shot classification, retrieval, and VLM transfer.

What input format does the model expect?

It expects preprocessed images and tokenized text; the model outputs normalized image and text features for similarity scoring.
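As a rough illustration of how normalized features become scores, the sketch below scores one image embedding against several candidate captions with a sigmoid over the scaled dot product, then picks the best label. The feature vectors are hand-written stand-ins rather than real model outputs, the "a photo of a ..." prompt template is an assumption, and `scale` and `bias` are illustrative values (the real checkpoint ships learned ones).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def match_probabilities(image_feat, text_feats, scale, bias):
    """Independent match probability for each candidate text.

    Unlike a softmax, these probabilities need not sum to 1.
    """
    return [
        sigmoid(scale * sum(a * b for a, b in zip(image_feat, txt)) + bias)
        for txt in text_feats
    ]


def classify(image_feat, labels, text_feats, scale, bias):
    """Return the best-matching label and all per-label probabilities."""
    probs = match_probabilities(image_feat, text_feats, scale, bias)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical embeddings standing in for real encoder outputs.
labels = ["a photo of a cat", "a photo of a dog"]
text_feats = [l2_normalize([1.0, 0.0, 0.0]), l2_normalize([0.0, 1.0, 0.0])]
image_feat = l2_normalize([0.9, 0.1, 0.0])

label, probs = classify(image_feat, labels, text_feats, scale=10.0, bias=-5.0)
```

Because each caption is scored independently, it is possible for several labels, or none, to exceed 0.5 for a given image; taking the argmax is one simple way to turn the scores into a single zero-shot prediction.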

How can I call this model via the gigarouter API?

The model is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key and specify the model name ViT-B-16-SigLIP2-256.

not yet live

We're benchmarking and onboarding ViT-B-16 SigLIP2 256 as a hosted, OpenAI-compatible API.