ViT-B-16 SigLIP2 256

timm/ViT-B-16-SigLIP2-256

published Feb 2025 · updated Feb 2025

ViT-B-16 SigLIP2 256 is a zero-shot image classification model that uses contrastive image-text matching for classification and retrieval.

status

coming soon

API providers

downloads / mo

156.4K

license

apache-2.0

specs

Task	Zero-Shot Image Classification
Architecture	ViT-B-16
Parameters	86M
License	apache-2.0

about this model

ViT-B-16-SigLIP2-256 is a zero-shot image classification and image-text retrieval model from the SigLIP 2 family, trained on the WebLI dataset with a contrastive sigmoid loss. It uses a Gemma tokenizer (256k vocabulary) and has 86M parameters. gigarouter is onboarding this model as a managed API; once live, its endpoints will return classification or similarity scores for image-text pairs without requiring local infrastructure.
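The contrastive sigmoid loss mentioned above can be illustrated with a small sketch. This is not the training code used for the model; it is a minimal pure-Python version of a pairwise sigmoid loss, assuming L2-normalized embeddings. The toy vectors and the `scale` and `bias` values are made up for illustration (in the real model these are learned parameters).

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_contrastive_loss(img, txt, scale, bias):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    img[i] and txt[i] form a matching pair; every other combination
    (i != j) is treated as a negative. Each pair is scored on its own,
    with no softmax over the batch.
    """
    n = len(img)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(img[i], txt[j]))
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (scale * sim + bias))
    return -total / n


# Toy batch of two pairs with 2-d unit vectors (illustrative values only).
img = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # text i matches image i
swapped = [[0.0, 1.0], [1.0, 0.0]]   # text i matches the wrong image

loss_aligned = sigmoid_contrastive_loss(img, aligned, scale=10.0, bias=-5.0)
loss_swapped = sigmoid_contrastive_loss(img, swapped, scale=10.0, bias=-5.0)
print(loss_aligned < loss_swapped)  # prints True
```

The loss is low when matching pairs have high similarity and non-matching pairs have low similarity, and it grows when the pairing is wrong.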

Key benchmark results from the SigLIP 2 paper and Big Vision repository:

Task	Metric	Score
ImageNet zero-shot classification	Top-1 accuracy	79.1%
COCO text-to-image retrieval	Recall@1	53.2%
COCO image-to-text retrieval	Recall@1	69.7%
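For readers unfamiliar with the Recall@1 metric in the table, the sketch below shows one way it can be computed from a similarity matrix. The function name `recall_at_1` and the scores are hypothetical and chosen for illustration; this is not the evaluation code behind the reported numbers. Relevant candidates are stored as sets because a query can have more than one correct match (for example, an image with several captions).

```python
def recall_at_1(similarity, relevant):
    """Fraction of queries whose top-scoring candidate is relevant.

    similarity[q][c] is the score of candidate c for query q;
    relevant[q] is the set of candidate indices that match query q.
    """
    hits = 0
    for q, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        if best in relevant[q]:
            hits += 1
    return hits / len(similarity)


# Made-up scores: 3 text queries against 3 candidate images.
scores = [
    [0.9, 0.1, 0.2],  # query 0 ranks image 0 first (correct)
    [0.3, 0.2, 0.8],  # query 1 ranks image 2 first (wrong, answer is 1)
    [0.1, 0.4, 0.7],  # query 2 ranks image 2 first (correct)
]
ground_truth = [{0}, {1}, {2}]

r1 = recall_at_1(scores, ground_truth)
print(f"Recall@1 = {r1:.3f}")
```

Here two of the three queries retrieve a correct item at rank 1, so Recall@1 is 2/3.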

Capabilities and improvements over SigLIP 1

SigLIP 2 models outperform their SigLIP counterparts at all model scales in zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). The unified training recipe incorporates captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. These additions yield significant gains on localization and dense prediction tasks. Training also uses a more diverse data mixture with de-biasing techniques, which improves multilingual understanding and fairness.

The ViT-B-16 variant operates at a 256-pixel input resolution and is the smallest of four released sizes (86M, 303M, 400M, 1B parameters), offering a cost-effective balance of performance and inference speed.

best for

·Zero-shot image classification without task-specific training
·Image-text retrieval (text-to-image and image-to-text)
·Multilingual vision-language understanding

FAQ

What is the model's ImageNet zero-shot accuracy?

79.1%.

What tokenizer does this model use?

It uses a Gemma tokenizer with a vocabulary size of 256k.

How does SigLIP 2 compare to the original SigLIP?

SigLIP 2 outperforms SigLIP at all model scales in zero-shot classification, retrieval, and VLM transfer.

What input format does the model expect?

It expects preprocessed images and tokenized text; the model outputs normalized image and text features for similarity scoring.
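As a rough illustration of similarity scoring on the output features, the sketch below turns one image feature and several text features into per-label scores. It assumes the features have already been produced by the model; the 3-d vectors, the labels, and the `scale` and `bias` values are invented for the example and do not come from the real checkpoint.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pair_probabilities(image_feat, text_feats, scale, bias):
    """sigmoid(scale * cosine + bias) for one image against each text."""
    image_feat = l2_normalize(image_feat)
    probs = []
    for t in text_feats:
        t = l2_normalize(t)
        cosine = sum(a * b for a, b in zip(image_feat, t))
        probs.append(1.0 / (1.0 + math.exp(-(scale * cosine + bias))))
    return probs


# Hypothetical candidate labels and feature vectors.
labels = ["a photo of a cat", "a photo of a dog"]
image_feat = [0.9, 0.1, 0.0]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

probs = pair_probabilities(image_feat, text_feats, scale=10.0, bias=-5.0)
best_label = labels[probs.index(max(probs))]
print(best_label)  # prints "a photo of a cat"
```

Because each image-text pair is scored with its own sigmoid, the scores are independent and need not sum to 1; for zero-shot classification the highest-scoring label is taken as the prediction.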

How can I call this model via the gigarouter API?

The model is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key and specify the model name ViT-B-16-SigLIP2-256.

not yet live

We're benchmarking and onboarding ViT-B-16 SigLIP2 256 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336