SigLIP 2 So400m

google/siglip2-so400m-patch16-256

published Feb 2025 · updated Feb 2025

SigLIP 2 So400m is a multilingual vision-language encoder for zero-shot image classification and image-text retrieval.

est. price

~$0.235

/ 1k images · estimated, set at launch
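As a rough budgeting aid, the estimated per-1,000-image price can be scaled to a batch size. This is a minimal sketch that assumes the ~$0.235 estimate holds at launch and that pricing is linear in image count; the function name is illustrative, not part of any API.

```python
def estimate_cost_usd(num_images: int, price_per_1k_usd: float = 0.235) -> float:
    """Scale a per-1,000-image price to num_images (assumes linear pricing)."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return num_images / 1000 * price_per_1k_usd


# Print estimates for a few batch sizes; the default price is the page's estimate.
for n in (1_000, 50_000, 1_000_000):
    print(f"{n:>9,} images: ~${estimate_cost_usd(n):,.2f}")
```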

downloads / mo

521.6K

license

apache-2.0

specs

Task	Zero-Shot Image Classification & Image-Text Retrieval
Architecture	SigLIP 2 (Vision Transformer, So400m)
Parameters	400M
Input Resolution	256x256
Training Data	WebLI dataset
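The "patch16-256" suffix in the model ID suggests 16-pixel patches over a 256x256 input. Assuming that reading of the name is right, the patch grid the vision encoder sees can be worked out as follows (the helper name is illustrative):

```python
def patch_grid(image_size: int = 256, patch_size: int = 16) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid()
print(f"{per_side} x {per_side} grid, {total} patch tokens per image")
```

Under these assumptions each image becomes a 16 x 16 grid, or 256 patch tokens, before any pooling.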

about this model

google/siglip2-so400m-patch16-256 is a vision-language encoder for zero-shot image classification and image-text retrieval. It extends the SigLIP pretraining objective with captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation. Together these improve semantic understanding, localization, and dense feature extraction.
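For context, the original SigLIP objective that SigLIP 2 builds on is a pairwise sigmoid loss: every image-text pair in a batch is scored independently as a match or a non-match. The sketch below is a plain-Python illustration of that base loss only. It does not cover the captioning or self-supervised terms, and the similarity values, temperature, and bias are illustrative numbers rather than trained parameters.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_loss(sim, temperature: float = 10.0, bias: float = -10.0) -> float:
    """Pairwise sigmoid loss over an n x n similarity matrix.

    sim[i][j] is the similarity of image i and text j; matching pairs
    sit on the diagonal (label +1), all other pairs are negatives (-1).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0
            logit = z * (temperature * sim[i][j] + bias)
            total += softplus(-logit)  # equals -log(sigmoid(logit))
    return total / n


aligned = [[1.0, -1.0], [-1.0, 1.0]]     # matching pairs are most similar
misaligned = [[-1.0, 1.0], [1.0, -1.0]]  # matching pairs are least similar
print(sigmoid_loss(aligned) < sigmoid_loss(misaligned))
```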

The model is pre-trained on the WebLI dataset using up to 2048 TPU-v5e chips. The SigLIP 2 family also includes variants that support multiple resolutions and preserve the native aspect ratio of input images; this checkpoint takes a fixed 256x256 input. The training recipe incorporates de-biasing techniques for better multilingual understanding and fairness.

Key Capabilities

Zero-shot image classification and image-text retrieval
Vision encoder for VLMs and other vision tasks
Localization and dense prediction tasks benefit from the new training objectives (decoder loss, global-local masked prediction loss)
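To make the zero-shot classification capability concrete, here is a minimal sketch of how candidate labels are typically scored in SigLIP-style models: normalize the image and text embeddings, take their dot product, apply a scale and bias, and pass the result through a sigmoid. The three-dimensional embeddings and the scale and bias values are toy assumptions; real embeddings come from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def score_labels(image_emb, label_embs, scale: float = 10.0, bias: float = -5.0):
    """Return an independent sigmoid score per candidate label."""
    img = l2_normalize(image_emb)
    scores = {}
    for label, emb in label_embs.items():
        txt = l2_normalize(emb)
        logit = scale * sum(a * b for a, b in zip(img, txt)) + bias
        scores[label] = 1.0 / (1.0 + math.exp(-logit))
    return scores


# Toy embeddings standing in for encoder outputs.
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
scores = score_labels(image_emb, label_embs)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because each label gets its own sigmoid, the scores are independent and need not sum to 1, unlike a softmax over labels.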

Benchmark Results

According to the SigLIP 2 paper, SigLIP 2 models outperform their SigLIP counterparts at every model scale in zero-shot classification, image-text retrieval, and transfer performance when used as the visual backbone for Vision-Language Models (VLMs). The So400m (400M-parameter) variant also improves markedly over its SigLIP counterpart on localization and dense prediction tasks.

Task / Metric	Performance vs. SigLIP (same scale)
Zero-shot classification	Improved
Image-text retrieval	Improved
VLM visual transfer	Improved
Localization & dense prediction	Significantly improved

Figure: evaluation results table from the SigLIP 2 paper, showing performance across multiple vision-language benchmarks.

For further details, refer to the SigLIP 2 paper.

best for

· Zero-shot image classification with custom candidate labels
· Image-text retrieval across multiple languages
· Vision encoder for Vision-Language Models (VLMs)
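For the retrieval use case, a common pattern is to embed images once, embed each text query, and rank images by cosine similarity. The sketch below assumes the embeddings are already computed; the file names and vectors are made-up toy values.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_emb, image_index, k: int = 2):
    """Return the names of the k images most similar to the query embedding."""
    ranked = sorted(
        image_index,
        key=lambda name: cosine(query_emb, image_index[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy image embeddings keyed by hypothetical file names.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
query_emb = [1.0, 0.2, 0.0]  # stands in for an embedded text query
results = top_k(query_emb, image_index)
print(results)
```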

FAQ

What tasks is SigLIP 2 So400m best for?

It excels at zero-shot image classification, image-text retrieval, and as a vision encoder for VLMs, with improved multilingual understanding and dense features.

What is the input format for the API?

Once live, the API will accept images (as a URL or base64-encoded data) together with candidate labels for classification or text queries for retrieval. Requests go to the gigarouter OpenAI-compatible endpoint and are authenticated with an API key.
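Since the endpoint is not live yet, the exact request schema is unknown. The sketch below shows one plausible way to package an image (URL or raw bytes) with candidate labels; the field names "image" and "candidate_labels" are assumptions for illustration, not a documented schema.

```python
import base64
import json

MODEL_ID = "google/siglip2-so400m-patch16-256"


def build_payload(image, candidate_labels):
    """Build a JSON-serializable request body.

    image may be a URL string or raw image bytes; bytes are base64-encoded.
    The field names are hypothetical until the API schema is published.
    """
    if isinstance(image, bytes):
        image_field = base64.b64encode(image).decode("ascii")
    else:
        image_field = image
    return {
        "model": MODEL_ID,
        "image": image_field,
        "candidate_labels": list(candidate_labels),
    }


payload = build_payload(b"\x89PNG fake image bytes", ["cat", "dog"])
print(json.dumps(payload, indent=2))
```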

How large is the model and what are its compute requirements?

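It has 400M parameters. Pre-training used up to 2048 TPU-v5e chips, but that figure describes training, not inference: the model runs efficiently on a single modern GPU, and the hosted API will be an alternative once it launches.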

What datasets was SigLIP 2 So400m trained on?

It was pre-trained on the WebLI dataset, a large multilingual image-text corpus.

How do I call this model via the gigarouter API?

Once the model is live, send a POST request to the gigarouter OpenAI-compatible endpoint, authenticated with your API key. Set the model name to "google/siglip2-so400m-patch16-256" and include the image and text inputs for your task.
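A minimal sketch of such a request using only the Python standard library follows. The URL is a placeholder, the body fields are hypothetical, and the Bearer authorization header follows the usual OpenAI-compatible convention; all three are assumptions until the endpoint is published. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Placeholder URL: the real endpoint has not been published.
PLACEHOLDER_URL = "https://api.example.com/v1/zero-shot-image-classification"


def build_request(api_key: str, payload: dict, url: str = PLACEHOLDER_URL):
    """Build (but do not send) an authenticated JSON POST request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


request = build_request(
    "YOUR_API_KEY",
    {
        "model": "google/siglip2-so400m-patch16-256",
        "image": "https://example.com/cat.png",  # hypothetical field name
        "candidate_labels": ["cat", "dog"],      # hypothetical field name
    },
)
print(request.get_method(), request.full_url)
```

Once a real endpoint exists, passing the request to `urllib.request.urlopen` would send it.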

not yet live

We're benchmarking and onboarding SigLIP 2 So400m as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336