
SigLIP 2 So400m

google/siglip2-so400m-patch16-256

published Feb 2025 · updated Feb 2025

SigLIP 2 So400m is a multilingual vision-language encoder for zero-shot image classification and image-text retrieval.

est. price: ~$0.235 / 1k images (estimated, set at launch)
API providers: 0
downloads / mo: 521.6K
license: apache-2.0

specs

Task: Zero-Shot Image Classification & Image-Text Retrieval
Architecture: SigLIP 2 (Vision Transformer, So400m)
Parameters: 400M
Input Resolution: 256x256
Training Data: WebLI dataset

about this model

google/siglip2-so400m-patch16-256 is a vision-language encoder for zero-shot image classification and image-text retrieval. It extends the SigLIP pretraining objective with captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation, which improves semantic understanding, localization, and dense feature extraction.

The model is pre-trained on the WebLI dataset using up to 2048 TPU-v5e chips. The SigLIP 2 family also includes variants that support multiple resolutions and preserve the native aspect ratio of input images; this checkpoint takes a fixed 256x256 input. The training recipe incorporates de-biasing techniques for better multilingual understanding and fairness.

Key Capabilities

  • Zero-shot image classification and image-text retrieval
  • Vision encoder for VLMs and other vision tasks
  • Localization and dense prediction, aided by the new training objectives (decoder loss, global-local masked prediction loss)
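
To make the zero-shot classification capability concrete, here is a minimal sketch of how SigLIP-style scoring works once image and text embeddings are available. It assumes the embeddings have already been produced by the model; the toy vectors and the `scale` and `bias` values below are placeholders for illustration, not values from the checkpoint, and the function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so the dot product is cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def zero_shot_scores(image_emb, label_embs, scale=1.0, bias=0.0):
    """Score one image embedding against each candidate label embedding.

    SigLIP-style models apply a sigmoid to (scale * cosine + bias), so each
    label gets an independent probability; scores need not sum to 1.
    `scale` and `bias` stand in for the learned values (assumed here).
    """
    img = l2_normalize(image_emb)
    scores = {}
    for label, emb in label_embs.items():
        txt = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        scores[label] = sigmoid(scale * cosine + bias)
    return scores


def best_label(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings, purely illustrative.
    image = [0.9, 0.1, 0.0]
    labels = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
    result = zero_shot_scores(image, labels, scale=10.0, bias=-5.0)
    print(best_label(result), result)
```

The same scoring function covers image-text retrieval: swap the roles so that one text embedding is scored against many image embeddings, then rank by score.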

Benchmark Results

SigLIP 2 models outperform their SigLIP counterparts at all model scales in zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). The So400m (400M parameter) variant delivers significant improvements on localization and dense prediction tasks.

Task / Metric: Performance vs. SigLIP (same scale)
Zero-shot classification: Improved
Image-text retrieval: Improved
VLM visual transfer: Improved
Localization & dense prediction: Significantly improved

[Figure: evaluation results table from the SigLIP 2 paper, showing performance across multiple vision-language benchmarks]

For further details, refer to the SigLIP 2 paper.

FAQ

What tasks is SigLIP 2 So400m best for?

It excels at zero-shot image classification, image-text retrieval, and as a vision encoder for VLMs, with improved multilingual understanding and dense features.

What is the input format for the API?

The API accepts images (URL or base64) and candidate labels (for classification) or text queries (for retrieval). Use the gigarouter OpenAI-compatible endpoint with an API key.
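
As a rough illustration of the two image input forms, the sketch below builds a classification request body from either an image URL or raw image bytes encoded as base64. The request schema has not been published, so the field names `image` and `candidate_labels`, the data-URL wrapping, and both helper functions are assumptions for illustration only.

```python
import base64
import json

MODEL_ID = "google/siglip2-so400m-patch16-256"


def image_to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_classification_payload(image, candidate_labels):
    """Build a request body for zero-shot classification.

    `image` is either an http(s) URL or a base64 data URL.
    The field names are hypothetical, not a documented schema.
    """
    labels = list(candidate_labels)
    if not labels:
        raise ValueError("at least one candidate label is required")
    return {
        "model": MODEL_ID,
        "image": image,
        "candidate_labels": labels,
    }


if __name__ == "__main__":
    # Option 1: reference the image by URL (placeholder address).
    by_url = build_classification_payload(
        "https://example.com/photo.png", ["a cat", "a dog"]
    )
    # Option 2: inline the image bytes as base64 (placeholder bytes).
    inline = build_classification_payload(
        image_to_data_url(b"\x89PNG placeholder"), ["a cat", "a dog"]
    )
    print(json.dumps(by_url, indent=2))
    print(inline["image"][:40])
```

For retrieval, the same shape would presumably carry text queries in place of candidate labels; check the published schema once the model is live.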

How large is the model and what are its compute requirements?

It has 400M parameters. The 2048 TPU-v5e chips refer to pre-training, not inference: the model runs on a single modern GPU, and a hosted API is planned but not yet live.

What datasets was SigLIP 2 So400m trained on?

It was pre-trained on the WebLI dataset, a large multilingual image-text corpus.

How do I call this model via the gigarouter API?

Send a POST request to the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name "google/siglip2-so400m-patch16-256" and providing image and text inputs as required.
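
A minimal sketch of such a request, using only the Python standard library. The model is not yet live and no endpoint URL is given here, so `API_URL` is a placeholder, the payload fields are hypothetical, and the `Authorization: Bearer` header is assumed from the usual OpenAI-compatible convention. The code only constructs the request; the send call is left commented out.

```python
import json
import urllib.request

# Placeholder endpoint: the real URL is not published yet (assumption).
API_URL = "https://api.example.invalid/v1/zero-shot-image"
MODEL_ID = "google/siglip2-so400m-patch16-256"


def build_request(api_key, payload, url=API_URL):
    """Build (but do not send) a JSON POST request with bearer authentication."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Payload field names are hypothetical, for illustration only.
    payload = {
        "model": MODEL_ID,
        "image": "https://example.com/photo.png",
        "candidate_labels": ["a cat", "a dog"],
    }
    request = build_request("YOUR_API_KEY", payload)
    print(request.get_method(), request.full_url)
    # Once the endpoint is live, sending would look like:
    # with urllib.request.urlopen(request, timeout=30) as response:
    #     print(json.load(response))
```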

not yet live

We're benchmarking and onboarding SigLIP 2 So400m as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
