SigLIP 2 So400m

google/siglip2-so400m-patch16-512

published Feb 2025 · updated Feb 2025

SigLIP 2 So400m is a zero-shot image model that extends the SigLIP objective with captioning, self-supervised losses, and online data curation for improved semantic understanding, localization, and dense features.

est. price: ~$0.235 / 1k images (estimated, set at launch)
API providers: 0
downloads / mo: 312.9K
license: apache-2.0

specs

Task: Zero-shot image classification, image-text retrieval, vision encoder
Architecture: SigLIP 2 (Vision Transformer based)
Parameters: 1.14B (So400m variant)
License: Apache 2.0

about this model

google/siglip2-so400m-patch16-512 is a zero-shot image classification and vision-language encoder model that extends the SigLIP pretraining objective with captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation into a unified recipe for improved semantic understanding, localization, and dense features.

This model, part of the SigLIP 2 family, is the So400m variant with approximately 1.14 billion parameters. It is licensed under Apache 2.0. The model supports zero-shot image classification and image-text retrieval, and can serve as a vision encoder for Vision-Language Models (VLMs). It was pretrained on the WebLI dataset using up to 2048 TPU-v5e chips.
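To make the zero-shot classification step concrete, here is a minimal sketch of how a SigLIP-style model turns embeddings into label scores. It assumes the image and each candidate label have already been encoded into vectors. The toy embeddings and the `logit_scale` and `logit_bias` values are made up for illustration; a real checkpoint supplies learned values. Because SigLIP models are trained with a pairwise sigmoid loss, each label gets an independent probability, so the scores need not sum to 1.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def zero_shot_scores(image_emb, text_embs, logit_scale, logit_bias):
    """Score one image against each candidate label, independently per label."""
    img = l2_normalize(image_emb)
    scores = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        scores.append(sigmoid(logit_scale * cosine + logit_bias))
    return scores


# Toy 3-dimensional embeddings (illustrative only, not real model outputs).
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}

# Assumed scale and bias; in a trained model these are learned parameters.
scores = zero_shot_scores(
    image_emb, list(label_embs.values()), logit_scale=10.0, logit_bias=-5.0
)
best_label = max(zip(label_embs, scores), key=lambda pair: pair[1])[0]
print(best_label, [round(s, 3) for s in scores])
```

For image-text retrieval the same similarity is computed, but the candidates are ranked by score instead of being read as per-label probabilities.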

Key Strengths

  • Improved performance over original SigLIP across all model scales in zero-shot classification, image-text retrieval, and transfer for VLMs.
  • Significant improvements on localization and dense prediction tasks.
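  • The SigLIP 2 family includes variants that support multiple resolutions and preserve the input's native aspect ratio; this checkpoint, as its name indicates, uses 16-pixel patches at a 512-pixel input resolution.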
  • Trained on a diverse data-mixture with de-biasing techniques for better multilingual understanding and improved fairness.

Evaluation Results

The following table from the SigLIP 2 paper shows evaluation results for the model family:

[Figure: evaluation results table for the SigLIP 2 model family, showing zero-shot classification, retrieval, and localization metrics across the ViT-B, L, So400m, and g sizes.]

Additional Details

  • The So400m variant has 1,136,556,698 parameters (F32).
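  • The full model family includes four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B). These nominal sizes appear to describe the vision encoder alone, which would explain why the full So400m checkpoint, with its text encoder included, is larger at about 1.14B parameters.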
  • Training incorporated captioning-based pretraining, self-distillation, masked prediction losses, and online data curation.
  • Variants support multiple resolutions and preserve the input's native aspect ratio.
  • Training data included de-biasing techniques for improved multilingual understanding and fairness.
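As a rough worked example, the parameter count above translates into a weight-memory estimate. This sketch multiplies the reported 1,136,556,698 parameters by the bytes per parameter for a few common dtypes. It covers weights only; activations, framework overhead, and any serving-side batching are not included, and the half-precision rows assume the weights are cast down from the published F32 checkpoint.

```python
PARAM_COUNT = 1_136_556_698  # parameter count reported on this page (F32)

# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_footprint_gib(param_count, dtype):
    """Return the size of the weights alone, in GiB (2**30 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_footprint_gib(PARAM_COUNT, dtype):.2f} GiB")
```

In F32 the weights come to a little over 4.2 GiB, and casting to a 16-bit dtype halves that.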

For further details, refer to the SigLIP 2 paper and the SigLIP documentation.

FAQ

What tasks is SigLIP 2 So400m best suited for?

It is best suited for zero-shot image classification, image-text retrieval, and use as a vision encoder for VLMs. Compared with the original SigLIP, it also improves on localization and dense prediction tasks.

How large is the So400m model in terms of parameters?

The So400m variant has approximately 1.14 billion parameters, confirmed via Hugging Face safetensors metadata.

What is the license for using this model?

The model is licensed under Apache 2.0, as indicated by the Hugging Face model page tags.

What are the input and output formats for the API?

Inputs are images (as a URL or base64 string) plus candidate text labels; outputs are classification scores or embeddings. For retrieval, send image and text pairs. Requests go through the gigarouter OpenAI-compatible endpoint and require an API key.

How do I call this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key. The endpoint URL and model name will be provided in your gigarouter dashboard. Send a request with image and text inputs that follows the zero-shot image classification schema.
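Since the model is not yet live and the request schema has not been published, the following is only a sketch of how a request body might be assembled from a base64-encoded image and candidate labels. The field names `image` and `candidate_labels`, the helper `build_zero_shot_request`, and the use of the Hugging Face ID as the model name are all assumptions; check the gigarouter dashboard for the real endpoint URL, model name, and schema. The sketch builds the JSON body and does not send anything.

```python
import base64
import json

# Assumption: the hosted model name may differ from the Hugging Face ID.
MODEL_ID = "google/siglip2-so400m-patch16-512"


def build_zero_shot_request(image_bytes, candidate_labels, model=MODEL_ID):
    """Build a hypothetical zero-shot classification request body."""
    if not candidate_labels:
        raise ValueError("at least one candidate label is required")
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "image": image_b64,  # hypothetical field name
        "candidate_labels": list(candidate_labels),  # hypothetical field name
    }


# Placeholder bytes stand in for a real image file's contents.
payload = build_zero_shot_request(b"placeholder image bytes", ["a cat", "a dog"])
body = json.dumps(payload)
print(body)
```

Once the endpoint is available, the serialized body would be sent with an HTTP POST that carries the API key in the authorization header.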

not yet live

We're benchmarking and onboarding SigLIP 2 So400m as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
