SigLIP 2 So400m
google/siglip2-so400m-patch16-256
published Feb 2025 · updated Feb 2025
SigLIP 2 So400m is a multilingual vision-language encoder for zero-shot image classification and image-text retrieval.
specs
| Spec | Value |
|---|---|
| Task | Zero-Shot Image Classification & Image-Text Retrieval |
| Architecture | SigLIP 2 (Vision Transformer, So400m) |
| Parameters | 400M |
| Input Resolution | 256x256 pixels |
| Training Data | WebLI dataset |
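As a small worked example, the input resolution and the `patch16` part of the model name together determine how many image patches the vision transformer sees. The sketch below assumes the usual naming convention (patch size 16, square 256-pixel input); the helper name `patch_grid` is hypothetical and used only for illustration.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Assumption: "patch16-256" means 16x16 patches on a 256x256 input.
per_side, num_patches = patch_grid(image_size=256, patch_size=16)
print(per_side, num_patches)  # 16 256
```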
about this model
google/siglip2-so400m-patch16-256 is a vision-language encoder for zero-shot image classification and image-text retrieval. It extends the SigLIP pretraining objective with captioning-based pretraining, self-supervised losses (self-distillation and masked prediction), and online data curation, which improve semantic understanding, localization, and dense feature extraction.
The model is pre-trained on the WebLI dataset using up to 2048 TPU-v5e chips. The SigLIP 2 family also includes variants that support multiple resolutions and preserve the native aspect ratio of input images; this checkpoint takes fixed 256x256 inputs. The training recipe incorporates de-biasing techniques for better multilingual understanding and fairness.
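For readers unfamiliar with the base SigLIP objective that SigLIP 2 extends, the following is a minimal pure-Python sketch of a pairwise sigmoid loss over a batch of image and text embeddings. It is not the training code: the function names are hypothetical, the `temperature` and `bias` defaults are illustrative values rather than learned parameters, and the additional SigLIP 2 losses (captioning, self-distillation, masked prediction) are not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Treat every image-text pair as a binary problem:
    label +1 for matching pairs (i == j), -1 for all others."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / len(imgs)


# Toy batch: aligned pairs should give a lower loss than swapped pairs.
aligned = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because each pair is scored independently, the loss does not need a softmax over the whole batch.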
Key Capabilities
- Zero-shot image classification and image-text retrieval
- Vision encoder for VLMs and other vision tasks
- Localization and dense prediction, supported by the new training objectives (decoder loss, global-local masked prediction loss)
Benchmark Results
SigLIP 2 models outperform their SigLIP counterparts at all model scales in zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). The So400m (400M-parameter) variant also improves significantly on localization and dense prediction tasks.
| Task / Metric | Performance vs. SigLIP (same scale) |
|---|---|
| Zero-shot classification | Improved |
| Image-text retrieval | Improved |
| VLM visual transfer | Improved |
| Localization & dense prediction | Significantly improved |
For further details, refer to the SigLIP 2 paper.
best for
- Zero-shot image classification with custom candidate labels
- Image-text retrieval across multiple languages
- Vision encoder for Vision-Language Models (VLMs)
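Both zero-shot classification and image-text retrieval come down to comparing an image embedding with text embeddings. The sketch below ranks candidate labels by cosine similarity, using made-up 3-dimensional vectors in place of real model outputs; the function names and the toy embeddings are assumptions for illustration only. A real SigLIP-style model would additionally apply a learned scale and bias before a sigmoid, which changes the scores but not the ordering as long as the scale is positive.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_labels(image_emb, label_embs):
    """Return (label, similarity) pairs, best match first."""
    scored = [(label, cosine(image_emb, emb)) for label, emb in label_embs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings standing in for encoder outputs (not real model values).
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

ranking = rank_labels(image_emb, label_embs)
best_label = ranking[0][0]
```

For retrieval, the same ranking applies with the roles reversed: one text query is compared against many image embeddings.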
FAQ
SigLIP 2 So400m excels at zero-shot image classification, image-text retrieval, and as a vision encoder for VLMs, with improved multilingual understanding and dense features.
The API accepts images (URL or base64) and candidate labels (for classification) or text queries (for retrieval). Use the gigarouter OpenAI-compatible endpoint with an API key.
The model has 400M parameters and was trained on up to 2048 TPU-v5e chips. Inference is efficient on modern GPUs or via the hosted API.
The model was pre-trained on the WebLI dataset, a large multilingual image-text corpus.
Send a POST request to the gigarouter OpenAI-compatible endpoint with your API key, specifying the model name "google/siglip2-so400m-patch16-256" and providing image and text inputs as required.
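The request described above might be assembled roughly as follows. This is only a sketch: the page does not give the endpoint URL, path, or payload schema, so `BASE_URL`, the `/classify` path, and the `image` and `candidate_labels` field names are all hypothetical placeholders. The code builds the request object but does not send it.

```python
import base64
import json
import urllib.request

# Assumption: placeholder URL; the real gigarouter endpoint is not stated here.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL = "google/siglip2-so400m-patch16-256"


def image_to_base64(raw_bytes: bytes) -> str:
    """Encode raw image bytes for APIs that accept base64 input."""
    return base64.b64encode(raw_bytes).decode("ascii")


def build_classification_request(api_key, image, candidate_labels):
    """Build (but do not send) a POST request.

    `image` is either a URL or a base64 string. The path and the payload
    field names below are hypothetical, not a documented schema.
    """
    payload = {
        "model": MODEL,
        "image": image,
        "candidate_labels": list(candidate_labels),
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/classify",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_classification_request(
    api_key="YOUR_API_KEY",
    image="https://example.com/cat.jpg",
    candidate_labels=["a photo of a cat", "a photo of a dog"],
)
```

Once the actual endpoint and schema are known, the request could be sent with `urllib.request.urlopen(request)` and the JSON response parsed with `json.loads`.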
We're benchmarking and onboarding SigLIP 2 So400m as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.