SigLIP 2 Base
google/siglip2-base-patch16-224
published Feb 2025 · updated Feb 2025
SigLIP 2 Base is a zero-shot-image model that extends the SigLIP pretraining objective with captioning, self-supervised losses, and online data curation for improved semantic understanding, localization, and dense features.
specs
| Task | Zero-shot image classification, image-text retrieval, vision encoder |
| Architecture | ViT-B (Vision Transformer Base) |
| Parameters | 86M (ViT-B variant) |
| License | Apache-2.0 |
about this model
Capabilities
SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including:
- Zero-shot classification
- Image-text retrieval
- Transfer performance when extracting visual representations for Vision-Language Models (VLMs)
- Localization and dense prediction tasks
The training recipe includes variants that support multiple resolutions and preserve the input's native aspect ratio. The model was trained on a diverse data-mixture with de-biasing techniques, leading to improved multilingual understanding and fairness.
Training details
SigLIP 2 was pre-trained on the WebLI dataset (Chen et al., 2023) using up to 2048 TPU-v5e chips. The training objectives include decoder loss, global-local and masked prediction loss, and aspect ratio and resolution adaptability.
Evaluation results
For detailed benchmark results, refer to the evaluation table from the SigLIP 2 paper (Tschannen et al., 2025).
best for
- ·Zero-shot image classification with custom candidate labels
- ·Image-text retrieval and similarity search
- ·Vision encoder for multimodal models
FAQ
The model accepts images and text labels. For zero-shot classification, provide an image and a list of candidate labels. For retrieval, encode images and texts separately and compare embeddings.
SigLIP 2 adds captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation, leading to better zero-shot classification, retrieval, localization, and dense prediction tasks.
The model is released under the Apache-2.0 license.
Use the gigarouter OpenAI-compatible endpoint with your API key. The endpoint accepts image URLs or base64-encoded images and candidate labels, returning classification scores or embeddings.
Yes, the training recipe includes variants that support multiple resolutions and preserve the input's native aspect ratio.
We're benchmarking and onboarding SigLIP 2 Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.