SigLIP 2 Base
google/siglip2-base-patch16-512
published Feb 2025 · updated Feb 2025
SigLIP 2 Base is a zero-shot image classification model that extends SigLIP with improved semantic understanding, localization, and dense features for tasks such as zero-shot classification and image-text retrieval.
specs
| Task | Zero-Shot Image Classification, Image-Text Retrieval, Vision Encoder |
| Architecture | ViT-B/16 (patch size 16) |
| Parameters | 86M |
| Input Resolution | 512x512 pixels |
| Training Data | WebLI dataset |
| Training Compute | Up to 2048 TPU-v5e chips |
about this model
google/siglip2-base-patch16-512 is a zero-shot image classification model that extends the SigLIP pretraining objective with captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation to improve semantic understanding, localization, and dense feature extraction. It accepts 512×512 pixel inputs with patch size 16 and contains 86M parameters (ViT-B/16 architecture). The model is hosted as a managed, OpenAI-compatible API on Gigarouter, requiring no local installation or infrastructure.
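As a quick check of what the 512×512 input and patch size 16 imply, the sketch below computes the patch grid a ViT-style encoder would see. The helper name `patch_grid` is hypothetical and only illustrates the arithmetic; it is not part of any SigLIP 2 library.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# 512 / 16 = 32 patches per side, so 32 * 32 = 1024 patches per image.
per_side, total = patch_grid(512, 16)
print(per_side, total)
```

Under these assumptions, each image is split into a 32×32 grid, or 1024 patches in total.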
Key Strengths
The unified training recipe yields substantial gains over the original SigLIP across core tasks including zero-shot classification, image-text retrieval, and transfer performance for Vision-Language Models (VLMs). It also delivers significant improvements on localization and dense prediction tasks. The training data mixture incorporates de-biasing techniques, resulting in better multilingual understanding and improved fairness.
Benchmark Performance
On standard evaluations (from the official GitHub README and paper):
- ImageNet zero-shot top-1 accuracy: 81.2%
- COCO text-to-image retrieval (Recall@1): 55.2%
- COCO image-to-text retrieval (Recall@1): 71.2%
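For readers unfamiliar with the retrieval metric, here is a minimal sketch of how Recall@1 can be computed from a query-by-candidate similarity matrix. The function `recall_at_1` and the toy scores are illustrative assumptions, not the evaluation code used for the figures above; the official COCO protocol also handles multiple captions per image, which the `relevant` sets only approximate.

```python
def recall_at_1(similarity, relevant):
    """Fraction of queries whose top-scoring candidate is a correct match.

    similarity[q][c] is the score of candidate c for query q.
    relevant[q] is the set of candidate indices that count as correct for q.
    """
    hits = 0
    for q, row in enumerate(similarity):
        # Index of the highest-scoring candidate for this query.
        best = max(range(len(row)), key=row.__getitem__)
        if best in relevant[q]:
            hits += 1
    return hits / len(similarity)


# Toy example: 3 queries, 3 candidates (made-up scores).
sim = [
    [0.9, 0.1, 0.2],  # top candidate 0, correct
    [0.3, 0.2, 0.8],  # top candidate 2, but the correct one is 1
    [0.1, 0.2, 0.7],  # top candidate 2, correct
]
relevant = [{0}, {1}, {2}]
score = recall_at_1(sim, relevant)
print(f"Recall@1 = {score:.3f}")
```

In this toy case two of three queries retrieve a correct item first, so Recall@1 is 2/3.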
The model uses the Gemma tokenizer with a 256k vocabulary, which differs from the original SigLIP tokenizer. A separate NaFlex variant is available for variable aspect ratios and multiple resolutions.

Training Details
Pretrained on the WebLI dataset (Chen et al., 2023) using up to 2048 TPU-v5e chips. The model is one of four released scales (ViT-B 86M, L 303M, So400m 400M, g 1B). For complete methodology, refer to the SigLIP 2 paper.
best for
- Zero-shot image classification with custom label sets
- Image-text retrieval (search images by text or vice versa)
- As a vision encoder for Vision-Language Models (VLMs)
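To illustrate how zero-shot classification with a custom label set works in a SigLIP-style model, the sketch below scores one image embedding against several label embeddings using a sigmoid over scaled cosine similarity. The embeddings, the `scale` and `bias` values, and the helper names are all made-up assumptions for illustration; in the real model the embeddings come from the image and text encoders, and the scale and bias are learned parameters.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def zero_shot_scores(image_emb, label_embs, scale, bias):
    """Score each label independently: sigmoid(scale * cosine + bias)."""
    img = l2_normalize(image_emb)
    scores = {}
    for label, emb in label_embs.items():
        txt = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        scores[label] = sigmoid(scale * cosine + bias)
    return scores


# Toy 3-dimensional embeddings (hypothetical values, not model outputs).
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

# scale and bias are placeholder values chosen for the example.
scores = zero_shot_scores(image_emb, label_embs, scale=10.0, bias=-5.0)
best = max(scores, key=scores.get)
print(best)
```

Because each label is scored with its own sigmoid rather than a softmax over all labels, the scores are independent and need not sum to 1, so several labels (or none) can score highly for the same image.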
FAQ
It excels at zero-shot image classification, image-text retrieval, and as a vision encoder for VLMs, with improved localization and dense prediction over SigLIP.
SigLIP 2 Base adds captioning pretraining, self-supervised losses, and online data curation, achieving higher zero-shot accuracy (81.2% on ImageNet) and better retrieval scores.
The model takes images resized to 512x512 pixels and text (for retrieval/classification) using the Gemma tokenizer with a 256k vocabulary.
Use the Gigarouter OpenAI-compatible endpoint with your API key. Refer to the Gigarouter documentation for endpoint details and request format.
The model card does not specify an open-source license; the checkpoints are hosted under Google research terms. Check the official repository for the latest licensing information.
We're benchmarking and onboarding SigLIP 2 Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.