SigLIP 2 Base

google/siglip2-base-patch16-512

published Feb 2025 · updated Feb 2025

SigLIP 2 Base is a zero-shot image classification model that extends SigLIP with improved semantic understanding, localization, and dense features for tasks like zero-shot classification and image-text retrieval.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

123.9K

license

apache-2.0

specs

Task	Zero-Shot Image Classification, Image-Text Retrieval, Vision Encoder
Architecture	ViT-B/16 (patch size 16)
Parameters	86M
Input Resolution	512x512 pixels
Training Data	WebLI dataset
Training Compute	Up to 2048 TPU-v5e chips

about this model

google/siglip2-base-patch16-512 is a zero-shot image classification model that extends the SigLIP pretraining objective with captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation to improve semantic understanding, localization, and dense feature extraction. It accepts 512×512 pixel inputs with patch size 16 and contains 86M parameters (ViT-B/16 architecture). The model is being onboarded as a managed, OpenAI-compatible API on Gigarouter, which will require no local installation or infrastructure once it is live.
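
As a quick worked example of what the resolution and patch size imply: a 512×512 input cut into 16×16 patches gives a 32×32 grid, or 1024 patch tokens for the vision encoder. The sketch below assumes a plain non-overlapping patch grid, as in a standard ViT; the helper name `patch_grid` is ours and is not part of any library.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Values for this checkpoint: 512-pixel input, patch size 16.
per_side, total = patch_grid(512, 16)
print(per_side, total)  # 32 1024
```

The patch count grows quadratically with resolution, so the 512 variant processes four times as many patches per image as a 256-pixel input at the same patch size.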

Key Strengths

The unified training recipe yields substantial gains over the original SigLIP across core tasks including zero-shot classification, image-text retrieval, and transfer performance for Vision-Language Models (VLMs). It also delivers significant improvements on localization and dense prediction tasks. The training data mixture incorporates de-biasing techniques, resulting in better multilingual understanding and improved fairness.

Benchmark Performance

On standard evaluations (from the official GitHub README and paper):

ImageNet zero-shot top-1 accuracy: 81.2%
COCO text-to-image retrieval (Recall@1): 55.2%
COCO image-to-text retrieval (Recall@1): 71.2%
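
For readers unfamiliar with the retrieval metric, Recall@1 is the fraction of queries whose top-ranked candidate is a correct match. The following is a minimal sketch of that computation on made-up similarity scores; it is not the evaluation code used for the numbers above, and the function name `recall_at_k` and the toy data are our own.

```python
def recall_at_k(similarity, relevant, k=1):
    """Fraction of queries whose top-k ranked candidates include a relevant item.

    similarity[q][c] is the score between query q and candidate c.
    relevant[q] is the set of candidate indices that count as correct for q.
    """
    hits = 0
    for scores, truth in zip(similarity, relevant):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if truth & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy data (made up): 3 text queries scored against 3 images.
similarity = [
    [0.9, 0.1, 0.3],  # query 0 ranks image 0 first (correct)
    [0.2, 0.4, 0.8],  # query 1 ranks image 2 first (its match is image 1)
    [0.1, 0.2, 0.7],  # query 2 ranks image 2 first (correct)
]
relevant = [{0}, {1}, {2}]

print(recall_at_k(similarity, relevant, k=1))  # 2 of 3 queries hit
print(recall_at_k(similarity, relevant, k=2))  # all 3 queries hit
```

Using a set of relevant indices per query lets the same function cover image-to-text retrieval, where one image can have several correct captions.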

The model uses the Gemma tokenizer with a 256k vocabulary (distinct from the original SigLIP tokenizer). A separate NaFlex variant is available for variable aspect ratios and multiple resolutions.

Figure: evaluation table from the SigLIP 2 paper comparing zero-shot classification, retrieval, and localization metrics across model scales.

Training Details

Pretrained on the WebLI dataset (Chen et al., 2023) using up to 2048 TPU-v5e chips. The model is one of four released scales (ViT-B 86M, L 303M, So400m 400M, g 1B). For complete methodology, refer to the SigLIP 2 paper.

best for

· Zero-shot image classification with custom label sets
· Image-text retrieval (search images by text or vice versa)
· As a vision encoder for Vision-Language Models (VLMs)
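
To illustrate the first use case, here is a minimal sketch of how SigLIP-style zero-shot scoring works once image and text embeddings are available: each label gets an independent probability from a sigmoid over the scaled cosine similarity plus a bias, rather than a softmax across labels. The embeddings, `logit_scale`, and `logit_bias` below are placeholder values chosen for illustration, not outputs or learned parameters of the real model.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sigmoid_scores(image_emb, text_embs, logit_scale, logit_bias):
    """Independent per-label probabilities: sigmoid(scale * cosine + bias)."""
    img = l2_normalize(image_emb)
    scores = []
    for t in text_embs:
        txt = l2_normalize(t)
        cosine = sum(a * b for a, b in zip(img, txt))
        logit = logit_scale * cosine + logit_bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Placeholder embeddings (made up); a real model would produce these.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

scores = sigmoid_scores(image_emb, text_embs, logit_scale=10.0, logit_bias=-5.0)
best = max(range(len(labels)), key=lambda i: scores[i])
print(labels[best])  # a photo of a cat
```

Because the scores are independent, they do not have to sum to 1, so an image can score low on every label when none of the custom labels fits.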

FAQ

What tasks is SigLIP 2 Base best for?

It excels at zero-shot image classification, image-text retrieval, and as a vision encoder for VLMs, with improved localization and dense prediction over SigLIP.

How does SigLIP 2 Base compare to the original SigLIP Base?

SigLIP 2 Base adds captioning pretraining, self-supervised losses, and online data curation, achieving higher zero-shot accuracy (81.2% on ImageNet) and better retrieval scores.

What input does the model expect?

The model takes images resized to 512×512 pixels and, for retrieval or classification, text tokenized with the Gemma tokenizer (256k vocabulary).
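
For readers who want to try the checkpoint locally instead of through a hosted API, a minimal sketch using the Hugging Face transformers zero-shot image classification pipeline might look like the following. It assumes a transformers release with SigLIP 2 support plus PyTorch and Pillow are installed; the wrapper name `classify` and the file `cat.jpg` are placeholders of ours.

```python
CHECKPOINT = "google/siglip2-base-patch16-512"


def classify(image_path, candidate_labels):
    """Score one image against free-form labels with the transformers pipeline."""
    # Imported lazily so this module loads even where transformers is absent.
    from transformers import pipeline

    classifier = pipeline(task="zero-shot-image-classification", model=CHECKPOINT)
    # The pipeline takes a local path or URL and applies the checkpoint's own
    # image preprocessing and tokenizer before scoring.
    return classifier(image_path, candidate_labels=candidate_labels)


# Example call (downloads the checkpoint on first use):
# for result in classify("cat.jpg", ["a cat", "a dog", "a car"]):
#     print(result["label"], round(result["score"], 3))
```

The pipeline returns a list of dictionaries with `label` and `score` keys, so resizing and tokenization do not need to be handled by hand.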

How can I call this model via the Gigarouter API?

The model is not yet live. Once it is, use the Gigarouter OpenAI-compatible endpoint with your API key. Refer to the Gigarouter documentation for endpoint details and request format.

What license applies to SigLIP 2 Base?

SigLIP 2 Base is listed under the Apache 2.0 license (apache-2.0). Check the official repository for the latest licensing information.

not yet live

We're benchmarking and onboarding SigLIP 2 Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336