SigLIP 2 Base

google/siglip2-base-patch16-224

published Feb 2025 · updated Feb 2025

SigLIP 2 Base is a zero-shot-image model that extends the SigLIP pretraining objective with captioning, self-supervised losses, and online data curation for improved semantic understanding, localization, and dense features.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

408.5K

license

apache-2.0

specs

Task	Zero-shot image classification, image-text retrieval, vision encoder
Architecture	ViT-B (Vision Transformer Base)
Parameters	86M (ViT-B variant)
License	Apache-2.0

about this model

Siglip2-base-patch16-224 is a zero-shot image classification model that extends the SigLIP pretraining objective with captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation into a unified recipe for improved semantic understanding, localization, and dense features. The model is part of the SigLIP 2 family, available in four sizes: ViT-B (86M parameters), L (303M), So400m (400M), and g (1B). This base variant has 375,187,970 parameters (F32) and is released under the Apache-2.0 license.

Capabilities

SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including:

Zero-shot classification
Image-text retrieval
Transfer performance when extracting visual representations for Vision-Language Models (VLMs)
Localization and dense prediction tasks

The training recipe includes variants that support multiple resolutions and preserve the input's native aspect ratio. The model was trained on a diverse data-mixture with de-biasing techniques, leading to improved multilingual understanding and fairness.

Training details

SigLIP 2 was pre-trained on the WebLI dataset (Chen et al., 2023) using up to 2048 TPU-v5e chips. The training objectives include decoder loss, global-local and masked prediction loss, and aspect ratio and resolution adaptability.

Evaluation results

Evaluation table comparing SigLIP 2 performance across benchmarks

For detailed benchmark results, refer to the evaluation table from the SigLIP 2 paper (Tschannen et al., 2025).

best for

·Zero-shot image classification with custom candidate labels
·Image-text retrieval and similarity search
·Vision encoder for multimodal models

FAQ

What is the input format for this model?

The model accepts images and text labels. For zero-shot classification, provide an image and a list of candidate labels. For retrieval, encode images and texts separately and compare embeddings.

How does SigLIP 2 Base compare to the original SigLIP?

SigLIP 2 adds captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation, leading to better zero-shot classification, retrieval, localization, and dense prediction tasks.

What is the license for this model?

The model is released under the Apache-2.0 license.

How can I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key. The endpoint accepts image URLs or base64-encoded images and candidate labels, returning classification scores or embeddings.

Does this model support multiple resolutions and aspect ratios?

Yes, the training recipe includes variants that support multiple resolutions and preserve the input's native aspect ratio.

not yet live

We're benchmarking and onboarding SigLIP 2 Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

compare all →

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336