gigarouter

SigLIP 2 Base

google/siglip2-base-patch16-224

published Feb 2025 · updated Feb 2025

SigLIP 2 Base is a zero-shot image model that extends the SigLIP pretraining objective with captioning, self-supervised losses, and online data curation for improved semantic understanding, localization, and dense features.

est. price: ~$0.094 / 1k images (estimated, set at launch)
API providers: 0
downloads / mo: 408.5K
license: apache-2.0

specs

Task: Zero-shot image classification, image-text retrieval, vision encoder
Architecture: ViT-B (Vision Transformer Base)
Parameters: 86M (ViT-B variant)
License: Apache-2.0

about this model

siglip2-base-patch16-224 is a zero-shot image classification model. It extends the SigLIP pretraining objective with captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation, combined into a unified recipe for improved semantic understanding, localization, and dense features. The model is part of the SigLIP 2 family, which is available in four sizes: ViT-B (86M parameters), L (303M), So400m (400M), and g (1B). The 86M figure describes the ViT-B vision encoder alone; the full checkpoint, text encoder included, has 375,187,970 parameters (F32). It is released under the Apache-2.0 license.
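As a rough sizing aid, the quoted count of 375,187,970 parameters can be turned into a weight-memory estimate. The sketch below covers weights only (no activations or runtime overhead). The float16 and bfloat16 rows are assumptions for comparison, since this page lists F32 only.

```python
# Rough weight-memory estimate from the parameter count quoted on this page.
PARAMS = 375_187_970

# Bytes per parameter. Only float32 is stated on the page; the half-precision
# entries are assumptions added for comparison.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_bytes(params: int, dtype: str) -> int:
    """Return the bytes needed to store `params` weights in `dtype`."""
    return params * BYTES_PER_PARAM[dtype]


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {to_gib(weight_bytes(PARAMS, dtype)):.2f} GiB")
```

At float32 this works out to about 1.4 GiB of weights.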

Capabilities

SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including:

  • Zero-shot classification
  • Image-text retrieval
  • Transfer performance when extracting visual representations for Vision-Language Models (VLMs)
  • Localization and dense prediction tasks

The training recipe includes variants that support multiple resolutions and preserve the input's native aspect ratio. The model was trained on a diverse data mixture with de-biasing techniques, leading to improved multilingual understanding and fairness.

Training details

SigLIP 2 was pre-trained on the WebLI dataset (Chen et al., 2023) using up to 2048 TPU-v5e chips. On top of the SigLIP objective, the training recipe adds a decoder loss and global-local and masked prediction losses, and it is adapted to handle varying aspect ratios and resolutions.
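This page does not spell out the base SigLIP objective that SigLIP 2 builds on. As described in the original SigLIP paper, it is a pairwise sigmoid loss: every image-text pair in a batch is treated as an independent binary classification, positive when the two match and negative otherwise. The following is a minimal pure-Python sketch of that idea, not the training code. It assumes L2-normalized embeddings, and the `temperature` and `bias` defaults are illustrative values (both are learned during real training). The additional SigLIP 2 losses are not shown.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid_pair_loss(img_embs, txt_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of image and text embeddings.

    Pair (i, j) is a positive when i == j and a negative otherwise.
    Embeddings are assumed to be L2-normalized already.
    """
    n = len(img_embs)
    total = 0.0
    for i, img in enumerate(img_embs):
        for j, txt in enumerate(txt_embs):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(img, txt) + bias
            total += log_sigmoid(label * logit)
    return -total / n


# Toy batch: image i matches text i (made-up 2-d unit vectors).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = sigmoid_pair_loss(images, aligned_texts)
swapped_loss = sigmoid_pair_loss(images, swapped_texts)
print(aligned_loss < swapped_loss)
```

The loss is lower when matching pairs point in the same direction, which is the behavior the objective rewards.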

Evaluation results

[Image: evaluation table comparing SigLIP 2 performance across benchmarks]

For detailed benchmark results, refer to the evaluation table from the SigLIP 2 paper (Tschannen et al., 2025).

FAQ

What is the input format for this model?

The model accepts images and text labels. For zero-shot classification, provide an image and a list of candidate labels. For retrieval, encode images and texts separately and compare embeddings.
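The "compare embeddings" step can be sketched as follows, assuming the image and each candidate label have already been encoded into vectors. The three-dimensional vectors here are made-up toy values, not model outputs, and cosine similarity is used as the comparison, which is a common choice rather than something this page specifies.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_labels(image_emb, label_embs):
    """Return (label, score) pairs sorted from best to worst match."""
    scores = {label: cosine(image_emb, emb) for label, emb in label_embs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy embeddings standing in for real encoder outputs.
image_emb = [0.9, 0.1, 0.0]
label_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}

ranking = rank_labels(image_emb, label_embs)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

For retrieval the same comparison runs in the other direction: one text embedding is scored against many image embeddings.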

How does SigLIP 2 Base compare to the original SigLIP?

SigLIP 2 adds captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation, leading to better results on zero-shot classification, retrieval, localization, and dense prediction tasks.

What is the license for this model?

The model is released under the Apache-2.0 license.

How can I call this model via the gigarouter API?

SigLIP 2 Base is not yet live on gigarouter. Once it launches, use the gigarouter OpenAI-compatible endpoint with your API key. The endpoint accepts image URLs or base64-encoded images together with candidate labels, and returns classification scores or embeddings.
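The request schema for this model is not documented on this page, so the following only sketches how a client might package a base64-encoded image and candidate labels. The field names `image` and `candidate_labels` and the helper `build_classification_request` are hypothetical. Check the published API reference before relying on any of them.

```python
import base64
import json


def build_classification_request(image_bytes, candidate_labels,
                                 model="google/siglip2-base-patch16-224"):
    """Build a JSON-serializable request body.

    NOTE: the field names below are hypothetical placeholders, not a
    documented gigarouter schema.
    """
    if not candidate_labels:
        raise ValueError("at least one candidate label is required")
    return {
        "model": model,
        # Raw image bytes must be base64-encoded to travel inside JSON.
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "candidate_labels": list(candidate_labels),
    }


# Placeholder bytes standing in for a real image file's contents.
fake_image = b"\x89PNG placeholder bytes"
payload = build_classification_request(fake_image, ["cat", "dog"])
body = json.dumps(payload)
print(body)
```

Passing an image URL instead would skip the base64 step entirely.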

Does this model support multiple resolutions and aspect ratios?

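Yes, at the family level: the training recipe includes variants that support multiple resolutions and preserve the input's native aspect ratio. This particular checkpoint, as its patch16-224 name indicates, is a fixed-resolution variant (16×16 patches, 224×224 input).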
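The checkpoint name, patch16-224, suggests 16×16-pixel patches on 224×224 inputs. Assuming standard non-overlapping ViT patching, the number of image tokens follows from simple arithmetic, sketched below. The helper `patch_grid` is illustrative and not part of any library.

```python
def patch_grid(height: int, width: int, patch_size: int = 16):
    """Return (rows, cols, total_patches) for non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


rows, cols, total = patch_grid(224, 224)
print(f"{rows} x {cols} = {total} patches")
```

For this checkpoint's 224×224 input that gives a 14×14 grid, or 196 patches per image.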

not yet live

We're benchmarking and onboarding SigLIP 2 Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
