
SigLIP 2 Base

google/siglip2-base-patch16-naflex

published Feb 2025 · updated Feb 2025

SigLIP 2 Base is a zero-shot image classification model that extends SigLIP with improved semantic understanding, localization, and dense features.

est. price: ~$0.094 / 1k images (estimated, set at launch)
API providers: 0
downloads / mo: 796.3K
license: apache-2.0
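
As a rough budgeting aid, the estimated price can be turned into a per-job figure. This is a minimal sketch that assumes pricing scales linearly with image count, with no tiers, minimums, or free credit; the function name is illustrative, and the rate is only the estimate listed above, which is set at launch.

```python
# Estimated rate from the listing above; the final price is set at launch.
EST_PRICE_PER_1K_IMAGES_USD = 0.094


def estimate_cost_usd(num_images: int,
                      price_per_1k: float = EST_PRICE_PER_1K_IMAGES_USD) -> float:
    """Linear estimate: (images / 1000) * price per 1k images.

    Assumes no volume tiers or minimum charges (an assumption, not a
    published pricing rule).
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return num_images / 1000 * price_per_1k


if __name__ == "__main__":
    for n in (1_000, 250_000, 1_000_000):
        print(f"{n:>9,} images -> ~${estimate_cost_usd(n):,.2f}")
```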

specs

Task: Zero-shot image classification, image-text retrieval, vision encoder
Architecture: ViT-B/16 with NaFlex (native aspect ratio and flexibility)
Parameters: 86M
Training Data: WebLI dataset (100+ languages)

about this model

SigLIP 2 Base is a multilingual vision-language encoder for zero-shot image classification, image-text retrieval, and dense prediction tasks. It is planned as a managed API on gigarouter and is not yet live.

Architecture and Capabilities

The model (ViT-B, 86M parameters) extends the SigLIP pretraining objective with captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. It uses the Gemma tokenizer with a 256k vocabulary. The NaFlex variant preserves the input's native aspect ratio and supports multiple resolutions. It therefore needs its own preprocessing, which is not compatible with standard SigLIP inference code.
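
To make the preprocessing difference concrete, here is a minimal sketch of choosing a resize target that roughly keeps the aspect ratio, snaps both sides to the 16px patch grid, and stays within a patch budget. The function name and the `max_patches=256` default are assumptions for illustration; this is not the official NaFlex preprocessing, which may round and pad differently.

```python
import math


def naflex_target_size(height: int, width: int,
                       patch_size: int = 16,
                       max_patches: int = 256) -> tuple[int, int]:
    """Pick a (height, width) resize target for a variable-aspect input.

    Illustrative only: scales the image so its area is about
    max_patches * patch_size**2, rounds each side down to a whole number
    of patches, and enforces the patch budget.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")

    # Uniform scale factor that would hit the patch budget exactly.
    scale = math.sqrt(max_patches * patch_size ** 2 / (height * width))

    # Round down to whole patches, keeping at least one patch per side
    # and never more than the whole budget on a single side.
    rows = min(max(1, math.floor(height * scale / patch_size)), max_patches)
    cols = min(max(1, math.floor(width * scale / patch_size)), max_patches)

    # Very thin images can exceed the budget after the one-patch minimum.
    rows = min(rows, max_patches // cols)

    return rows * patch_size, cols * patch_size


if __name__ == "__main__":
    print(naflex_target_size(480, 640))  # -> (208, 288)
```

A fixed-resolution variant would instead resize every input to the same square, which is why the two preprocessing paths are not interchangeable.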

Training Details

Pre-trained on the WebLI dataset (Chen et al., 2023) with a diverse data mixture incorporating de-biasing techniques for improved multilingual understanding and fairness. Training used up to 2048 TPU-v5e chips.

Benchmark Results

At 224px resolution, SigLIP 2 Base achieves:

  • 78.2% ImageNet zero-shot accuracy
  • 52.1% COCO text-to-image recall@1
  • 68.9% COCO image-to-text recall@1
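
Retrieval figures like the COCO numbers above are usually reported as recall@k: the share of queries for which at least one correct match appears among the top k results. The sketch below shows that calculation on a toy similarity matrix. The data and function name are made up for illustration and this is not the paper's evaluation code.

```python
def recall_at_k(similarity, relevant, k=1):
    """Fraction of queries with at least one relevant candidate in the top k.

    similarity[q][c] is the score of candidate c for query q.
    relevant[q] is the set of candidate indices that count as correct for
    query q (an image can have several matching captions).
    """
    hits = 0
    for q, scores in enumerate(similarity):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if any(c in relevant[q] for c in ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example: 3 queries scored against 3 candidates (made-up numbers).
toy_similarity = [
    [0.9, 0.1, 0.2],  # best candidate 0, which is relevant
    [0.3, 0.2, 0.8],  # best candidate 2, which is not relevant
    [0.1, 0.7, 0.4],  # best candidate 1, which is relevant
]
toy_relevant = [{0}, {1}, {1}]

if __name__ == "__main__":
    print(recall_at_k(toy_similarity, toy_relevant, k=1))  # 2 of 3 queries hit
```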

The model outperforms its SigLIP counterpart at the same scale across zero-shot classification, retrieval, and transfer performance for Vision-Language Models (VLMs), with significant gains on localization and dense prediction tasks.

[Figure: evaluation table from the SigLIP 2 paper comparing model variants across benchmarks]

Model Variants

SigLIP 2 is released in four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B). This base model is the smallest, optimized for efficient inference.

FAQ

What are the main tasks this model is best for?

Zero-shot image classification, image-text retrieval, and as a vision encoder for VLMs.
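
For zero-shot classification, SigLIP-style models score each image-text pair independently with a sigmoid, so the scores do not have to sum to 1 across labels. The sketch below illustrates the scoring step on toy embeddings. The `logit_scale` and `logit_bias` values are placeholders (the real model learns them), and the vectors stand in for actual encoder outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sigmoid_scores(image_emb, text_embs, logit_scale=10.0, logit_bias=-5.0):
    """Independent per-label match probabilities for one image.

    logit_scale and logit_bias are placeholder values for illustration;
    in a trained model they are learned parameters.
    """
    scores = []
    for text_emb in text_embs:
        logit = logit_scale * cosine(image_emb, text_emb) + logit_bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Toy embeddings (made up); a real pipeline would take these from the
# image and text encoders.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a kitten"]
image_embedding = [1.0, 0.0]
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

if __name__ == "__main__":
    probs = sigmoid_scores(image_embedding, text_embeddings)
    best = max(range(len(labels)), key=lambda i: probs[i])
    print(labels[best], [round(p, 3) for p in probs])
```

Because each label is scored on its own, several labels can receive high probability for the same image, which suits multi-label use; for single-label classification, take the highest-scoring label.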

How does this model compare to the original SigLIP?

SigLIP 2 improves semantic understanding, localization, dense features, multilingual understanding, and fairness. At base size, it achieves 78.2% ImageNet zero-shot accuracy.

What is the NaFlex variant?

It preserves the input's native aspect ratio and supports multiple resolutions, unlike fixed-resolution variants.

What is the model's parameter count?

86M parameters for the base ViT-B/16 model.

How can I call this model via API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with an API key. The hosted API is not yet available.

not yet live

We're benchmarking and onboarding SigLIP 2 Base as a hosted, OpenAI-compatible API.
