SigLIP 2 So400m

google/siglip2-so400m-patch16-naflex

published Feb 2025 · updated Feb 2025

SigLIP 2 So400m is a zero-shot image classification model that extends SigLIP with improved semantic understanding, localization, and dense features for multilingual vision-language tasks.

est. price

~$0.235 / 1k images · estimated, set at launch

API providers

downloads / mo

732.4K

license

apache-2.0
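For budgeting, the estimated price listed above (~$0.235 per 1k images) can be turned into a rough cost figure. The sketch below is plain arithmetic; the function name and the example volume are hypothetical, and the rate is only the launch estimate shown on this page, not a confirmed price.

```python
# Launch estimate quoted on this page; treat it as an assumption, not a confirmed rate.
EST_PRICE_PER_1K_IMAGES_USD = 0.235


def estimate_cost_usd(num_images: int,
                      price_per_1k: float = EST_PRICE_PER_1K_IMAGES_USD) -> float:
    """Return the estimated cost in USD for processing `num_images` images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return num_images / 1000 * price_per_1k


if __name__ == "__main__":
    # Hypothetical workload of 250,000 images.
    print(f"${estimate_cost_usd(250_000):.2f}")
```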

specs

Task	Zero-shot image classification, image-text retrieval, vision encoder
Architecture	SigLIP 2 Vision Transformer (ViT), So400m/16 NaFlex variant (patch size 16) with native aspect ratio and multi-resolution support
Parameters	400 million

about this model

google/siglip2-so400m-patch16-naflex is a zero-shot image classification and vision-language encoding model. It extends the SigLIP pretraining objective with captioning-based pretraining and self-supervised losses (self-distillation, masked prediction) to improve semantic understanding, localization, and dense features. The NaFlex variant, which draws on ideas from NaViT and FlexiViT, supports multiple resolutions while preserving the input's native aspect ratio.
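To make the idea of preserving the native aspect ratio concrete, here is a minimal sketch of how an image could be mapped to a patch grid under a fixed patch budget. It is not the model's actual preprocessing: the function name `aspect_preserving_grid`, the `max_num_patches` parameter, and the default budget of 256 are assumptions chosen for illustration.

```python
import math


def aspect_preserving_grid(height: int, width: int,
                           patch_size: int = 16,
                           max_num_patches: int = 256) -> tuple[int, int]:
    """Pick a (rows, cols) patch grid that roughly keeps the image's aspect
    ratio while using at most `max_num_patches` patches. Illustrative only."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Scale the image so its area is about max_num_patches patches.
    # Note: images smaller than the budget are scaled up by this rule.
    scale = math.sqrt(max_num_patches * patch_size ** 2 / (height * width))
    rows = max(1, math.floor(height * scale / patch_size))
    cols = max(1, math.floor(width * scale / patch_size))
    # Extreme aspect ratios can push one side to the 1-patch minimum;
    # clamp so the grid never exceeds the budget.
    rows = min(rows, max_num_patches)
    cols = min(cols, max_num_patches // rows)
    return rows, cols


if __name__ == "__main__":
    for h, w in [(512, 512), (480, 640), (16, 160000)]:
        rows, cols = aspect_preserving_grid(h, w)
        print(f"{h}x{w} -> {rows}x{cols} patches "
              f"({rows * patch_size if (patch_size := 16) else 0}x{cols * 16} px)")
```

In this sketch a square image fills the budget as a square grid, while a landscape image gets more columns than rows, so the resized input keeps roughly the same shape instead of being squashed to a square.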

SigLIP 2 models are trained on the WebLI dataset (Chen et al., 2023) using up to 2048 TPU-v5e chips. The So400m/16 variant contains 400 million parameters and uses a Gemma tokenizer with a 256k vocabulary. The model can be used for zero-shot image classification, image-text retrieval, or as a vision encoder for vision-language models (VLMs).
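The zero-shot classification use case can be sketched without the model itself. SigLIP-style models score each image-text pair independently with a sigmoid rather than normalizing across labels with a softmax, so the scores need not sum to 1. The example below uses toy 2-dimensional embeddings; the `logit_scale` and `logit_bias` values are made up for illustration (in a real checkpoint they are learned), and all function names are hypothetical.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def zero_shot_scores(image_emb: list[float],
                     text_embs: list[list[float]],
                     logit_scale: float = 10.0,   # assumed value, learned in practice
                     logit_bias: float = -5.0     # assumed value, learned in practice
                     ) -> list[float]:
    """Return one independent match probability per candidate label."""
    img = l2_normalize(image_emb)
    scores = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        scores.append(sigmoid(logit_scale * cosine + logit_bias))
    return scores


if __name__ == "__main__":
    labels = ["a photo of a cat", "a photo of a dog"]
    # Toy embeddings standing in for real encoder outputs.
    image_emb = [1.0, 0.0]
    text_embs = [[1.0, 0.0], [0.0, 1.0]]
    for label, score in zip(labels, zero_shot_scores(image_emb, text_embs)):
        print(f"{label}: {score:.4f}")
```

Because each label is scored on its own, an image can match several labels strongly or none at all, which is useful when the candidate label set may not contain the right answer.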

Evaluation Results

The SigLIP 2 paper reports the following zero-shot ImageNet accuracy and COCO retrieval scores for the So400m/16 variant at different input resolutions:

Zero-shot ImageNet accuracy: 83.4% at 256px, 84.1% at 384px, 84.3% at 512px.
COCO Text→Image retrieval: 56.0 at 512px.
COCO Image→Text retrieval: 71.3 at 512px.

Compared to the original SigLIP, SigLIP 2 shows consistent improvements across zero-shot classification, retrieval, and dense prediction tasks, with stronger multilingual understanding and improved fairness due to de-biasing techniques in the training data mixture.

best for

· Zero-shot classification of images using natural language labels
· Multilingual image-text retrieval across diverse domains
· Vision backbone for Vision-Language Models (VLMs)
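For the retrieval use case listed above, the usual pattern is to embed the query text and the candidate images once, then rank candidates by cosine similarity. The sketch below assumes the embeddings have already been produced by some encoder; the vectors are toy values and the function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_emb: list[float],
                       candidate_embs: list[list[float]]) -> list[int]:
    """Return candidate indices ordered from most to least similar."""
    sims = [(cosine_similarity(query_emb, emb), idx)
            for idx, emb in enumerate(candidate_embs)]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in sims]


if __name__ == "__main__":
    # Toy text-query embedding and three toy image embeddings.
    query = [1.0, 0.0]
    images = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
    print(rank_by_similarity(query, images))
```

The same ranking works in the other direction (image query, text candidates), since both modalities are embedded into a shared space.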

FAQ

What is the parameter count of SigLIP 2 So400m?

It has 400 million parameters.

What input resolutions does the NaFlex variant support?

The NaFlex variant supports multiple resolutions, including 224px, 256px, 384px, and 512px, while preserving the input's native aspect ratio.

How does SigLIP 2 So400m compare to the original SigLIP?

SigLIP 2 outperforms SigLIP at all scales on zero-shot classification, image-text retrieval, and dense prediction tasks.

How can I use this model via the gigarouter API?

The hosted endpoint is not yet live. Once it is, call the gigarouter OpenAI-compatible endpoint with your API key and the appropriate input format.

What is the training dataset for SigLIP 2?

It is pre-trained on the WebLI dataset (Chen et al., 2023).

not yet live

We're benchmarking and onboarding SigLIP 2 So400m as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336