skip to content
gigarouter gigarouter
models / zero-shot image · coming soon

SigLIP 2 Large

google/siglip2-large-patch16-256

published Feb 2025 · updated Feb 2025

SigLIP 2 Large is a zero-shot-image model that builds on SigLIP with improved semantic understanding, localization, and dense features, enabling zero-shot classification and image-text retrieval.

est. price
~$0.235
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
100.8K
license
apache-2.0

specs

TaskZero-shot image classification, image-text retrieval, vision encoder
ArchitectureVision Transformer (ViT-L/16) with SigLIP 2 training
Parameters303M

about this model

Google SigLIP 2 Large (patch16-256) is a zero-shot image classification and retrieval model that extends the original SigLIP with a unified training recipe incorporating captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. It is a multilingual vision-language encoder with 303 million parameters (ViT-L/16 at 256px resolution).

Key capabilities

The model supports zero-shot image classification, image-text retrieval, and can serve as a vision encoder for Vision-Language Models (VLMs). It is trained on the WebLI dataset (Chen et al., 2023) using up to 2048 TPU-v5e chips. SigLIP 2 improves semantic understanding, localization, and dense feature extraction over the original SigLIP across all model scales. The model uses the Gemma tokenizer with a vocabulary size of 256k.

Benchmark results

For the large-patch16-256 variant, reported zero-shot and retrieval performance (from the paper's evaluation) includes:

TaskMetricScore
ImageNet zero-shot classificationTop-1 accuracy82.5%
COCO text-to-image retrievalRecall@154.7
COCO image-to-text retrievalRecall@171.5

These figures are taken from the official big_vision README for SigLIP 2.

Evaluation table from the paper

Evaluation table comparing SigLIP 2 against prior models across multiple benchmarks.

Additional details

This model is part of a family of four released sizes (ViT-B, L, So400m, g). SigLIP 2 variants also include NaFlex checkpoints supporting multiple resolutions and native aspect ratio (custom implementation required). The model is hosted on gigarouter as a managed, OpenAI-compatible API — no local installation or infrastructure management needed.

best for

FAQ

What is SigLIP 2 Large?

It is a multilingual vision-language encoder built on SigLIP with improved semantic understanding, localization, and dense features.

How many parameters does this model have?

The large variant has 303M parameters (ViT-L/16).

What are the benchmark results?

It achieves 82.5% ImageNet zero-shot accuracy, 54.7 COCO text-to-image R@1, and 71.5 COCO image-to-text R@1.

What input format does the model expect?

It accepts images and candidate labels as text; use the zero-shot-image-classification pipeline or encode images separately via get_image_features.

How can I use this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key to send requests for zero-shot classification or embedding extraction.

not yet live

We're benchmarking and onboarding SigLIP 2 Large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

compare all →