SigLIP 2 Large

google/siglip2-large-patch16-256

published Feb 2025 · updated Feb 2025

SigLIP 2 Large is a zero-shot-image model that builds on SigLIP with improved semantic understanding, localization, and dense features, enabling zero-shot classification and image-text retrieval.

est. price

~$0.235

/ 1k images · estimated, set at launch

API providers

downloads / mo

100.8K

license

apache-2.0

specs

Task	Zero-shot image classification, image-text retrieval, vision encoder
Architecture	Vision Transformer (ViT-L/16) with SigLIP 2 training
Parameters	303M

about this model

Google SigLIP 2 Large (patch16-256) is a zero-shot image classification and retrieval model that extends the original SigLIP with a unified training recipe incorporating captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. It is a multilingual vision-language encoder with 303 million parameters (ViT-L/16 at 256px resolution).

Key capabilities

The model supports zero-shot image classification, image-text retrieval, and can serve as a vision encoder for Vision-Language Models (VLMs). It is trained on the WebLI dataset (Chen et al., 2023) using up to 2048 TPU-v5e chips. SigLIP 2 improves semantic understanding, localization, and dense feature extraction over the original SigLIP across all model scales. The model uses the Gemma tokenizer with a vocabulary size of 256k.

Benchmark results

For the large-patch16-256 variant, reported zero-shot and retrieval performance (from the paper's evaluation) includes:

Task	Metric	Score
ImageNet zero-shot classification	Top-1 accuracy	82.5%
COCO text-to-image retrieval	Recall@1	54.7
COCO image-to-text retrieval	Recall@1	71.5

These figures are taken from the official big_vision README for SigLIP 2.

Evaluation table from the paper

Evaluation table comparing SigLIP 2 against prior models across multiple benchmarks.

Additional details

This model is part of a family of four released sizes (ViT-B, L, So400m, g). SigLIP 2 variants also include NaFlex checkpoints supporting multiple resolutions and native aspect ratio (custom implementation required). The model is hosted on gigarouter as a managed, OpenAI-compatible API — no local installation or infrastructure management needed.

best for

·Zero-shot image classification on arbitrary label sets
·Image-text retrieval (text-to-image and image-to-text)
·Extracting visual representations for Vision-Language Models

FAQ

What is SigLIP 2 Large?

It is a multilingual vision-language encoder built on SigLIP with improved semantic understanding, localization, and dense features.

How many parameters does this model have?

The large variant has 303M parameters (ViT-L/16).

What are the benchmark results?

It achieves 82.5% ImageNet zero-shot accuracy, 54.7 COCO text-to-image R@1, and 71.5 COCO image-to-text R@1.

What input format does the model expect?

It accepts images and candidate labels as text; use the zero-shot-image-classification pipeline or encode images separately via get_image_features.

How can I use this model via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key to send requests for zero-shot classification or embedding extraction.

not yet live

We're benchmarking and onboarding SigLIP 2 Large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

compare all →

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336