SigLIP 2 Large
google/siglip2-large-patch16-256
published Feb 2025 · updated Feb 2025
SigLIP 2 Large is a zero-shot-image model that builds on SigLIP with improved semantic understanding, localization, and dense features, enabling zero-shot classification and image-text retrieval.
specs
| Task | Zero-shot image classification, image-text retrieval, vision encoder |
| Architecture | Vision Transformer (ViT-L/16) with SigLIP 2 training |
| Parameters | 303M |
about this model
Google SigLIP 2 Large (patch16-256) is a zero-shot image classification and retrieval model that extends the original SigLIP with a unified training recipe incorporating captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation. It is a multilingual vision-language encoder with 303 million parameters (ViT-L/16 at 256px resolution).
Key capabilities
The model supports zero-shot image classification, image-text retrieval, and can serve as a vision encoder for Vision-Language Models (VLMs). It is trained on the WebLI dataset (Chen et al., 2023) using up to 2048 TPU-v5e chips. SigLIP 2 improves semantic understanding, localization, and dense feature extraction over the original SigLIP across all model scales. The model uses the Gemma tokenizer with a vocabulary size of 256k.
Benchmark results
For the large-patch16-256 variant, reported zero-shot and retrieval performance (from the paper's evaluation) includes:
| Task | Metric | Score |
|---|---|---|
| ImageNet zero-shot classification | Top-1 accuracy | 82.5% |
| COCO text-to-image retrieval | Recall@1 | 54.7 |
| COCO image-to-text retrieval | Recall@1 | 71.5 |
These figures are taken from the official big_vision README for SigLIP 2.
Evaluation table from the paper
Additional details
This model is part of a family of four released sizes (ViT-B, L, So400m, g). SigLIP 2 variants also include NaFlex checkpoints supporting multiple resolutions and native aspect ratio (custom implementation required). The model is hosted on gigarouter as a managed, OpenAI-compatible API — no local installation or infrastructure management needed.
best for
- ·Zero-shot image classification on arbitrary label sets
- ·Image-text retrieval (text-to-image and image-to-text)
- ·Extracting visual representations for Vision-Language Models
FAQ
It is a multilingual vision-language encoder built on SigLIP with improved semantic understanding, localization, and dense features.
The large variant has 303M parameters (ViT-L/16).
It achieves 82.5% ImageNet zero-shot accuracy, 54.7 COCO text-to-image R@1, and 71.5 COCO image-to-text R@1.
It accepts images and candidate labels as text; use the zero-shot-image-classification pipeline or encode images separately via get_image_features.
Use the OpenAI-compatible endpoint with your API key to send requests for zero-shot classification or embedding extraction.
We're benchmarking and onboarding SigLIP 2 Large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.