SigLIP 2 Giant

google/siglip2-giant-opt-patch16-384

published Feb 2025 · updated Feb 2025

SigLIP 2 Giant is a zero-shot image model that extends SigLIP with improved semantic understanding, localization, and dense features, for tasks such as zero-shot classification and image-text retrieval.

est. price

~$0.626 / 1k images · estimated, set at launch

API providers

downloads / mo

1.5M

license

apache-2.0

specs

Task	Zero-Shot Image Classification, Image-Text Retrieval, Vision Encoder
Architecture	ViT-giant/16 (shape-optimized "opt" variant), patch size 16, 384 resolution
Parameters	1 billion
Input Resolution	384x384
Training Data	WebLI dataset
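
As a quick check on the specs above, a 384x384 input cut into 16x16 patches gives a 24x24 grid, or 576 patches per image. The sketch below only does that arithmetic; the function name is illustrative and not part of any library.

```python
def patch_grid(resolution: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square input image."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    per_side = resolution // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid(384, 16)
print(per_side, total)  # 24 576
```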

about this model

google/siglip2-giant-opt-patch16-384 is a vision-language encoder model for zero-shot image classification and image-text retrieval, and it can also serve as the visual backbone of a vision-language model (VLM). It is the giant (1B parameter) variant of SigLIP 2: a shape-optimized ViT-giant vision encoder (the "opt" in the checkpoint name, not an OPT language decoder) with patch size 16 and 384 resolution, trained with a sigmoid loss objective. The model extends the original SigLIP with a unified recipe that combines captioning-based pretraining, self-supervised losses (self-distillation, masked prediction), and online data curation, resulting in improved semantic understanding, localization, and dense feature extraction.
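
The sigmoid loss mentioned above treats every image-text pair in a batch as an independent binary classification: matching pairs are positives, all other pairs are negatives. Below is a minimal NumPy sketch of that objective as formulated in the original SigLIP paper. It is not the SigLIP 2 training code; the function name is hypothetical, and the temperature and bias defaults are illustrative starting values (in training both are learned).

```python
import numpy as np


def sigmoid_pairwise_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss for a batch of image and text embeddings.

    img_emb, txt_emb: arrays of shape (batch, dim); row i of each is a matching pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = temperature * (img_emb @ txt_emb.T) + bias
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0  # +1 on the diagonal, -1 everywhere else

    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    per_pair = np.logaddexp(0.0, -labels * logits)
    return per_pair.sum() / n
```

Because each pair is scored on its own, the loss needs no softmax normalization across the batch, which is the main difference from the contrastive loss used by CLIP.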

Key Improvements

Enhanced zero-shot classification and retrieval performance across all model scales compared to SigLIP.
Significant gains on localization and dense prediction tasks.
Support for multiple resolutions and native aspect ratio preservation.
Multilingual understanding across the more than 100 languages covered by the WebLI dataset, and improved fairness through de-biasing of the training data.

Benchmark Performance

The evaluation results from the SigLIP 2 paper are shown below. For context, the So400m variant at 384 resolution achieves 84.1% ImageNet zero-shot accuracy, 56.0 COCO text-to-image R@1, and 71.2 COCO image-to-text R@1.

[Image: SigLIP 2 evaluation results table comparing zero-shot classification and retrieval metrics across model variants]
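
R@1 in the retrieval figures is the share of queries whose top-ranked candidate is a correct match. The sketch below computes recall at k from a similarity matrix under a simplifying assumption: each query has exactly one correct candidate, at the same index. COCO pairs each image with several captions, so the benchmark protocol is more involved. The function name and the toy scores are illustrative only.

```python
import numpy as np


def recall_at_k(similarity, k=1):
    """similarity[i, j] scores query i against candidate j; candidate i is the match for query i."""
    # Indices of the k highest-scoring candidates for each query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    hits = (top_k == targets).any(axis=1)
    return float(hits.mean())


sim = np.array([
    [0.9, 0.1, 0.0],  # query 0 ranks candidate 0 first (hit)
    [0.2, 0.3, 0.8],  # query 1 ranks candidate 2 first (miss at k=1)
    [0.1, 0.2, 0.7],  # query 2 ranks candidate 2 first (hit)
])
print(recall_at_k(sim, k=1))  # 2 of 3 queries hit
print(recall_at_k(sim, k=2))  # all 3 queries hit
```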

Training Details

Pre-trained on the WebLI dataset using up to 2048 TPU-v5e chips. The model uses a Gemma tokenizer with a 256k vocabulary. Checkpoints are available via Hugging Face and in npz format from Google Cloud Storage. The weights are listed under the Apache 2.0 license.

best for

·Zero-shot image classification with arbitrary candidate labels
·Image-text retrieval (e.g., finding images from text descriptions)
·Vision encoder for Vision-Language Models (VLMs)
·Localization and dense prediction tasks

FAQ

What is the main improvement of SigLIP 2 over the original SigLIP?

SigLIP 2 adds captioning-based pretraining, self-supervised losses, and online data curation, leading to better zero-shot, retrieval, and dense prediction performance.

How many parameters does the Giant variant have?

1 billion parameters.

What input format does the model expect?

It expects an image and a list of candidate text labels; the image can be passed as a URL or as a loaded PIL image.
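
As a rough illustration of that input format, the sketch below runs the checkpoint locally through the Hugging Face transformers zero-shot image classification pipeline. It assumes a transformers release recent enough to include SigLIP 2 support and enough memory for a roughly 1B-parameter model. The classify wrapper and the example URL are hypothetical, and this is not the gigarouter API.

```python
from typing import Any

MODEL_ID = "google/siglip2-giant-opt-patch16-384"


def classify(image: Any, candidate_labels: list[str]) -> list[dict]:
    """Score candidate labels for one image (URL, local path, or PIL.Image)."""
    if not candidate_labels:
        raise ValueError("candidate_labels must not be empty")

    # Imported lazily: building the pipeline downloads the model weights.
    from transformers import pipeline

    classifier = pipeline(task="zero-shot-image-classification", model=MODEL_ID)
    # Returns a list of {"score": float, "label": str} dicts.
    return classifier(image, candidate_labels=candidate_labels)


# Example call (placeholder URL, not a real asset):
# results = classify("https://example.com/photo.jpg", ["a cat", "a dog", "a car"])
# for r in results:
#     print(r["label"], round(r["score"], 4))
```

For repeated calls, it would make sense to build the pipeline once and reuse it rather than reloading the weights on every request.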

How can I call this model via the gigarouter API?

The hosted endpoint is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with an API key, sending an image and candidate labels for zero-shot classification.

What license is the model released under?

The model is listed under the Apache 2.0 license (apache-2.0); check the repository for the full terms.

not yet live

We're benchmarking and onboarding SigLIP 2 Giant as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336