SigLIP Base (Patch16-224)

google/siglip-base-patch16-224

published Sep 2023 · updated Sep 2024

SigLIP Base (Patch16-224) is a zero-shot image model pre-trained with a sigmoid loss for language-image pre-training. It supports tasks like zero-shot classification and image-text retrieval, and its loss does not require a global view of pairwise similarities.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

1.4M

license

apache-2.0

specs

Task	Zero-Shot Image Classification & Image-Text Retrieval
Architecture	Vision Transformer (ViT-B/16) with text encoder
Training Data	WebLI dataset (English image-text pairs)
Input Resolution	224x224

about this model

SigLIP (google/siglip-base-patch16-224) is a zero-shot image classification and image-text retrieval model that uses a sigmoid loss function for language-image pre-training, introduced in the ICCV 2023 Oral paper "Sigmoid Loss for Language Image Pre-Training". Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of pairwise similarities for normalization, enabling scaling to larger batch sizes while maintaining strong performance at smaller batch sizes.

Model Description

SigLIP follows the CLIP multimodal architecture but replaces the softmax-based contrastive loss with a pairwise sigmoid loss. Because each image-text pair is scored independently, the loss needs no normalization across the whole batch. The paper pushes the batch size up to one million and finds that the gains diminish quickly, with a batch size of 32k being sufficient. The base variant is pre-trained at 224x224 resolution on the English image-text pairs of the WebLI dataset.
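As a rough illustration of the pairwise sigmoid loss, the sketch below scores every image-text combination in a batch as an independent binary problem: matching pairs get label +1 and all other pairs get label -1. It is a minimal, dependency-free sketch and not the reference implementation. The function names are hypothetical, and the default `temperature` and `bias` values are chosen for illustration only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sigmoid loss over all image-text combinations in a batch.

    Row i of image_embs is assumed to match row i of text_embs.
    temperature and bias are illustrative defaults (assumption).
    """
    if len(image_embs) != len(text_embs):
        raise ValueError("expected one text embedding per image embedding")
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total -= log_sigmoid(label * logit)
    # Average over the number of images in the batch.
    return total / len(images)


# Toy batch: aligned pairs should give a lower loss than shuffled pairs.
embs = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = pairwise_sigmoid_loss(embs, embs)
loss_shuffled = pairwise_sigmoid_loss(embs, [[0.0, 1.0], [1.0, 0.0]])
```

Each term depends only on a single image-text pair, so no softmax denominator over the full batch is needed.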

Training Details

Images are resized to 224x224 and normalized with mean 0.5 and standard deviation 0.5 per channel. Text is tokenized and padded to a fixed length of 64 tokens. The base model was trained on 16 TPU-v4 chips for three days. A related SigLiT configuration (SigLIP with Locked-image Tuning) achieved 84.5% ImageNet zero-shot accuracy using only four TPU-v4 chips in two days.
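The preprocessing described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the model's actual processor: it assumes 8-bit pixel values are first rescaled to [0, 1] before normalization, that over-long text is truncated, and that the pad token ID is 0. The function names and `pad_id` are hypothetical; check the real tokenizer for the actual padding ID.

```python
def normalize_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Normalize one 8-bit channel value with the given mean and std."""
    scaled = value / 255.0  # assumption: rescale 0-255 to [0, 1] first
    return (scaled - mean) / std


def pad_or_truncate(token_ids, max_length: int = 64, pad_id: int = 0):
    """Return exactly max_length token IDs.

    pad_id=0 is a placeholder (assumption); truncation of longer
    inputs is also an assumption, the source only mentions padding.
    """
    ids = list(token_ids)[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


darkest = normalize_pixel(0)      # lower end of the normalized range
brightest = normalize_pixel(255)  # upper end of the normalized range
padded = pad_or_truncate([5, 6, 7])
```

With mean 0.5 and standard deviation 0.5, channel values end up in the range [-1, 1].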

Benchmark Results

The following comparison of SigLIP versus CLIP is taken from the paper:

[Figure: Zero-shot accuracy comparison between SigLIP and CLIP across multiple datasets]

Further analysis in the paper demonstrates that the sigmoid loss remains effective at both small and large batch sizes, with performance saturating at a batch size of around 32k.

best for

· Zero-shot image classification without fine-tuning
· Image-text similarity search and retrieval
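For the retrieval use case, a common pattern is to embed the text query and the candidate images, then rank the images by cosine similarity. The sketch below shows only the ranking step. The toy two-dimensional embeddings and the function names are hypothetical stand-ins for the vectors a model such as this one would produce.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_emb, image_embs):
    """Return (image_index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(text_emb, emb)) for i, emb in enumerate(image_embs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings for illustration only (not real model outputs).
query = [1.0, 0.0]
gallery = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = rank_images(query, gallery)
best_index = ranking[0][0]
```

The same ranking works in the other direction (image query, text candidates) by swapping the arguments.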

FAQ

What is SigLIP and how does it differ from CLIP?

SigLIP follows the CLIP architecture but trains with a pairwise sigmoid loss instead of a softmax-based contrastive loss. Each image-text pair is scored independently, so no normalization over all pairwise similarities is needed.
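The practical difference shows up when turning similarity logits into scores. The sketch below contrasts the two on a hypothetical set of logits for one image against three candidate labels: sigmoid scores each label on its own, while softmax scores depend on every other label in the set. The logit values and function names are made up for illustration.

```python
import math


def sigmoid_scores(logits):
    """Score each logit independently; results need not sum to 1."""
    scores = []
    for z in logits:
        if z >= 0:
            scores.append(1.0 / (1.0 + math.exp(-z)))
        else:
            e = math.exp(z)
            scores.append(e / (1.0 + e))
    return scores


def softmax_scores(logits):
    """Normalize over all logits; results always sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one image against three candidate labels.
logits = [2.0, -1.0, -3.0]
independent = sigmoid_scores(logits)
normalized = softmax_scores(logits)
```

Removing or adding candidate labels changes every softmax score but leaves the sigmoid score of each remaining label untouched.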

What input format does the model expect?

Images are resized to 224x224 and each RGB channel is normalized with mean 0.5 and standard deviation 0.5. Text is tokenized and padded to a fixed length of 64 tokens.

How can I use this model via the gigarouter API?

Once the model is live, use the OpenAI-compatible endpoint with your API key; refer to the gigarouter documentation for details.

What zero-shot ImageNet accuracy does the model achieve?

The related SigLiT model (SigLIP with Locked-image Tuning) achieves 84.5% zero-shot accuracy on ImageNet. That figure applies to the SigLiT configuration, not to this base checkpoint.

What were the training compute requirements?

The base model was trained on 16 TPU-v4 chips for three days.

not yet live

We're benchmarking and onboarding SigLIP Base (Patch16-224) as a hosted, OpenAI-compatible API.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336