
SigLIP Base Patch16 256

google/siglip-base-patch16-256

published Jan 2024 · updated Sep 2024

SigLIP Base Patch16 256 is a zero-shot-image model that uses a sigmoid loss function for language-image pre-training, enabling efficient image classification and retrieval without requiring global pairwise similarity normalization.

est. price: ~$0.094 / 1k images (estimated, set at launch)
API providers: 0
downloads / mo: 26.9K
license: apache-2.0

specs

Task: Zero-shot image classification, image-text retrieval
Architecture: SigLIP (CLIP-like multimodal model with sigmoid loss), base-sized, patch size 16, input resolution 256x256
License: apache-2.0

about this model

google/siglip-base-patch16-256 is a zero-shot image classification model that was pre-trained on image-text pairs with a sigmoid loss function. Because the loss does not require global normalization of pairwise similarities, the model supports zero-shot classification and image-text retrieval.

How it works

Unlike standard contrastive learning with softmax normalization, SigLIP computes a sigmoid loss on each image-text pair independently, treating matching pairs as positives and all other pairings in the batch as negatives. Because no normalization across the batch is required, the loss is decoupled from batch size and trains effectively at both small and large batch sizes. The model was pre-trained on the WebLI dataset (English image-text pairs) at 256x256 resolution. Text inputs are tokenized and padded to 64 tokens; images are resized and normalized with mean 0.5 and standard deviation 0.5 per channel.
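The pairwise loss described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the big_vision implementation: the function names are hypothetical, embeddings are L2-normalized lists of floats, and `temperature` and `bias` are fixed illustrative constants here, whereas in the paper both are learned parameters.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # log(sigmoid(x)) = -log(1 + exp(-x)); branch on sign to avoid overflow.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch (hypothetical helper).

    image_embs[i] and text_embs[i] are assumed to be a matching pair;
    every other combination (i != j) is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(imgs[i], txts[j]))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0  # +1 positive, -1 negative
            # Each pair contributes on its own; no batch-wide softmax.
            total += log_sigmoid(label * logit)
    return -total / n


aligned = sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With these toy embeddings, the correctly paired batch (`aligned`) yields a lower loss than the batch whose texts are swapped (`swapped`), which is the behavior the loss is meant to reward.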

Training and compute

The model was trained on 16 TPU-v4 chips over three days. The paper finds that a batch size of 32k is sufficient: the authors pushed the batch size up to one million and saw the benefits diminish quickly. The sigmoid loss both scales to larger batches and performs better than the softmax loss at smaller ones.

Benchmark performance

The underlying approach achieved strong zero-shot results. The SigLiT variant (SigLIP combined with Locked-image Tuning), trained on only four TPU-v4 chips in two days, attained 84.5% top-1 accuracy on ImageNet zero-shot classification. This figure describes the SigLiT configuration from the paper, not this base checkpoint. The original paper's evaluation comparison between SigLIP and CLIP is summarized in the figure below.

Figure: evaluation results comparing SigLIP to CLIP across multiple tasks, with SigLIP outperforming CLIP on several benchmarks.

Additional details

The method was introduced in Zhai et al., Sigmoid Loss for Language Image Pre-Training (arXiv:2303.15343), which was presented as an Oral paper at ICCV 2023. The pre-trained weights are open-sourced via Google Research’s big_vision repository (Apache 2.0 license).

FAQ

What is the main advantage of SigLIP over CLIP?

SigLIP uses a sigmoid loss that operates on individual image-text pairs, removing the need for global normalization of pairwise similarities. As a result, training scales to larger batch sizes and also performs better than CLIP's softmax loss at smaller batch sizes.
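To make "no global normalization" concrete, the sketch below compares softmax and sigmoid scoring on a set of hypothetical image-to-label logits. The logit values are made up for illustration and do not come from the model; the point is only that a softmax score depends on every other candidate, while a sigmoid score depends on its own logit alone.

```python
import math


def softmax(logits):
    """Softmax over a list of logits (normalizes across all entries)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function for a single logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Hypothetical logits for one image against three candidate labels.
logits = [2.0, -1.0, -3.0]
softmax_scores = softmax(logits)
sigmoid_scores = [sigmoid(x) for x in logits]

# Add a fourth candidate label: every softmax score shifts,
# but the first three sigmoid scores stay exactly the same.
extended = logits + [2.5]
softmax_extended = softmax(extended)
sigmoid_extended = [sigmoid(x) for x in extended]
```

One practical consequence is that sigmoid scores need not sum to 1, so several labels (or none) can score highly for the same image.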

What input format does the model expect?

Images are resized to 256x256 pixels and normalized with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). Text is tokenized and padded to 64 tokens.
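The normalization and padding steps can be sketched as follows. This is an illustration under assumptions rather than the model's actual preprocessing code: it assumes 8-bit pixel values are first rescaled from [0, 255] to [0, 1] before normalization, it omits the resize to 256x256, it truncates over-long text (the text above only states padding), and `pad_id` is a hypothetical parameter because the tokenizer's real padding id is not given here.

```python
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)
MAX_TEXT_TOKENS = 64


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to normalized floats.

    Assumes a rescale from [0, 255] to [0, 1], followed by
    per-channel (value - mean) / std with the constants above.
    """
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


def pad_token_ids(token_ids, pad_id, max_length=MAX_TEXT_TOKENS):
    """Truncate or right-pad a list of token ids to max_length.

    pad_id is a placeholder; use the value defined by the real tokenizer.
    """
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
padded = pad_token_ids([5, 6, 7], pad_id=0)
```

Under these assumptions, pixel values end up in the range [-1, 1], and every text input becomes a fixed-length sequence of 64 ids.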

How do I call this model via the gigarouter API?

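The model is not yet live on gigarouter, so there is no hosted endpoint to call today. Once it launches, use the gigarouter OpenAI-compatible endpoint with your API key: send an image URL or base64-encoded image along with candidate text labels to perform zero-shot classification.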

What is the license for this model?

This page lists the license as apache-2.0, which matches the Apache 2.0 license of the big_vision repository where the pre-trained weights are released.

What batch size was used during training?

The model card does not specify the batch size used for this model. The paper recommends a batch size of 32k as sufficient for SigLIP training.

not yet live

We're benchmarking and onboarding SigLIP Base Patch16 256 as a hosted, OpenAI-compatible API.
