SigLIP Base (Patch16-224)
google/siglip-base-patch16-224
published Sep 2023 · updated Sep 2024
SigLIP Base (Patch16-224) is a zero-shot image model that uses a sigmoid loss function for language-image pre-training, enabling tasks like zero-shot classification and image-text retrieval without requiring a global view of pairwise similarities.
specs
| Task | Zero-Shot Image Classification & Image-Text Retrieval |
| Architecture | Vision Transformer (ViT-B/16) with text encoder |
| Training Data | WebLI dataset (English image-text pairs) |
| Input Resolution | 224x224 |
about this model
SigLIP (google/siglip-base-patch16-224) is a zero-shot image classification and image-text retrieval model that uses a sigmoid loss function for language-image pre-training, introduced in the ICCV 2023 Oral paper "Sigmoid Loss for Language Image Pre-Training". Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of pairwise similarities for normalization, enabling scaling to larger batch sizes while maintaining strong performance at smaller batch sizes.
Model Description
SigLIP follows the CLIP multimodal architecture but replaces the softmax-based contrastive loss with a pairwise sigmoid loss. This design decouples batch size from loss computation, allowing the model to benefit from a batch size of 32k without diminishing returns (pushed to 1 million in experiments). The base variant is pre-trained at 224x224 resolution on the English image-text pairs of the WebLI dataset.
Training Details
Images are resized to 224x224 and normalized with mean 0.5 and standard deviation 0.5 per channel. Text is tokenized and padded to a fixed length of 64 tokens. The base model was trained on 16 TPU-v4 chips for three days. A related SigLiT configuration (SigLIP with Locked-image Tuning) achieved 84.5% ImageNet zero-shot accuracy using only 4 TPUv4 chips in two days.
Benchmark Results
The following comparison of SigLIP versus CLIP is taken from the paper:
Further analysis in the paper demonstrates that the sigmoid loss maintains effectiveness at both small and large batch sizes, with optimal performance at 32k pairs.
best for
- ·Zero-shot image classification without fine-tuning
- ·Image-text similarity search and retrieval
FAQ
SigLIP uses a sigmoid loss instead of softmax, allowing operation on individual image-text pairs without global pairwise similarity normalization.
Images resized to 224x224 with normalized RGB channels (mean 0.5, std 0.5) and text tokenized to 64 tokens.
Use the OpenAI-compatible endpoint with your API key; refer to gigarouter documentation for details.
The related SigLiT model (SigLIP + Locked-image Tuning) achieves 84.5% zero-shot accuracy on ImageNet.
The base model was trained on 16 TPU-v4 chips for three days.
We're benchmarking and onboarding SigLIP Base (Patch16-224) as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.