SigLIP Base Patch16 384
google/siglip-base-patch16-384
published Jan 2024 · updated Sep 2024
SigLIP Base Patch16 384 is a vision-language model pre-trained with a sigmoid loss on image-text pairs. It supports tasks such as zero-shot image classification and image-text retrieval.
specs
| Spec | Value |
| --- | --- |
| Task | Zero-shot image classification, image-text retrieval |
| Architecture | CLIP-based multimodal model with Vision Transformer (ViT) and text encoder |
| Parameters | 0.2B |
| License | Not specified in model card |
about this model
google/siglip-base-patch16-384 is a zero-shot image classification and image-text retrieval model that uses a sigmoid loss for language-image pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates on individual image-text pairs and does not require a global view of pairwise similarities for normalization. This design allows the model to scale to larger batch sizes while maintaining strong performance even at smaller batch sizes.
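To make the per-pair idea concrete, here is a minimal pure-Python sketch of a pairwise sigmoid loss. It assumes L2-normalized embeddings, treats matching indices as positive pairs and every other combination as a negative pair, and averages over the batch size. The function names and the `temperature` and `bias` defaults are illustrative assumptions, not values taken from this model's training configuration.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) for both signs of x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    # Dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum of independent binary log-losses over all image-text pairs,
    divided by the batch size. Embeddings are assumed L2-normalized.
    temperature and bias are illustrative (learnable in practice)."""
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            label = 1.0 if i == j else -1.0  # +1 for matched pair, -1 otherwise
            logit = temperature * dot(img, txt) + bias
            total += -log_sigmoid(label * logit)
    return total / n


# Two toy unit vectors: aligned pairs give a lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = sigmoid_pairwise_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_pairwise_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Each term depends only on one image-text pair, so no softmax over a row or column of the similarity matrix is needed. That is the property the paragraph above describes.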
Model Architecture and Training
The model is a multimodal vision-language model based on the CLIP architecture with a ViT-B/16 image encoder and a text encoder. It has 0.2 billion parameters and was pre-trained on the English image-text pairs of the WebLI dataset at a resolution of 384x384 pixels. Images are rescaled and normalized across RGB channels (mean 0.5, std 0.5); text is tokenized and padded to 64 tokens. Training used 16 TPU-v4 chips over three days.
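The preprocessing described above can be sketched in a few lines of pure Python. This sketch assumes 8-bit pixel values rescaled by 1/255 before normalization and assumes that over-long text is truncated. `PAD_ID` is a hypothetical placeholder, since the real padding id comes from the model's tokenizer. Resizing to 384x384 is left out.

```python
PAD_ID = 0          # hypothetical padding id; the actual tokenizer defines its own
MAX_TEXT_LEN = 64   # padded text length stated in the description above


def normalize_pixel(value, mean=0.5, std=0.5):
    # Rescale an 8-bit channel value to [0, 1], then normalize to [-1, 1].
    return (value / 255.0 - mean) / std


def normalize_image(pixels):
    # pixels is a nested list shaped [height][width][3] of 0..255 values.
    return [[[normalize_pixel(c) for c in px] for px in row] for row in pixels]


def pad_tokens(token_ids, max_len=MAX_TEXT_LEN, pad_id=PAD_ID):
    # Truncate (an assumption) and right-pad to a fixed length.
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


tiny_image = [[[0, 128, 255]]]           # a single RGB pixel
normalized = normalize_image(tiny_image)
padded = pad_tokens([5, 6, 7])
```

With mean 0.5 and std 0.5, the darkest channel value maps to -1 and the brightest to 1.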
Benchmark Performance
The paper introducing SigLIP reports that a related variant with a locked image tower (SigLiT) achieves 84.5% ImageNet zero-shot accuracy using only 4 TPUv4 chips in two days. The following comparison, taken from the paper, illustrates SigLIP’s zero-shot accuracy relative to CLIP across multiple datasets:

best for
- Zero-shot image classification without task-specific training
- Image-text retrieval and similarity scoring
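For both use cases, the model's image-text logits are typically passed through a sigmoid, so each candidate label or caption gets its own independent probability. The sketch below assumes the logits have already been produced by the model. The function name `rank_labels` and the example logit values are hypothetical.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def rank_labels(logits_by_label):
    """Turn per-label logits into independent probabilities and sort
    them from most to least likely."""
    probs = {label: sigmoid(logit) for label, logit in logits_by_label.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical logits for one image against three candidate captions.
ranking = rank_labels({
    "a photo of a cat": 2.0,
    "a photo of a dog": -3.0,
    "a photo of a car": -8.0,
})
best_label, best_prob = ranking[0]
```

Because the scores are not normalized against each other, they need not sum to 1. An image can score low on every label, or high on more than one.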
FAQ
SigLIP uses a sigmoid loss that operates on individual image-text pairs, removing the need for global pairwise similarity normalization and enabling better performance at smaller batch sizes.
Images are resized to 384x384 and normalized with mean 0.5 and std 0.5. Text is tokenized and padded to 64 tokens.
Use the gigarouter OpenAI-compatible endpoint with your API key, passing an image and candidate text labels for zero-shot classification.
The model has 0.2 billion parameters.
The base SigLIP model was trained on 16 TPU-v4 chips for three days.
We're benchmarking and onboarding SigLIP Base Patch16 384 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.