SigLIP Large 384
google/siglip-large-patch16-384
published Jan 2024 · updated Sep 2024
SigLIP Large 384 is a zero-shot-image model that uses a sigmoid loss for language-image pre-training, enabling image classification and retrieval without task-specific fine-tuning.
specs
| Task | Zero-shot image classification, image-text retrieval |
| Architecture | CLIP-based multimodal model with Vision Transformer (ViT) and text encoder |
| Training Data | WebLI dataset (English image-text pairs) |
| Image Resolution | 384x384 pixels (resized and normalized) |
| Training Compute | 16 TPU-v4 chips for 3 days |
| Loss Function | Sigmoid loss (pairwise, no global normalization) |
about this model
google/siglip-large-patch16-384 is a zero-shot image classification and image-text retrieval model that extends the CLIP architecture with a pairwise sigmoid loss, eliminating the need for global pairwise similarity normalization. This design enables effective training at both small and large batch sizes, scaling up to one million examples per batch while maintaining performance with a practical batch size of 32k.
Key strengths
- Improved loss function: The sigmoid loss operates independently on each image-text pair, allowing larger batch scaling and better downstream accuracy compared to softmax-based contrastive learning.
- High resolution input: Pre-trained on 384×384 images from the WebLI dataset, balancing detail and computational efficiency.
- Proven benchmark results: The SigLiT variant (SigLIP + Locked-image Tuning) achieves 84.5% ImageNet zero-shot accuracy using only four TPUv4 chips in two days. The paper was presented as an Oral at ICCV 2023.
- Efficient training: This large-sized model was trained on 16 TPU-v4 chips over three days.
Evaluation against CLIP
The following comparison from the original paper illustrates SigLIP’s performance across multiple zero-shot benchmarks relative to CLIP:
Model background
Introduced by Zhai et al. in Sigmoid Loss for Language Image Pre-Training (2023), the model was contributed to the Transformers ecosystem on January 8, 2024. The SigLIP family includes base, large, and shape-optimized 400M variants. This specific checkpoint is optimized for zero-shot inference and can be used directly for tasks such as classifying images against arbitrary text labels or retrieving relevant images from natural language queries.
best for
- ·Zero-shot classification of product images without labeled training data
- ·Image-text retrieval for content-based search or recommendation
- ·Multi-label classification where multiple concepts may be present in an image
FAQ
It excels at zero-shot image classification and image-text retrieval, where the model can compare arbitrary text labels or captions with images without requiring fine-tuning.
SigLIP replaces the softmax-based contrastive loss of CLIP with a pairwise sigmoid loss. This removes the need for global pairwise similarity normalization and allows better performance with smaller batch sizes and easier scaling to very large batches.
Images must be resized to 384x384 pixels, normalized with mean 0.5 and std 0.5 per channel. Text must be tokenized and padded to 64 tokens. The model processes image-text pairs and returns logits and sigmoid probabilities.
Use the OpenAI-compatible endpoint with your gigarouter API key. Send a request with the image URL or base64-encoded image and the candidate text labels. The API returns probability scores for each label.
The model card does not specify a license. The original paper is open access, but users should verify terms on the official Hugging Face repository before commercial use.
We're benchmarking and onboarding SigLIP Large 384 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.