SigLIP Base Multilingual
google/siglip-base-patch16-256-multilingual
published Jan 2024 · updated Sep 2024
SigLIP Base Multilingual is a zero-shot image classification and image-text retrieval model that uses a sigmoid loss for language-image pre-training.
specs
| Task | Zero-shot image classification and image-text retrieval |
| Architecture | Vision Transformer (ViT) base, patch size 16, resolution 256x256 |
| Training Data | WebLI dataset (no language filter) |
| Compute | 16 TPU-v4 chips for 3 days |
about this model
google/siglip-base-patch16-256-multilingual is a zero-shot image classification and image-text retrieval model that uses a pairwise sigmoid loss function for language-image pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of pairwise similarities for normalization, enabling scaling to larger batch sizes while maintaining strong performance at smaller batch sizes.
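As a rough illustration of the pairwise sigmoid loss described above, the sketch below treats every image-text pair in a batch as an independent binary decision (matched or not matched), so no softmax over the whole batch is needed. It is a minimal pure-Python sketch, not the training code: the function names are made up for this example, and the default `temperature` and `bias` values are assumptions chosen to mirror the initialization described in the SigLIP paper.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum the binary log-loss over all image-text pairs, averaged per image.

    temperature and bias are learnable in real training; the defaults here
    are assumed illustrative values.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            label = 1.0 if i == j else -1.0  # +1 for the matched pair, -1 otherwise
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            total += log_sigmoid(label * logit)
    return -total / len(imgs)


# Toy batch: correctly paired embeddings versus the same texts in swapped order.
aligned = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = pairwise_sigmoid_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Each term depends only on a single pair, which is why the loss needs no batch-wide normalization step; on the toy batch, the correctly paired embeddings give the lower loss.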
The model was pre-trained on the WebLI dataset at 256x256 resolution. Images are resized to 256x256 and normalized per RGB channel (mean 0.5, std 0.5); text is tokenized and padded to 64 tokens. Training used 16 TPU-v4 chips over three days.
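The normalization and padding steps can be sketched in a few lines. This is an illustrative sketch only: the helper names are hypothetical, `pad_id=0` is a placeholder rather than the tokenizer's actual padding id, and the resize step is omitted because it needs an image library.

```python
def normalize_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Map an 8-bit channel value (0..255) to the range [-1, 1]."""
    return (value / 255.0 - mean) / std


def pad_or_truncate(token_ids, max_length: int = 64, pad_id: int = 0):
    """Return exactly max_length token ids (pad_id is a placeholder value)."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


# One RGB pixel and a short token sequence, both made up for the example.
pixel = [normalize_pixel(c) for c in (0, 128, 255)]
tokens = pad_or_truncate([17, 42, 7])
```

With mean 0.5 and std 0.5, a channel value of 0 maps to -1.0 and 255 maps to 1.0, so every input channel lands in [-1, 1].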
Key strengths
- Uses a sigmoid loss that needs no batch-wide normalization, so training stays effective at small batch sizes; the paper finds a batch size of 32k sufficient and scales up to one million.
- Multilingual support via training on WebLI without language filtering.
- Presented as an Oral paper at ICCV'23.
Benchmark results
A SigLiT variant (SigLIP with Locked-image Tuning) achieves 84.5% ImageNet zero-shot accuracy when trained on only four TPU-v4 chips for two days. The paper also compares SigLIP against CLIP across multiple benchmarks.
This model is hosted on gigarouter as a managed, OpenAI-compatible API. No local installation or model loading is required; users send inference requests directly to the API endpoint.
best for
- Zero-shot classification of images into custom categories without fine-tuning
- Multilingual image-text retrieval for search and recommendation systems
- Building image similarity or captioning pipelines with flexible text prompts
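For the zero-shot classification use case, a sigmoid-trained model scores each candidate label independently, so the scores need not sum to 1. The sketch below shows that scoring step on toy embeddings. It makes several assumptions: the embeddings are invented unit vectors standing in for the model's image and text encoder outputs, the function names are hypothetical, and the `temperature` and `bias` defaults are placeholders for values the model learns during training.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def zero_shot_scores(image_emb, label_embs, temperature=10.0, bias=-10.0):
    """Score each label independently against one image embedding.

    image_emb and the values of label_embs are assumed to be unit-length
    vectors; temperature and bias are placeholder values.
    """
    scores = {}
    for label, text_emb in label_embs.items():
        similarity = sum(a * b for a, b in zip(image_emb, text_emb))
        scores[label] = sigmoid(temperature * similarity + bias)
    return scores


# Toy example: the image embedding coincides with the "cat" text embedding.
scores = zero_shot_scores(
    [1.0, 0.0],
    {"a photo of a cat": [1.0, 0.0], "a photo of a dog": [0.0, 1.0]},
)
best_label = max(scores, key=scores.get)
```

Because labels are scored independently, a threshold on each score can be used for multi-label decisions, while taking the maximum gives a single predicted category.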
FAQ
SigLIP uses a pairwise sigmoid loss instead of softmax normalization, which allows training with larger batch sizes and gives better performance at smaller batch sizes.
Images are resized to 256x256 and normalized with mean=0.5, std=0.5; text is tokenized and padded to 64 tokens.
To run inference, use the gigarouter OpenAI-compatible endpoint with your API key, sending an image (URL or base64) and a list of candidate text labels.
The model supports multiple languages: it was pre-trained on the WebLI dataset without a language filter.
According to the paper, a batch size of 32k is sufficient; larger batches yield diminishing returns.
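The request described in the FAQ (a base64-encoded image plus candidate text labels) might be assembled roughly as follows. This is a sketch under loud assumptions: the page does not document the request schema, so the field names `image` and `candidate_labels` are hypothetical, and the code only builds a JSON body without contacting any endpoint.

```python
import base64
import json

MODEL_ID = "google/siglip-base-patch16-256-multilingual"


def build_request_body(image_bytes: bytes, candidate_labels, model: str = MODEL_ID) -> str:
    """Serialize an image and candidate labels into a JSON request body.

    The "image" and "candidate_labels" field names are assumptions for
    illustration, not a documented schema.
    """
    payload = {
        "model": model,
        "image": base64.b64encode(image_bytes).decode("ascii"),  # hypothetical field
        "candidate_labels": list(candidate_labels),               # hypothetical field
    }
    return json.dumps(payload)


# Placeholder bytes stand in for real image file contents.
body = build_request_body(b"\x89PNG-placeholder", ["a cat", "un chien"])
```

The actual endpoint path, authentication header, and field names should be taken from the provider's API reference before sending a request.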
We're benchmarking and onboarding SigLIP Base Multilingual as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.