SigLIP Base Multilingual

google/siglip-base-patch16-256-multilingual

published Jan 2024 · updated Sep 2024

SigLIP Base Multilingual is a zero-shot image classification and image-text retrieval model that uses a sigmoid loss for language-image pre-training.

est. price

~$0.094 / 1k images · estimated, set at launch

API providers

downloads / mo

22.6K

license

apache-2.0

specs

Task	Zero-shot image classification and image-text retrieval
Architecture	Vision Transformer (ViT) base, patch size 16, resolution 256x256
Training Data	WebLI dataset (no language filter)
Compute	16 TPU-v4 chips for 3 days

about this model

google/siglip-base-patch16-256-multilingual is a zero-shot image classification and image-text retrieval model that uses a pairwise sigmoid loss function for language-image pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of pairwise similarities for normalization, enabling scaling to larger batch sizes while maintaining strong performance at smaller batch sizes.
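The pairwise sigmoid loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code: every image-text pair in the batch is treated as an independent binary decision (matching pairs labeled +1, all others -1), so no normalization across the batch is needed. The function names and the default `temperature` and `bias` values are illustrative assumptions; in practice both are learned parameters.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def l2_normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum of independent per-pair binary losses, averaged over the images.

    temperature and bias are placeholder values (assumptions), standing in
    for the learned scale and offset applied to the similarity.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0  # +1 for the matching pair only
            total -= log_sigmoid(label * logit)
    return total / len(imgs)

# Toy batch: two images whose embeddings match their own captions.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = pairwise_sigmoid_loss(images, matched_texts)
swapped_loss = pairwise_sigmoid_loss(images, swapped_texts)
```

With these toy embeddings the correctly paired batch yields a much lower loss than the swapped one, and each term depends only on its own pair rather than on a batch-wide softmax denominator.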

The model was pre-trained on the WebLI dataset at 256x256 resolution. Input images are resized to 256x256 and normalized per RGB channel (mean 0.5, std 0.5); text is tokenized and padded to 64 tokens. Training used 16 TPU-v4 chips for three days.
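The input conventions above amount to simple arithmetic, sketched here in plain Python. The sketch assumes 8-bit pixel values are first rescaled to the 0..1 range before the mean/std step, which is the usual convention but is not stated on this page. The helper names are hypothetical, and `pad_id` is left as a required argument because the tokenizer's padding ID is not given here.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Normalize one 8-bit channel value (0..255)."""
    scaled = value / 255.0        # assumption: rescale to [0, 1] first
    return (scaled - mean) / std  # mean = std = 0.5 maps [0, 1] to [-1, 1]

def normalize_image(rows):
    """rows: list of rows, each a list of (r, g, b) tuples of 8-bit values."""
    return [[tuple(normalize_pixel(c) for c in px) for px in row] for row in rows]

def pad_or_truncate(token_ids, pad_id, max_length=64):
    """Force a token sequence to exactly max_length entries."""
    ids = list(token_ids)[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

# A 1x2 "image": one black pixel and one white pixel.
normalized = normalize_image([[(0, 0, 0), (255, 255, 255)]])
padded = pad_or_truncate([5, 6, 7], pad_id=1)
```

In practice the model's own processor would handle resizing, normalization, and tokenization; the sketch only makes the numeric ranges explicit.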

Key strengths

Uses a sigmoid loss that decouples the loss computation from batch size, so training works across a wide range of batch sizes: from small batches, through 32k (reported as sufficient), up to one million.
Multilingual support via training on WebLI without language filtering.
Presented as an Oral paper at ICCV'23.

Benchmark results

A SigLiT variant (SigLIP with Locked-image Tuning) achieves 84.5% ImageNet zero-shot accuracy when trained on only 4 TPUv4 chips for two days. The paper includes comparisons of SigLIP against CLIP across multiple benchmarks, as shown below.

Comparison chart of SigLIP versus CLIP evaluation results across multiple benchmarks

gigarouter is onboarding this model as a managed, OpenAI-compatible API. Once it is live, no local installation or model loading will be required; users will send inference requests directly to the API endpoint.

best for

·Zero-shot classification of images into custom categories without fine-tuning
·Multilingual image-text retrieval for search and recommendation systems
·Building image similarity or captioning pipelines with flexible text prompts

FAQ

What makes SigLIP different from CLIP?

SigLIP replaces softmax normalization with a pairwise sigmoid loss, which allows training with larger batch sizes and also performs better at smaller batch sizes.
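One way to see the difference is to score the same image against several candidate labels both ways. The sketch below is illustrative only: the logits are made-up numbers, not model outputs. Softmax scores compete and always sum to 1, while sigmoid scores are computed per pair, so adding or removing a candidate label does not change the score of the others.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up image-text logits for three candidate labels.
logits = [2.0, -1.0, -3.0]

softmax_scores = softmax(logits)               # normalized across all labels
sigmoid_scores = [sigmoid(x) for x in logits]  # each label scored on its own

# Dropping the last label changes every softmax score but no sigmoid score.
softmax_without_last = softmax(logits[:2])
sigmoid_without_last = [sigmoid(x) for x in logits[:2]]
```

A practical consequence is that sigmoid scores can all be low when none of the candidate labels fits the image, whereas softmax always distributes a total of 1 across the candidates.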

What input formats does the model expect?

Images resized to 256x256, normalized with mean=0.5, std=0.5; text tokenized and padded to 64 tokens.

How can I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, sending an image URL or base64 and a list of candidate text labels.
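As a rough sketch of what such a request might look like, the Python below builds (but does not send) an HTTP request using only the standard library. The base URL, the `/classify` route, and the payload field names (`image`, `candidate_labels`, `url`, `b64`) are all hypothetical placeholders, since this page does not document the actual schema; check the gigarouter API reference for the real values.

```python
import base64
import json
import urllib.request

# Hypothetical placeholder; the real base URL is not given on this page.
BASE_URL = "https://api.gigarouter.example/v1"
MODEL_ID = "google/siglip-base-patch16-256-multilingual"

def build_request(api_key, labels, image_url=None, image_bytes=None):
    """Build a POST request carrying one image and candidate text labels."""
    if (image_url is None) == (image_bytes is None):
        raise ValueError("provide exactly one of image_url or image_bytes")
    if image_url is not None:
        image = {"url": image_url}
    else:
        image = {"b64": base64.b64encode(image_bytes).decode("ascii")}
    # Field names below are assumptions, not a documented schema.
    payload = {"model": MODEL_ID, "image": image, "candidate_labels": list(labels)}
    return urllib.request.Request(
        BASE_URL + "/classify",  # hypothetical route
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_request(
    "YOUR_API_KEY",
    ["a photo of a cat", "ein Foto von einem Hund"],
    image_url="https://example.com/cat.jpg",
)
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out here because the endpoint shown is a placeholder.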

Is the model multilingual?

Yes. It was pre-trained on the WebLI dataset without a language filter, so it supports multiple languages.

What is the optimal batch size for training SigLIP?

According to the paper, a batch size of 32k is sufficient; larger batches yield diminishing returns.

not yet live

We're benchmarking and onboarding SigLIP Base Multilingual as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models


clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336