SigLIP Base

google/siglip-base-patch16-512

published Jan 2024 · updated Sep 2024

SigLIP Base is a zero-shot image classification model trained with a pairwise sigmoid loss for language-image pre-training. It supports tasks such as zero-shot classification and image-text retrieval.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

8.5K

license

apache-2.0

specs

Task	Zero-shot image classification, image-text retrieval
Architecture	Multimodal model with sigmoid loss (variant of CLIP)
Pre-training Data	WebLI (English image-text pairs)
Input Resolution	512 x 512

about this model

google/siglip-base-patch16-512 is a zero-shot image classification model trained with a sigmoid loss for language-image pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates on individual image-text pairs and does not require a global view of pairwise similarities for normalization. As a result, the loss scales to larger batch sizes and also performs better at smaller batch sizes.
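To make the pairwise idea concrete, here is a minimal sketch of a sigmoid loss over a small batch, written in plain Python for readability. It treats every image-text pair as an independent binary decision (matched pairs are positives, all other pairs are negatives), so no softmax over the whole batch is needed. The function names are hypothetical, and the fixed `temperature` and `bias` defaults are illustrative assumptions; in the paper both are learnable parameters.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pair_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    image_embs[i] and text_embs[i] are assumed to be a matched pair.
    temperature and bias are fixed here for illustration (assumption).
    """
    n = len(image_embs)
    total = 0.0
    for i, img in enumerate(image_embs):
        for j, txt in enumerate(text_embs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            # Each pair is its own binary problem: +1 if matched, -1 otherwise.
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / n
```

Because each term depends only on one pair, the double loop can be split across devices in chunks without first gathering a full similarity matrix for normalization.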

Overview

The model is based on the CLIP architecture but with a more effective loss function introduced by Zhai et al. (2023). It was pre-trained on the English image-text pairs of the WebLI dataset at a resolution of 512×512 pixels. Text inputs are tokenized and padded to 64 tokens; images are normalized across RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). Training used 16 TPU-v4 chips over three days.
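The preprocessing described above can be sketched in a few lines. This is an illustration only, assuming 8-bit pixel values are first rescaled to [0, 1] before normalization; the helper names and the `pad_id` value are hypothetical placeholders, and a real tokenizer defines its own padding token.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map an 8-bit channel value (0-255) to the normalized range [-1, 1]."""
    return (value / 255.0 - mean) / std


def pad_tokens(token_ids, max_length=64, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_length entries.

    pad_id=0 is a placeholder (assumption), not the model's actual pad token.
    """
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))
```

With mean 0.5 and standard deviation 0.5, a black pixel maps to -1.0 and a white pixel maps to 1.0.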

Performance

The paper reports that a related Locked-image Tuning variant (SigLiT), trained with only four TPU-v4 chips in two days, achieves 84.5% zero-shot top-1 accuracy on ImageNet. For the standard SigLIP pre-training procedure, a batch size of 32k is sufficient; the benefits of larger batches diminish quickly beyond that, even when the batch size is scaled to one million.

The paper (Sigmoid Loss for Language Image Pre-Training) compares SigLIP to CLIP across multiple downstream tasks.

[Figure: zero-shot accuracy of SigLIP and CLIP across multiple downstream tasks; SigLIP outperforms CLIP in most cases.]

Architecture Details

The model uses a Vision Transformer (ViT) base-sized patch-16 architecture with 512px input resolution. It is designed for zero-shot image classification and image-text retrieval tasks without requiring task-specific fine-tuning.
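As a quick worked example of what "patch-16 at 512px" implies, the sketch below computes the patch grid for a square image. The function name is hypothetical; the arithmetic is simply the image side divided by the patch side.

```python
def patch_grid(image_size=512, patch_size=16):
    """Return (patches per side, total patches) for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side
```

For this model, 512 / 16 gives 32 patches per side, or 1,024 image patches in total.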

best for

· Zero-shot classification of images into arbitrary text-defined categories
· Image-text retrieval: finding relevant images from text queries or vice versa
· Building custom image classifiers without fine-tuning
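For the zero-shot classification use case, a minimal local sketch with the Hugging Face transformers library might look like the following. It assumes `torch`, `transformers`, and `Pillow` are installed and that the checkpoint can be downloaded; the function names and the prompt template are illustrative choices, not part of this listing. It runs the open weights locally and does not use any hosted API.

```python
def build_prompts(labels):
    # Wrap each label in a short sentence; the template is an assumption.
    return [f"This is a photo of {label}." for label in labels]


def zero_shot_classify(image, labels, model_id="google/siglip-base-patch16-512"):
    """Score a PIL image against candidate labels; returns {label: probability}."""
    # Heavy dependencies are imported lazily so the helper above stays lightweight.
    import torch
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)
    # padding="max_length" pads the text to the length used during training.
    inputs = processor(
        text=build_prompts(labels),
        images=image,
        padding="max_length",
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # Sigmoid (not softmax): each label gets an independent probability.
    probs = torch.sigmoid(outputs.logits_per_image)[0]
    return dict(zip(labels, probs.tolist()))
```

A call such as `zero_shot_classify(Image.open("photo.jpg"), ["a cat", "a dog"])` would return one score per label. Because the scores are independent, they need not sum to 1.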

FAQ

What is the main advantage of SigLIP over standard CLIP?

SigLIP uses a pairwise sigmoid loss that does not require global pairwise normalization, allowing larger batch sizes and better performance at smaller batch sizes.

What input format does the model expect?

Images should be resized to 512x512 and normalized with mean 0.5 and std 0.5 per channel. Text is tokenized and padded to 64 tokens.

Can I use SigLIP Base for retrieval tasks?

Yes. Encode images and texts into embeddings and rank candidates by similarity. The model's pairwise logits can also be passed through a sigmoid to score each image-text pair independently.
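The ranking step of retrieval can be sketched independently of the model. The example below assumes you already have one text embedding and a list of image embeddings (for instance from the model's text and image encoders) and ranks images by cosine similarity. The function names are hypothetical.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_images(text_emb, image_embs):
    """Return image indices sorted from most to least similar to the text query."""
    query = l2_normalize(text_emb)
    scores = [
        sum(q * x for q, x in zip(query, l2_normalize(img)))
        for img in image_embs
    ]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
```

Swapping the roles of the arguments (one image embedding against many text embeddings) gives text retrieval for an image query.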

How do I call this model via the gigarouter API?

The hosted endpoint is not yet live. Once it is available, use the OpenAI-compatible endpoint with your API key, providing an image and candidate labels or text pairs for zero-shot classification or retrieval.

What is the model's license?

The model is listed under the Apache 2.0 license (apache-2.0). For the original training code and its terms, refer to the google-research/big_vision repository.

not yet live

We're benchmarking and onboarding SigLIP Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336