SigLIP Base Patch16 384

google/siglip-base-patch16-384

published Jan 2024 · updated Sep 2024

SigLIP Base Patch16 384 is a zero-shot image model pre-trained on image-text pairs with a sigmoid loss, suited to tasks such as zero-shot image classification and image-text retrieval.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

20.6K

license

apache-2.0

specs

Task	Zero-shot image classification, image-text retrieval
Architecture	CLIP-based multimodal model with Vision Transformer (ViT) and text encoder
Parameters	0.2B
License	apache-2.0

about this model

google/siglip-base-patch16-384 is a zero-shot image classification and image-text retrieval model that uses a sigmoid loss for language-image pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates on individual image-text pairs and does not require a global view of pairwise similarities for normalization. This design allows the model to scale to larger batch sizes while maintaining strong performance even at smaller batch sizes.
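
As a rough illustration of the pairwise sigmoid loss described above, the following pure-Python sketch scores every image-text pair in a batch independently: matched pairs get label +1, all other pairs get label -1, and no softmax over the batch is involved. The function names, the toy embeddings, and the fixed temperature and bias values are assumptions for illustration only, and this is not the reference implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum of independent binary losses over all image-text pairs.

    temperature and bias are fixed here (assumed values); in training
    they would normally be learnable parameters.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for the matched pair only
            logit = temperature * dot(images[i], texts[j]) + bias
            total += log_sigmoid(label * logit)
    return -total / n


# Toy batch: two images whose captions are either aligned or swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = sigmoid_pairwise_loss(images, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = sigmoid_pairwise_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Because each pair contributes its own term, the loss for one pair does not depend on the other similarities in the batch, which is the property the paragraph above contrasts with softmax normalization.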

Model Architecture and Training

The model is a multimodal vision-language model based on the CLIP architecture with a ViT-B/16 image encoder and a text encoder. It has 0.2 billion parameters and was pre-trained on the English image-text pairs of the WebLI dataset at a resolution of 384x384 pixels. Images are rescaled and normalized across RGB channels (mean 0.5, std 0.5); text is tokenized and padded to 64 tokens. Training used 16 TPU-v4 chips over three days.
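
The preprocessing described above can be sketched in a few lines of plain Python. This is a minimal sketch, not the model's actual processor: the helper names are hypothetical, `pad_id=0` is a placeholder rather than the tokenizer's real padding id, and truncating over-long inputs is an assumption the text above does not state.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Rescale an 8-bit channel value to [0, 1], then normalize to [-1, 1]."""
    return (value / 255.0 - mean) / std


def pad_tokens(token_ids, max_length=64, pad_id=0):
    """Pad (or, by assumption, truncate) a token list to a fixed length.

    pad_id is a hypothetical placeholder value.
    """
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


def num_patches(image_size=384, patch_size=16):
    """Number of non-overlapping square patches a ViT sees per image."""
    per_side = image_size // patch_size
    return per_side * per_side


darkest = normalize_pixel(0)        # -1.0
brightest = normalize_pixel(255)    # 1.0
padded = pad_tokens([5, 6, 7])      # length 64
patches = num_patches()             # 24 * 24 = 576
```

With mean 0.5 and std 0.5, channel values land in the range [-1, 1], and a 384x384 input split into 16x16 patches yields 576 patches per image.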

Benchmark Performance

The paper introducing SigLIP reports that a related variant with a locked image tower (SigLiT) achieves 84.5% ImageNet zero-shot accuracy using only 4 TPUv4 chips in two days. The following comparison, taken from the paper, illustrates SigLIP’s zero-shot accuracy relative to CLIP across multiple datasets:

Zero-shot accuracy comparison of SigLIP and CLIP on multiple benchmarks

best for

·Zero-shot image classification without task-specific training
·Image-text retrieval and similarity scoring
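
For zero-shot classification, one possible local workflow uses the Hugging Face `transformers` library. The sketch below is hedged: `classify_image` and `rank_labels` are hypothetical helper names, the image path is supplied by the caller, and the heavy dependencies (`torch`, `transformers`, Pillow) are imported lazily so the scoring helpers can be read and run on their own.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def rank_labels(labels, logits):
    """Turn per-label logits into independent probabilities, best first."""
    scored = [(label, sigmoid(logit)) for label, logit in zip(labels, logits)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def classify_image(image_path, labels, model_id="google/siglip-base-patch16-384"):
    """Score candidate text labels for one image.

    Requires torch, transformers and Pillow, and downloads the weights
    on first use.
    """
    # Imported lazily so the helpers above work without these packages.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)
    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=labels, images=image, padding="max_length", return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits_per_image[0].tolist()
    return rank_labels(labels, logits)


# Scoring with made-up logits, no model download needed.
ranked = rank_labels(["a photo of a cat", "a photo of a dog"], [-2.0, 1.0])
```

Each label is scored with its own sigmoid, so the probabilities need not sum to 1; several labels, or none, can score highly for the same image. A call such as `classify_image("cat.jpg", ["a photo of a cat", "a photo of a dog"])` would run the full model, where `cat.jpg` is a hypothetical local file.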

FAQ

What is the main advantage of SigLIP over CLIP?

SigLIP uses a sigmoid loss that operates on individual image-text pairs, so it needs no global normalization over all pairwise similarities in a batch. This lets the batch size scale further and also gives stronger performance at smaller batch sizes.

What input format does the model expect?

Images are resized to 384x384 and normalized with mean 0.5 and std 0.5. Text is tokenized and padded to 64 tokens.

How can I call this model via the gigarouter API?

The hosted endpoint is not yet live. Once it launches, use the gigarouter OpenAI-compatible endpoint with your API key, passing an image and candidate text labels for zero-shot classification.

What is the parameter count of this model?

It has 0.2 billion parameters.

What hardware was the model trained on?

The base SigLIP model was trained on 16 TPU-v4 chips for three days.

not yet live

We're benchmarking and onboarding SigLIP Base Patch16 384 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336