SigLIP Base Patch16 384

google/siglip-base-patch16-384

published Jan 2024 · updated Sep 2024

SigLIP Base Patch16 384 is a zero-shot image model pre-trained on image-text pairs with a sigmoid loss. It supports zero-shot image classification and image-text retrieval.

est. price: ~$0.094 / 1k images (estimated, set at launch)
API providers: 0
downloads / mo: 20.6K
license: apache-2.0

specs

Task: Zero-shot image classification, image-text retrieval
Architecture: CLIP-based multimodal model with a Vision Transformer (ViT) image encoder and a text encoder
Parameters: 0.2B
License: apache-2.0

about this model

google/siglip-base-patch16-384 is a zero-shot image classification and image-text retrieval model that uses a sigmoid loss for language-image pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates on individual image-text pairs and does not require a global view of pairwise similarities for normalization. This design allows the model to scale to larger batch sizes while maintaining strong performance even at smaller batch sizes.
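The pairwise idea can be illustrated with a minimal pure-Python sketch. It treats every image-text combination in a batch as an independent binary decision: matching pairs are labeled +1, all others -1. The function names are hypothetical, and the default temperature and bias are illustrative assumptions (in a real model both are learned); this is not the training code used for this checkpoint.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|))
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def pairwise_sigmoid_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss, averaged over the images in the batch.

    img_emb, txt_emb: lists of L2-normalized vectors where img_emb[i]
    and txt_emb[i] form a matching pair. temperature and bias are
    assumed fixed values here; in practice they are learnable.
    """
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sim = sum(a * b for a, b in zip(img_emb[i], txt_emb[j]))
            logit = temperature * sim + bias
            label = 1.0 if i == j else -1.0  # +1 for matches, -1 otherwise
            # Each pair contributes on its own; no softmax over the batch.
            total -= log_sigmoid(label * logit)
    return total / n


# Toy batch: two orthogonal unit vectors, correctly and incorrectly paired.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = pairwise_sigmoid_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = pairwise_sigmoid_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Because each term depends on a single pair, no row or column of the similarity matrix has to be normalized as a whole, which is the property the paragraph above describes.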

Model Architecture and Training

The model is a multimodal vision-language model based on the CLIP architecture with a ViT-B/16 image encoder and a text encoder. It has 0.2 billion parameters and was pre-trained on the English image-text pairs of the WebLI dataset at a resolution of 384x384 pixels. Images are rescaled and normalized across RGB channels (mean 0.5, std 0.5); text is tokenized and padded to 64 tokens. Training used 16 TPU-v4 chips over three days.
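The input preparation described above can be sketched in plain Python. This assumes "rescaled" means mapping 8-bit values from 0..255 to 0..1 before normalization, and it uses a hypothetical padding id of 0; the real tokenizer's padding id may differ. Resizing to 384x384 is left out, and the helper names are illustrative only.

```python
def normalize_pixels(pixels, mean=0.5, std=0.5):
    """Rescale 8-bit RGB values to 0..1, then normalize each channel.

    pixels: list of [r, g, b] values in 0..255.
    With mean 0.5 and std 0.5 the result lies in [-1, 1].
    """
    return [[(c / 255.0 - mean) / std for c in px] for px in pixels]


def pad_tokens(token_ids, max_length=64, pad_id=0):
    """Truncate or right-pad a list of token ids to max_length.

    pad_id=0 is a placeholder assumption, not the tokenizer's actual id.
    """
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


normalized = normalize_pixels([[0, 255, 127.5]])
padded = pad_tokens([5, 6, 7])
```

In practice a library-provided processor would normally handle both steps; the sketch only makes the arithmetic explicit.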

Benchmark Performance

The paper introducing SigLIP reports that a related variant with a locked image tower (SigLiT) achieves 84.5% ImageNet zero-shot accuracy using only 4 TPUv4 chips in two days. The following comparison, taken from the paper, illustrates SigLIP’s zero-shot accuracy relative to CLIP across multiple datasets:

[Figure: Zero-shot accuracy comparison of SigLIP and CLIP on multiple benchmarks]

FAQ

What is the main advantage of SigLIP over CLIP?

SigLIP uses a sigmoid loss that operates on individual image-text pairs, removing the need for global pairwise similarity normalization and enabling better performance at smaller batch sizes.
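One practical consequence shows up when scoring candidate labels at inference time. The sketch below contrasts the two scoring rules on made-up logits: with a sigmoid each label gets an independent probability, whereas a softmax forces the scores to sum to 1 across the candidates. The logit values and function names are hypothetical and not taken from this model.

```python
import math


def sigmoid_scores(logits):
    """Independent per-label probabilities (pairwise, SigLIP-style scoring)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax_scores(logits):
    """Probabilities normalized across all candidate labels (CLIP-style)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical image-text logits where no candidate label fits the image.
logits = [-3.0, -4.0, -5.0]
independent = sigmoid_scores(logits)
normalized = softmax_scores(logits)
```

With these inputs every sigmoid score stays low, while the softmax still assigns most of the probability mass to the least-bad label, so sigmoid scores can be easier to threshold when none of the candidates applies.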

What input format does the model expect?

Images are resized to 384x384 and normalized with mean 0.5 and std 0.5. Text is tokenized and padded to 64 tokens.

How can I call this model via the gigarouter API?

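The model is not yet live on gigarouter. Once it is hosted, it is planned to be available through the gigarouter OpenAI-compatible endpoint: authenticate with your API key and pass an image plus candidate text labels for zero-shot classification.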

What is the parameter count of this model?

It has 0.2 billion parameters.

What hardware was the model trained on?

The base SigLIP model was trained on 16 TPU-v4 chips for three days.

not yet live

We're benchmarking and onboarding SigLIP Base Patch16 384 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
