ViT B 16 SigLIP

timm/ViT-B-16-SigLIP

published Oct 2023 · updated Oct 2023

ViT B 16 SigLIP is a zero-shot image classification model that scores images against text prompts by comparing their embeddings. It was pre-trained with a sigmoid loss.

status

coming soon

API providers

downloads / mo

101.4K

license

apache-2.0

specs

Task	Zero-Shot Image Classification (Contrastive Image-Text)
Architecture	ViT-B/16 (Vision Transformer, patch size 16)
Dataset	WebLI
License	Apache 2.0

about this model

ViT-B-16-SigLIP is a zero-shot image classification model that uses a sigmoid loss for language-image pre-training (SigLIP), enabling contrastive matching between images and arbitrary text labels without fine-tuning.

Model details

Trained on the WebLI dataset and converted to PyTorch from the original JAX checkpoints, the model is designed for zero-shot classification and image-text retrieval. Its sigmoid loss operates on individual image-text pairs, so it does not need the normalization over all pairwise similarities in a batch that softmax-based contrastive losses require. This decouples the loss from the batch size: the SigLIP paper reports that the sigmoid loss performs better than softmax at smaller batch sizes, that a batch size of 32k is sufficient, and that the benefit of growing the batch further, up to one million, quickly diminishes.
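The pairwise sigmoid loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code used for this checkpoint: the function names are hypothetical, and the `temperature` and `bias` defaults are illustrative values (in the real model both are learned parameters). Each image-text pair is treated as an independent binary decision, with label +1 for matching pairs and -1 for all others.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch (hypothetical helper).

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    temperature and bias are illustrative defaults, not checkpoint values.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            total += log_sigmoid(label * logit)
    # Average over the batch; no softmax over rows or columns is needed.
    return -total / len(imgs)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned = sigmoid_pairwise_loss(images, [[1.0, 0.0], [0.0, 1.0]])
    swapped = sigmoid_pairwise_loss(images, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)  # matching pairs give the lower loss
```

Because every term depends only on one image-text pair, the sum can be computed in chunks, which is what makes the loss easy to scale across batch sizes.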

Key strengths and results

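The SigLIP paper (ICCV 2023 Oral) reports 84.5% ImageNet zero-shot accuracy for a SigLiT model, which combines the sigmoid loss with Locked-image Tuning and was trained on only four TPUv4 chips in two days. That figure describes the paper's SigLiT setup, not this ViT-B/16 checkpoint.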
Licensed under Apache 2.0, suitable for commercial and research use.
Widely adopted: about 101,000 downloads per month on Hugging Face as of early 2025.

The model is not yet live on gigarouter. Once hosted, it will be available for zero-shot classification tasks: provide an image and a list of candidate labels to obtain sigmoid-normalized probabilities, one per label.
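To illustrate what sigmoid-normalized probabilities mean in practice, the sketch below scores one image embedding against several label embeddings. It is not the gigarouter API or the model's own code: the function names, the toy two-dimensional embeddings, and the `logit_scale` and `logit_bias` values are all assumptions chosen for illustration (a trained checkpoint supplies its own learned scale and bias). The point is that each label gets an independent probability, so the values need not sum to 1.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_probabilities(image_emb, label_embs, logit_scale=10.0, logit_bias=-10.0):
    """Map each candidate label to an independent match probability.

    logit_scale and logit_bias are placeholder values for illustration only.
    """
    return {
        label: sigmoid(logit_scale * cosine(image_emb, emb) + logit_bias)
        for label, emb in label_embs.items()
    }


if __name__ == "__main__":
    # Toy embeddings; real ones would come from the image and text encoders.
    probs = label_probabilities(
        [1.0, 0.0],
        {"a photo of a cat": [1.0, 0.0], "a photo of a dog": [0.0, 1.0]},
    )
    best = max(probs, key=probs.get)
    print(best, probs)
```

Unlike a softmax over labels, a low score for every label is a valid outcome here, which can be useful when none of the candidate labels fits the image.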

best for

· Zero-shot classification of images with custom text labels
· Image-text similarity scoring for retrieval
· Building multimodal search applications

FAQ

What is the input size for this model?

The model expects images resized to 224x224 pixels.
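With a patch size of 16, a 224x224 input divides evenly into a grid of patches. The short sketch below works through that arithmetic; `patch_grid` is a hypothetical helper written for this page, not part of any library.

```python
def patch_grid(image_size=224, patch_size=16):
    """Return (patches per side, total patches) for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


if __name__ == "__main__":
    print(patch_grid())  # (14, 196)
```

So each image is represented as 14 x 14 = 196 patch tokens before the transformer layers.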

What license is the model released under?

It is released under the Apache 2.0 license.

How does SigLIP differ from CLIP?

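SigLIP replaces CLIP's softmax-based contrastive loss with a pairwise sigmoid loss. Each image-text pair is scored independently as a match or a non-match, so training needs no normalization across the whole batch and the batch size can be scaled more freely.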

How can I use this model via the gigarouter API?

The model is not yet live. Once it is, you will be able to call the OpenAI-compatible endpoint with your API key, providing an image and text prompts for zero-shot classification.

What dataset was the model trained on?

The model was trained on the WebLI dataset (Web Language-Image).

not yet live

We're benchmarking and onboarding ViT B 16 SigLIP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models


clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336