ViT-SO400M-14 SigLIP

timm/ViT-SO400M-14-SigLIP

published Oct 2023 · updated Oct 2023

ViT-SO400M-14 SigLIP is a zero-shot image classification model pre-trained on image-text pairs with a pairwise sigmoid loss (SigLIP), which lets it match images against arbitrary text labels.

status

coming soon

API providers

downloads / mo

116.5K

license

apache-2.0

specs

Task	Zero-Shot Image Classification
Architecture	ViT-SO400M-14 (Vision Transformer with patch size 14)
Parameters	400M
License	Apache-2.0

about this model

ViT-SO400M-14-SigLIP is a zero-shot image classification model that uses a sigmoid loss function for language-image pre-training, enabling it to classify images into arbitrary categories without task-specific fine-tuning.

Model Architecture and Training

This model implements the SigLIP approach, which replaces the softmax-normalized contrastive loss of CLIP-style training with a pairwise sigmoid loss. The loss scores each image-text pair independently, so it needs no normalization over all pairwise similarities in the batch and trains effectively at both small and large batch sizes. The model was trained on the WebLI dataset, and its weights were converted from the original JAX checkpoints in Big Vision to PyTorch weights usable in both OpenCLIP and timm.
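To make the pairwise sigmoid loss concrete, here is a minimal pure-Python sketch. It is not the Big Vision implementation. It treats every image-text combination in a batch as an independent binary decision: matching pairs are labeled +1 and all other combinations -1. The function name `sigmoid_pairwise_loss` and the constant temperature `t` and bias `b` are assumptions chosen for illustration; in the paper both are learned parameters.

```python
import math

def sigmoid_pairwise_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    img_emb[i] and txt_emb[i] are assumed to be a matching pair.
    t (temperature) and b (bias) are illustrative constants here.
    """
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Dot product of one image embedding with one text embedding.
            sim = sum(x * y for x, y in zip(img_emb[i], txt_emb[j]))
            z = 1.0 if i == j else -1.0  # +1 for a matching pair, -1 otherwise
            logit = z * (t * sim + b)
            # -log(sigmoid(logit)), written in a numerically stable form.
            total += max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return total / n

# Toy batch: two orthogonal unit vectors, correctly paired vs. swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = sigmoid_pairwise_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = sigmoid_pairwise_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

Each term in the double loop depends only on its own pair, which is why no softmax over the whole batch is needed.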

Key Strengths

Efficient training: The SigLIP paper reports that, combined with Locked-image Tuning, a SigLiT model reaches 84.5% ImageNet zero-shot accuracy after two days of training on only four TPUv4 chips.
Flexible batch scaling: The sigmoid loss decouples the loss computation from the batch size, so training works with batch sizes up to one million, though the benefits diminish beyond 32k.
Improved small-batch performance: The sigmoid loss performs better than softmax-based contrastive losses at smaller batch sizes.

Benchmark Results

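The SigLIP paper reports 84.5% zero-shot accuracy on ImageNet for a SigLiT model trained with the sigmoid loss and Locked-image Tuning. That figure describes the paper's setup rather than a measurement of this specific checkpoint, but it indicates the strength of the approach for open-vocabulary classification without task-specific training data.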

Usage Through gigarouter API

Once the model is live, gigarouter will provide hosted API access to ViT-SO400M-14-SigLIP for zero-shot classification. The model accepts an image and arbitrary text labels and returns a probability score for each label. No local installation or model loading is required.
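The per-label probability scores can be illustrated with a small pure-Python sketch. This is not the gigarouter API, whose request format is not described here; it only shows how SigLIP-style scores are typically derived from embeddings, by applying a sigmoid to a scaled similarity plus a bias. The function name `label_probabilities`, the toy two-dimensional embeddings, and the `logit_scale` and `logit_bias` values are all assumptions for illustration; a real model supplies its own learned values.

```python
import math

def label_probabilities(image_emb, label_embs, logit_scale=10.0, logit_bias=-5.0):
    """Score one image embedding against several label embeddings.

    Each label gets an independent sigmoid probability, so the scores
    do not have to sum to 1 (unlike a softmax over labels).
    """
    probs = []
    for emb in label_embs:
        # Dot product of L2-normalized vectors, i.e. cosine similarity.
        sim = sum(x * y for x, y in zip(image_emb, emb))
        logit = logit_scale * sim + logit_bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs

# Toy example: the first label points the same way as the image,
# the second is orthogonal to it.
image = [0.6, 0.8]
labels = [[0.6, 0.8], [-0.8, 0.6]]
print(label_probabilities(image, labels))
```

Because the labels are scored independently, an image can receive a high score for several labels at once, or a low score for all of them.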

Citation

Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343.

best for

·Zero-shot classification of images using custom text labels
·Extracting image embeddings for downstream tasks

FAQ

What is the main advantage of SigLIP over standard contrastive models?

SigLIP uses a pairwise sigmoid loss that does not require global normalization, allowing flexible batch sizes and better performance at smaller batches.

What dataset was this model trained on?

It was trained on WebLI.

How can I use this model for zero-shot classification?

Once it is live, you can use it via the gigarouter OpenAI-compatible endpoint with an API key. Locally, you can run zero-shot classification with OpenCLIP as shown in the model card; the timm version provides image embeddings only.

What input size does the model expect?

The model expects 224x224 pixel images.
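As a quick worked example of what that input size means for a patch-size-14 Vision Transformer, the sketch below computes the patch grid. The helper name `patch_grid` is hypothetical, and the calculation assumes square images split into non-overlapping square patches.

```python
def patch_grid(image_size=224, patch_size=14):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side, side * side

print(patch_grid())  # (16, 256): a 16x16 grid of 14x14-pixel patches
```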

Does the model support both image and text encoding?

Yes, it supports both image and text encoding for contrastive tasks, though the timm version provides only image embeddings.

not yet live

We're benchmarking and onboarding ViT-SO400M-14 SigLIP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336