
ViT-SO400M-14 SigLIP

timm/ViT-SO400M-14-SigLIP

published Oct 2023 · updated Oct 2023

ViT-SO400M-14 SigLIP is a zero-shot image classification model pre-trained on image-text pairs with a sigmoid loss (SigLIP).

status: coming soon
API providers: 0
downloads / mo: 116.5K
license: apache-2.0

specs

Task: Zero-Shot Image Classification
Architecture: ViT-SO400M-14 (Vision Transformer with patch size 14)
Parameters: 400M
License: apache-2.0

about this model

ViT-SO400M-14-SigLIP is a zero-shot image classification model that uses a sigmoid loss function for language-image pre-training, enabling it to classify images into arbitrary categories without task-specific fine-tuning.

Model Architecture and Training

This model implements SigLIP, which replaces the softmax-normalized contrastive loss of CLIP-style models with a pairwise sigmoid loss. Each image-text pair is scored independently as a binary match or non-match, so the loss needs no global view of the pairwise similarities for normalization, and training works at both small and large batch sizes. The model was trained on the WebLI dataset, and its weights were converted from the original JAX checkpoints in Big Vision to PyTorch for use with both OpenCLIP and timm.
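The pairwise sigmoid loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Big Vision training code: the function names are hypothetical, and the `temperature` and `bias` defaults are fixed illustrative values, whereas in SigLIP both are learnable scalars.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so exp() is only ever called on a non-positive value.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Sum of per-pair binary log-losses, averaged over the number of images.

    Pair (i, j) is a positive when i == j and a negative otherwise.
    temperature and bias are illustrative assumptions, not trained values.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / len(images)


# Toy batch: correctly paired embeddings versus the same embeddings swapped.
aligned_loss = sigmoid_pairwise_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped_loss = sigmoid_pairwise_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

Every term in the double loop depends on a single pair's logit, which is why no softmax-style normalization across the batch is needed.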

Key Strengths

  • Efficient training: Combined with Locked-image Tuning, the SigLIP paper reports a SigLiT model reaching 84.5% ImageNet zero-shot accuracy after two days of training on only four TPUv4 chips.
  • Flexible batch scaling: The sigmoid loss decouples the loss computation from the batch size, which allowed the authors to train with batch sizes up to one million, though the benefits diminish beyond 32k.
  • Improved small-batch performance: Performs better than softmax-based approaches at smaller batch sizes.
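One way to see why the sigmoid loss is less tied to batch size is to compare the loss term for a single matching pair under each formulation. The sketch below is illustrative only: the helper names and logit values are made up, and it shows just the per-pair term, not a full training objective.

```python
import math


def softmax_pair_term(logits_row, positive_index):
    """Softmax contrastive loss for one image: depends on every logit in the row."""
    peak = max(logits_row)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits_row))
    return log_norm - logits_row[positive_index]


def sigmoid_pair_term(logit, is_match):
    """Sigmoid loss for one image-text pair: depends only on that pair's logit."""
    signed = logit if is_match else -logit
    # -log(sigmoid(signed)) written as softplus(-signed) in an overflow-safe form.
    return max(-signed, 0.0) + math.log1p(math.exp(-abs(signed)))


# Hypothetical logits for one image; index 0 is its matching caption.
small_batch = [4.0, -2.0]
large_batch = small_batch + [-1.0, 0.5, -3.0]

softmax_small = softmax_pair_term(small_batch, 0)
softmax_large = softmax_pair_term(large_batch, 0)  # changes as negatives are added
sigmoid_positive = sigmoid_pair_term(small_batch[0], True)  # unaffected by batch size
```

Adding negatives changes the softmax normalizer, and with it the loss for the positive pair. The sigmoid term for that pair stays the same, and each extra negative simply contributes its own independent term.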

Benchmark Results

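The SigLIP paper reports 84.5% zero-shot accuracy on ImageNet for its SigLiT configuration (sigmoid loss combined with Locked-image Tuning), demonstrating strong open-vocabulary classification without task-specific training data. That figure is not a measurement of this specific checkpoint.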

Usage Through gigarouter API

Once the hosted endpoint is live, gigarouter will provide API access to ViT-SO400M-14-SigLIP for zero-shot classification. The model accepts an image and a set of arbitrary text labels and returns a probability score for each label, with no local installation or model loading required.
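For intuition about the per-label probability scores, here is a minimal sketch of how a SigLIP-style model can turn embeddings into scores. The response format of the hosted API is not documented here, so this only illustrates the scoring step. The helper name and the example values for `logit_scale` and `logit_bias` are assumptions.

```python
import math


def label_probabilities(image_emb, label_embs, logit_scale, logit_bias):
    """Score each label independently: sigmoid(scale * cosine_similarity + bias)."""

    def unit(vector):
        norm = math.sqrt(sum(c * c for c in vector))
        return [c / norm for c in vector]

    def sigmoid(x):
        # Overflow-safe sigmoid for large negative or positive inputs.
        if x >= 0:
            return 1.0 / (1.0 + math.exp(-x))
        e = math.exp(x)
        return e / (1.0 + e)

    img = unit(image_emb)
    probs = []
    for emb in label_embs:
        cosine = sum(a * b for a, b in zip(img, unit(emb)))
        probs.append(sigmoid(logit_scale * cosine + logit_bias))
    return probs


# Toy 2-D embeddings; the scale and bias are illustrative, not the trained values.
probs = label_probabilities([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 10.0, -10.0)
```

Because each label passes through its own sigmoid, the scores are independent and need not sum to 1, unlike a softmax over the labels.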

Citation

Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid loss for language image pre-training. arXiv preprint arXiv:2303.15343.

FAQ

What is the main advantage of SigLIP over standard contrastive models?

SigLIP uses a pairwise sigmoid loss that does not require global normalization, allowing flexible batch sizes and better performance at smaller batches.

What dataset was this model trained on?

It was trained on WebLI.

How can I use this model for zero-shot classification?

Once the hosted model is live, you will be able to call it through the gigarouter OpenAI-compatible endpoint with an API key. Until then, you can run it locally with OpenCLIP or timm, as shown in the model card.
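For local use with OpenCLIP, the sketch below follows the usual SigLIP pattern of encoding the image and the labels, then applying a sigmoid to the scaled and shifted similarities. It assumes the `open_clip_torch`, `torch`, and `Pillow` packages are installed and that the weights can be downloaded from the Hugging Face Hub. The function name `classify_zero_shot` is hypothetical.

```python
def classify_zero_shot(image_path, labels, model_id="hf-hub:timm/ViT-SO400M-14-SigLIP"):
    """Return {label: probability} for one local image file."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import torch
    import torch.nn.functional as F
    from PIL import Image
    from open_clip import create_model_from_pretrained, get_tokenizer

    model, preprocess = create_model_from_pretrained(model_id)
    tokenizer = get_tokenizer(model_id)
    model.eval()

    # Preprocess one image into a batch of size 1 and tokenize the label strings.
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer(labels, context_length=model.context_length)

    with torch.no_grad():
        image_features = F.normalize(model.encode_image(image), dim=-1)
        text_features = F.normalize(model.encode_text(text), dim=-1)
        # SigLIP scores each label with a sigmoid rather than a softmax.
        logits = image_features @ text_features.T * model.logit_scale.exp() + model.logit_bias
        probs = torch.sigmoid(logits)[0]

    return dict(zip(labels, probs.tolist()))
```

A call such as `classify_zero_shot("photo.jpg", ["a dog", "a cat"])` would download the weights on first use. The file name and labels are placeholders.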

What input size does the model expect?

The model expects 224x224 pixel images.
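As a quick worked example, the input size and the patch size of 14 listed in the specs determine how many patches the vision encoder sees. This sketch assumes a standard non-overlapping ViT patch embedding. The helper name is hypothetical.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid(224, 14)
print(per_side, total)  # 16 256
```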

Does the model support both image and text encoding?

Yes. Loaded through OpenCLIP, the model encodes both images and text for contrastive tasks. The timm version provides only the image encoder, so it returns image embeddings only.

not yet live

We're benchmarking and onboarding ViT-SO400M-14 SigLIP as a hosted, OpenAI-compatible API.
