SigLIP ViT SO400M 384

timm/ViT-SO400M-14-SigLIP-384

published Oct 2023 · updated Oct 2023

SigLIP ViT SO400M 384 is a zero-shot image classification model that uses a sigmoid loss for contrastive image-text pre-training on the WebLI dataset.

status

coming soon

API providers

downloads / mo

242.3K

license

apache-2.0

specs

Task	Zero-Shot Image Classification
Architecture	Vision Transformer (ViT-SO400M-14) with patch size 14, image size 384
Parameters	400M
License	Apache 2.0

about this model

ViT-SO400M-14-SigLIP-384 is a contrastive vision-language model for zero-shot image classification. It was pre-trained with a sigmoid loss for language-image pre-training (SigLIP), so it can classify images into arbitrary text-defined categories without task-specific fine-tuning. The model employs a shape-optimized Vision Transformer with 400 million parameters (ViT-SO400M), a patch size of 14, and 384x384 input resolution.

The model was trained on the WebLI dataset using the SigLIP approach, which replaces the softmax normalization of standard contrastive learning with a pairwise sigmoid loss: each image-text pair is scored independently as a match or a non-match. Because the loss needs no normalization over all pairwise similarities in the batch, training is effective at both small and large batch sizes.

A key strength is efficient training with modest compute: the SigLiT variant (SigLIP combined with Locked-image Tuning) reaches 84.5% zero-shot accuracy on ImageNet after two days of training on only four TPUv4 chips. The sigmoid loss also made batch-size scaling experiments up to one million possible, though the benefits diminish beyond 32k. The SigLIP work was presented as an Oral at ICCV 2023.

The model is available under the Apache 2.0 license. It supports both image-text contrastive tasks (via OpenCLIP) and standalone image embedding extraction (via timm). gigarouter is onboarding it as a managed API with OpenAI-compatible endpoints for zero-shot image classification, so no local infrastructure will be required once it is live.
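The pairwise sigmoid loss described above can be illustrated with a minimal pure-Python sketch. It is not the training code used for this model: the function names are hypothetical, the toy embeddings are made up, and the default `temperature=10.0` and `bias=-10.0` are assumptions taken from the initialization reported in the SigLIP paper (in training, both are learned).

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Average per-image loss over all image-text pairs in a batch.

    Pair (i, j) is a positive when i == j and a negative otherwise.
    Each pair contributes independently, so no softmax over the batch
    is needed.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * logit)
    return total / len(images)


# Toy batch: two images and two captions (hypothetical 2-d embeddings).
toy_images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # caption i matches image i
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]  # captions swapped

aligned_loss = sigmoid_pairwise_loss(toy_images, aligned_texts)
shuffled_loss = sigmoid_pairwise_loss(toy_images, shuffled_texts)
print(aligned_loss, shuffled_loss)
```

The aligned batch yields a much lower loss than the shuffled one, which is the behavior the objective is meant to reward.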

best for

· Classifying images with arbitrary text labels without fine-tuning
· Building visual search and retrieval systems using text queries
· Extracting high-quality image embeddings for downstream tasks
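As a rough sketch of how classification with arbitrary text labels works once embeddings are available, the snippet below scores each label independently with a sigmoid over scaled cosine similarity. The embeddings are hypothetical 3-d placeholders rather than real model outputs, `score_labels` is an illustrative helper, and `logit_scale` and `logit_bias` stand in for parameters the real model learns during training.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_labels(image_emb, label_embs, logit_scale=10.0, logit_bias=-10.0):
    """Return an independent match probability for every text label.

    logit_scale and logit_bias are placeholder values (assumptions);
    a trained checkpoint ships its own learned values.
    """
    scores = {}
    for label, emb in label_embs.items():
        logit = logit_scale * cosine(image_emb, emb) + logit_bias
        scores[label] = 1.0 / (1.0 + math.exp(-logit))
    return scores


# Hypothetical embeddings: in practice these come from the image and
# text encoders of the model.
image_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}

scores = score_labels(image_embedding, label_embeddings)
best_label = max(scores, key=scores.get)
print(best_label, scores)
```

Because every label is scored on its own, the probabilities do not have to sum to 1, unlike softmax-based zero-shot classifiers. The same similarity scores can be ranked across a collection of images to support text-query retrieval.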

FAQ

What is the architecture of this model?

It is a shape-optimized Vision Transformer (ViT) with 400M parameters (SO400M), a patch size of 14, and a 384x384 input size.
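A quick arithmetic sketch shows what a patch size of 14 means at 384x384 input. It assumes non-overlapping patches with no padding, so a trailing strip narrower than one patch is dropped; the source does not state how the model handles the remainder, and `patch_grid` is a hypothetical helper.

```python
def patch_grid(image_size: int = 384, patch_size: int = 14):
    """Return (patches per side, total patches, leftover pixels per side).

    Assumes non-overlapping, unpadded patches.
    """
    per_side = image_size // patch_size
    leftover = image_size - per_side * patch_size
    return per_side, per_side * per_side, leftover


per_side, total_patches, leftover = patch_grid()
print(per_side, total_patches, leftover)  # 27 729 6
```

Under that assumption the encoder sees a 27x27 grid, that is 729 patch tokens per image, since 384 is not an exact multiple of 14.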

What loss function does SigLIP use?

SigLIP uses a pairwise sigmoid loss for contrastive image-text pre-training, rather than the softmax-normalized loss used in standard contrastive learning.

What dataset was it trained on?

It was trained on the WebLI dataset.

How can I use this model via the gigarouter API?

The hosted API is not yet live. Once it is, call the OpenAI-compatible endpoint with your API key, supplying an image and a list of text labels for zero-shot classification.

What input format does the model expect?

Images should be resized to 384x384 pixels. Zero-shot classification also requires the candidate labels as text, which are tokenized before being passed to the text encoder.

not yet live

We're benchmarking and onboarding SigLIP ViT SO400M 384 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336