skip to content
gigarouter gigarouter
models / zero-shot image · coming soon

SigLIP Large 384

google/siglip-large-patch16-384

published Jan 2024 · updated Sep 2024

SigLIP Large 384 is a zero-shot-image model that uses a sigmoid loss for language-image pre-training, enabling image classification and retrieval without task-specific fine-tuning.

est. price
~$0.235
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
39.4K
license
apache-2.0

specs

TaskZero-shot image classification, image-text retrieval
ArchitectureCLIP-based multimodal model with Vision Transformer (ViT) and text encoder
Training DataWebLI dataset (English image-text pairs)
Image Resolution384x384 pixels (resized and normalized)
Training Compute16 TPU-v4 chips for 3 days
Loss FunctionSigmoid loss (pairwise, no global normalization)

about this model

google/siglip-large-patch16-384 is a zero-shot image classification and image-text retrieval model that extends the CLIP architecture with a pairwise sigmoid loss, eliminating the need for global pairwise similarity normalization. This design enables effective training at both small and large batch sizes, scaling up to one million examples per batch while maintaining performance with a practical batch size of 32k.

Key strengths

  • Improved loss function: The sigmoid loss operates independently on each image-text pair, allowing larger batch scaling and better downstream accuracy compared to softmax-based contrastive learning.
  • High resolution input: Pre-trained on 384×384 images from the WebLI dataset, balancing detail and computational efficiency.
  • Proven benchmark results: The SigLiT variant (SigLIP + Locked-image Tuning) achieves 84.5% ImageNet zero-shot accuracy using only four TPUv4 chips in two days. The paper was presented as an Oral at ICCV 2023.
  • Efficient training: This large-sized model was trained on 16 TPU-v4 chips over three days.

Evaluation against CLIP

The following comparison from the original paper illustrates SigLIP’s performance across multiple zero-shot benchmarks relative to CLIP:

Zero-shot accuracy comparison between SigLIP and CLIP on several image classification datasets

Model background

Introduced by Zhai et al. in Sigmoid Loss for Language Image Pre-Training (2023), the model was contributed to the Transformers ecosystem on January 8, 2024. The SigLIP family includes base, large, and shape-optimized 400M variants. This specific checkpoint is optimized for zero-shot inference and can be used directly for tasks such as classifying images against arbitrary text labels or retrieving relevant images from natural language queries.

best for

FAQ

What is SigLIP Large 384 best used for?

It excels at zero-shot image classification and image-text retrieval, where the model can compare arbitrary text labels or captions with images without requiring fine-tuning.

How does SigLIP differ from standard CLIP?

SigLIP replaces the softmax-based contrastive loss of CLIP with a pairwise sigmoid loss. This removes the need for global pairwise similarity normalization and allows better performance with smaller batch sizes and easier scaling to very large batches.

What input formats are required?

Images must be resized to 384x384 pixels, normalized with mean 0.5 and std 0.5 per channel. Text must be tokenized and padded to 64 tokens. The model processes image-text pairs and returns logits and sigmoid probabilities.

How can I call SigLIP Large 384 via the gigarouter API?

Use the OpenAI-compatible endpoint with your gigarouter API key. Send a request with the image URL or base64-encoded image and the candidate text labels. The API returns probability scores for each label.

What are the license terms for this model?

The model card does not specify a license. The original paper is open access, but users should verify terms on the official Hugging Face repository before commercial use.

not yet live

We're benchmarking and onboarding SigLIP Large 384 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related zero-shot image models

compare all →