Question 1

What is SigLIP Large 384 best used for?

Accepted Answer

It excels at zero-shot image classification and image-text retrieval, where the model can compare arbitrary text labels or captions with images without requiring fine-tuning.
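As a rough illustration of zero-shot scoring, the sketch below compares one image embedding against several label embeddings and turns each similarity into an independent sigmoid probability. The embeddings are toy vectors, and the `scale` and `bias` values are placeholders rather than the model's learned parameters; `zero_shot_scores` is a hypothetical helper, not part of any library.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, scale=10.0, bias=-10.0):
    """Score one image embedding against several label embeddings.

    scale and bias are placeholder values; the real model learns them.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = scale * (text_embs @ image_emb) + bias
    # Each label gets its own sigmoid, so scores need not sum to 1.
    return 1.0 / (1.0 + np.exp(-logits))

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Toy embeddings standing in for real encoder outputs.
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])

probs = zero_shot_scores(image_emb, text_embs)
best_label = labels[int(np.argmax(probs))]
```

Because every label is scored on its own, the same routine works for retrieval: rank captions (or images) by score instead of taking only the top one.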

Question 2

How does SigLIP differ from standard CLIP?

Accepted Answer

SigLIP replaces CLIP's softmax-based contrastive loss with a pairwise sigmoid loss. Each image-text pair is scored independently as a match or a non-match, so the loss does not need to normalize over all pairwise similarities in the batch. This improves performance at smaller batch sizes and makes it easier to scale to very large batches.
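A minimal NumPy sketch of a pairwise sigmoid loss may make the difference concrete. It treats every image-text pair in a batch as an independent binary decision (matching pairs on the diagonal, non-matching pairs elsewhere). The `temperature` and `bias` defaults are illustrative placeholders for parameters that are learned in practice, and the function is a simplified reading of the idea, not the reference implementation.

```python
import numpy as np

def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of n image and n text embeddings."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * image_embs @ text_embs.T + bias
    n = logits.shape[0]
    # +1 for matching pairs (diagonal), -1 for all other pairs.
    signs = 2.0 * np.eye(n) - 1.0
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    per_pair = np.logaddexp(0.0, -signs * logits)
    # No softmax: each pair contributes on its own, averaged over the batch.
    return per_pair.sum() / n

embs = np.eye(4)  # toy batch: four mutually orthogonal embeddings
matched = sigmoid_pairwise_loss(embs, embs)           # correct pairing
mismatched = sigmoid_pairwise_loss(embs, embs[::-1])  # scrambled pairing
```

With the toy batch, the correctly paired call yields a lower loss than the scrambled one, which is the behavior the loss is meant to reward.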

Question 3

What input formats are required?

Accepted Answer

Images must be resized to 384x384 pixels and normalized with mean 0.5 and standard deviation 0.5 per channel. Text must be tokenized and padded to 64 tokens. The model scores image-text pairs and returns logits; applying a sigmoid to each logit gives a per-pair match probability.
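The normalization and padding steps can be sketched as follows. This assumes the image has already been resized to 384x384 and the text has already been tokenized into integer ids. `PAD_ID` is a hypothetical value (the real tokenizer defines its own), and truncating over-long inputs is an assumption the answer above does not state.

```python
import numpy as np

IMAGE_SIZE = 384
MAX_TEXT_LEN = 64
PAD_ID = 0  # hypothetical pad token id; use the real tokenizer's value

def normalize_image(pixels_uint8):
    """Map an already-resized HxWx3 uint8 image to floats in [-1, 1]."""
    if pixels_uint8.shape != (IMAGE_SIZE, IMAGE_SIZE, 3):
        raise ValueError("expected a 384x384 RGB image")
    scaled = pixels_uint8.astype(np.float32) / 255.0  # rescale to [0, 1]
    return (scaled - 0.5) / 0.5                       # mean 0.5, std 0.5

def pad_tokens(token_ids, max_len=MAX_TEXT_LEN, pad_id=PAD_ID):
    """Right-pad token ids to max_len (truncating is an assumption)."""
    clipped = list(token_ids)[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

image = np.zeros((IMAGE_SIZE, IMAGE_SIZE, 3), dtype=np.uint8)
image[0, 0] = 255  # one white pixel on a black background
normalized = normalize_image(image)
padded = pad_tokens([101, 2023, 2003])  # made-up token ids
```

In practice the model's own processor handles these steps; the sketch only shows what the numbers in the answer mean.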

Question 4

How can I call SigLIP Large 384 via the gigarouter API?

Accepted Answer

Use the OpenAI-compatible endpoint with your gigarouter API key. Send a request with the image URL or base64-encoded image and the candidate text labels. The API returns probability scores for each label.
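The exact request schema is not given here, so the following is only a sketch of how a request body with a base64-encoded image and candidate labels might be assembled. The field names (`model`, `image`, `labels`), the model identifier, and the data-URL form are all hypothetical; check the gigarouter documentation for the real endpoint and schema before sending anything.

```python
import base64
import json

def build_request_body(image_bytes, labels, model="siglip-large-384"):
    """Assemble a JSON-serializable body; every field name is hypothetical."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,                               # hypothetical model id
        "image": f"data:image/png;base64,{encoded}",  # hypothetical field
        "labels": list(labels),                       # hypothetical field
    }

fake_image = b"\x89PNG placeholder bytes"  # stand-in for real image bytes
body = build_request_body(fake_image, ["cat", "dog"])
payload = json.dumps(body)  # string to send with the API key in a header
```

The sketch stops short of sending the request, since the endpoint URL and authentication header are not specified in the answer above.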

Question 5

What are the license terms for this model?

Accepted Answer

The model card does not specify a license. The original paper is open access, but users should verify terms on the official Hugging Face repository before commercial use.

Task	Zero-shot image classification, image-text retrieval
Architecture	CLIP-based multimodal model with Vision Transformer (ViT) and text encoder
Training Data	WebLI dataset (English image-text pairs)
Image Resolution	384x384 pixels (resized and normalized)
Training Compute	16 TPU-v4 chips for 3 days
Loss Function	Sigmoid loss (pairwise, no global normalization)
