Fashion SigLIP

Marqo/marqo-fashionSigLIP

published Aug 2024 · updated Feb 2026

Fashion SigLIP is an image-text embedding model fine-tuned for fashion search. It retrieves fashion products zero-shot from images and text descriptions.

est. price	~$0.094 / 1k images (estimated, set at launch)

downloads / mo	642.8K
license	apache-2.0

specs

Task	Zero-shot image-text retrieval / multimodal embedding
Architecture	ViT-B-16-SigLIP fine-tuned with Generalized Contrastive Learning (GCL)
Parameters	150M
License	Apache 2.0

about this model

Marqo/marqo-fashionSigLIP is a multimodal embedding model for zero-shot image and text retrieval, fine-tuned from ViT-B-16-SigLIP (webli) using Generalized Contrastive Learning (GCL). It is a 150M-parameter model released under the Apache 2.0 license.

The model was trained on over 1 million fashion products with rich metadata, using a 7-component loss function that optimizes for long descriptions, product titles, colors, materials, categories, details, and keywords. This enables highly relevant search results on fashion products without requiring task-specific fine-tuning.
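As a rough illustration of what a multi-component loss can look like, the sketch below sums one SigLIP-style pairwise sigmoid loss per metadata field, each with its own weight. It is not Marqo's GCL implementation: the uniform weights, the function names, and the use of the sigmoid loss for every field are assumptions made for this example. The temperature and bias defaults are the initial values from the SigLIP paper.

```python
import math

# The seven metadata fields named in the text above.
FIELDS = ["long_description", "title", "color", "material",
          "category", "details", "keywords"]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_pair_loss(img_vecs, txt_vecs, temperature=10.0, bias=-10.0):
    """SigLIP-style loss: every (image, text) pair is a binary decision.

    Pairs with matching indices are positives (label +1), all others
    are negatives (label -1). Inputs are lists of unit-length vectors.
    """
    total = 0.0
    for i, img in enumerate(img_vecs):
        for j, txt in enumerate(txt_vecs):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total += softplus(-label * logit)  # equals -log(sigmoid(label * logit))
    return total / len(img_vecs)


def multi_field_loss(img_vecs, field_txt_vecs, field_weights):
    """Weighted sum of per-field losses (weights are hypothetical)."""
    return sum(field_weights[name] * sigmoid_pair_loss(img_vecs, vecs)
               for name, vecs in field_txt_vecs.items())
```

With this shape, raising a field's weight makes mismatches on that field (for example, the wrong color) cost more during training than mismatches on lower-weighted fields.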

Benchmark Performance

Evaluated across 7 public multimodal fashion datasets (Atlas, DeepFashion In-shop, DeepFashion Multimodal, Fashion200k, KAGL, Polyvore, and iMaterialist), Marqo-FashionSigLIP achieves the following averaged results:

Task	Metric	Marqo-FashionSigLIP	FashionCLIP2.0	Improvement
Text-To-Image	AvgRecall	0.231	0.163	+42%
Text-To-Image	Recall@1	0.121	0.077	+57%
Text-To-Image	MRR	0.239	0.165	+45%
Category-To-Product	P@1	0.758	0.681	+11%
Category-To-Product	MRR	0.812	0.741	+10%
Sub-Category-To-Product	P@1	0.767	0.676	+13%
Sub-Category-To-Product	MRR	0.811	0.733	+11%
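For readers who want to reproduce the Improvement column or score their own retrieval runs, here is a minimal sketch of the metrics involved. The per-query definitions are the standard ones and are assumed here; this page does not state how AvgRecall is averaged or which cutoffs it uses, so that metric is left out. All function names are illustrative.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def precision_at_1(ranked_ids, relevant_ids):
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] in relevant_ids else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result; 0.0 if none is retrieved.

    MRR is the mean of this value over all queries.
    """
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`, rounded to a whole percent."""
    return round((new / baseline - 1.0) * 100)


# Rows of the table above: (Marqo-FashionSigLIP, FashionCLIP2.0)
rows = [(0.231, 0.163), (0.121, 0.077), (0.239, 0.165),
        (0.758, 0.681), (0.812, 0.741), (0.767, 0.676), (0.811, 0.733)]
print([relative_improvement(new, base) for new, base in rows])
# prints [42, 57, 45, 11, 10, 13, 11]
```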

Inference is approximately 10% faster than FashionCLIP2.0 and OpenFashionCLIP for combined text and image processing. A successor model, Marqo-FashionSigLIP-2, is available with a further 78% improvement in MRR and recall.

best for

·Fashion product search and recommendation in e-commerce
·Visual similarity search for apparel and accessories
·Text-to-image and category-to-product matching for retail catalogs
·Multi-modal retrieval combining images, colors, materials, and keywords
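The use cases above all reduce to nearest-neighbor search over embeddings. The sketch below shows one common way to combine several query signals (for example an image vector and a color term) into a single weighted query and rank a catalog by cosine similarity. The weighting scheme, the function names, and the 3-dimensional toy vectors are assumptions for illustration; real vectors would come from the model.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def combine_queries(weighted_vecs):
    """Blend (weight, vector) pairs into one unit-length query vector."""
    dim = len(weighted_vecs[0][1])
    mixed = [sum(w * vec[d] for w, vec in weighted_vecs) for d in range(dim)]
    return normalize(mixed)


def top_k(query_vec, catalog, k=3):
    """Rank catalog items by dot product, which equals cosine similarity
    when every vector is unit length."""
    scored = [(sum(q * c for q, c in zip(query_vec, vec)), pid)
              for pid, vec in catalog.items()]
    scored.sort(reverse=True)
    return [(pid, score) for score, pid in scored[:k]]


# Toy stand-ins for product embeddings (hypothetical values).
catalog = {
    "red-dress": normalize([1.0, 0.1, 0.0]),
    "blue-jeans": normalize([0.0, 1.0, 0.1]),
    "red-scarf": normalize([0.8, 0.0, 0.6]),
}
# 70% weight on an "image-like" direction, 30% on a "keyword-like" one.
query = combine_queries([(0.7, [1.0, 0.0, 0.0]), (0.3, [0.0, 0.0, 1.0])])
print([pid for pid, _ in top_k(query, catalog, k=2)])
```

At catalog scale, the linear scan in top_k would normally be replaced by a vector index; the ranking logic stays the same.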

FAQ

What is Fashion SigLIP best for?

It is best for zero-shot fashion product retrieval: text-to-image, image-to-image, and category-based search. Averaged over seven public fashion datasets, it reaches 0.758 P@1 on category-to-product retrieval.

How does Fashion SigLIP compare to FashionCLIP?

Averaged over seven public fashion datasets, it improves on FashionCLIP2.0 by up to 57% (Recall@1 on text-to-image retrieval; MRR on the same task improves by 45%), and inference is approximately 10% faster.

What license is the model released under?

Fashion SigLIP is released under the Apache 2.0 license, allowing free use, modification, and distribution.

What are the input and output formats for this model?

Input: images (PIL or tensor) and/or text strings. Output: normalized image and text embeddings (vectors). The model can be loaded through Hugging Face AutoModel or OpenCLIP.
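A minimal sketch of that input/output flow using OpenCLIP follows. It assumes the open_clip_torch, torch, and Pillow packages are installed and that the weights load through the hf-hub: prefix with this repository name. The function names, the image path argument, and the softmax scale of 100 (the value used in conventional CLIP examples) are illustrative choices, not settings documented on this page.

```python
import math

MODEL_ID = "hf-hub:Marqo/marqo-fashionSigLIP"  # assumed OpenCLIP hub identifier


def embed_with_open_clip(image_path, texts):
    """Return (image_vector, text_vectors) as plain lists of floats."""
    # Imported lazily so the scoring helper below works without these packages.
    import torch
    import open_clip
    from PIL import Image

    # create_model_and_transforms returns (model, train_transform, val_transform).
    model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
    tokenizer = open_clip.get_tokenizer(MODEL_ID)
    model.eval()

    image = preprocess(Image.open(image_path)).unsqueeze(0)  # add batch dimension
    tokens = tokenizer(texts)
    with torch.no_grad():
        image_features = model.encode_image(image, normalize=True)
        text_features = model.encode_text(tokens, normalize=True)
    return image_features[0].tolist(), text_features.tolist()


def label_probabilities(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled image-text similarities, one value per text."""
    logits = [scale * sum(a * b for a, b in zip(image_vec, vec))
              for vec in text_vecs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]
```

Calling embed_with_open_clip with a local image path and labels such as ["a hat", "a t-shirt", "shoes"], then passing the result to label_probabilities, gives one probability per label. The first call downloads the model weights.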

How can I call Fashion SigLIP via the gigarouter API?

Once the model is live on gigarouter, call the OpenAI-compatible endpoint with your API key and send text and image inputs in the request to get embeddings or similarity scores.

not yet live

We're benchmarking and onboarding Fashion SigLIP as a hosted, OpenAI-compatible API.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336