Fashion SigLIP
Marqo/marqo-fashionSigLIP
published Aug 2024 · updated Feb 2026
Fashion SigLIP is an image-text embedding model fine-tuned for fashion search. It embeds product images and text descriptions into a shared vector space, which enables zero-shot retrieval of fashion products from either modality.
specs
| Spec | Value |
|---|---|
| Task | Zero-shot image-text retrieval / multimodal embedding |
| Architecture | ViT-B-16-SigLIP fine-tuned with Generalized Contrastive Learning (GCL) |
| Parameters | 150M |
| License | Apache 2.0 |
about this model
Marqo/marqo-fashionSigLIP is a multimodal embedding model for zero-shot image and text retrieval, fine-tuned from ViT-B-16-SigLIP (webli) using Generalized Contrastive Learning (GCL). It is a 150M parameter model released under the Apache 2.0 license.
The model was trained on over 1 million fashion products with rich metadata, using a 7-component loss function that optimizes for long descriptions, product titles, colors, materials, categories, details, and keywords. This enables highly relevant search results on fashion products without requiring task-specific fine-tuning.
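To make the idea of a multi-component loss concrete, here is a minimal sketch, not Marqo's training code. It assumes each of the seven text fields contributes a SigLIP-style pairwise sigmoid loss against the image embeddings, and that the seven terms are combined as a weighted average. The field keys, the equal weights, and the `temperature` and `bias` values are all hypothetical choices for illustration.

```python
import math

# Hypothetical field names and equal weights; the source lists the seven
# fields but does not say how they are weighted.
FIELD_WEIGHTS = {
    "long_description": 1.0,
    "title": 1.0,
    "color": 1.0,
    "material": 1.0,
    "category": 1.0,
    "details": 1.0,
    "keywords": 1.0,
}


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid_pair_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """SigLIP-style pairwise sigmoid loss over a batch of embeddings.

    Pair (i, j) is labeled +1 when i == j (matching image and text)
    and -1 otherwise. temperature and bias are illustrative constants.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(image_embs[i], text_embs[j]) + bias
            # -log(sigmoid(label * logit)) == log(1 + exp(-label * logit))
            total += math.log1p(math.exp(-label * logit))
    return total / n


def multi_field_loss(image_embs, field_text_embs, weights=FIELD_WEIGHTS):
    """Weighted average of per-field losses (assumed combination rule)."""
    total_weight = sum(weights[name] for name in field_text_embs)
    weighted = sum(
        weights[name] * sigmoid_pair_loss(image_embs, embs)
        for name, embs in field_text_embs.items()
    )
    return weighted / total_weight
```

With this formulation, a batch in which every field's text embedding lines up with its own image yields a lower loss than a batch in which the pairs are shuffled, which is the signal that pushes all seven fields toward the same image representation.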
Benchmark Performance
Evaluated across 7 public multimodal fashion datasets (Atlas, DeepFashion In-shop, DeepFashion Multimodal, Fashion200k, KAGL, Polyvore, and iMaterialist), Marqo-FashionSigLIP achieves the following averaged results:
| Task | Metric | Marqo-FashionSigLIP | FashionCLIP2.0 | Improvement |
|---|---|---|---|---|
| Text-To-Image | AvgRecall | 0.231 | 0.163 | +42% |
| Text-To-Image | Recall@1 | 0.121 | 0.077 | +57% |
| Text-To-Image | MRR | 0.239 | 0.165 | +45% |
| Category-To-Product | P@1 | 0.758 | 0.681 | +11% |
| Category-To-Product | MRR | 0.812 | 0.741 | +10% |
| Sub-Category-To-Product | P@1 | 0.767 | 0.676 | +13% |
| Sub-Category-To-Product | MRR | 0.811 | 0.733 | +11% |
Inference is approximately 10% faster than FashionCLIP2.0 and OpenFashionCLIP for combined text and image processing. A successor model, Marqo-FashionSigLIP-2, is available with a further 78% improvement in MRR and recall.
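For readers less familiar with the metrics in the table, the sketch below uses the standard textbook definitions of Recall@k, P@1, and reciprocal rank (MRR is its mean over queries), plus the relative-improvement arithmetic behind the last column. The exact evaluation protocol used for the benchmark is not described here, so treat these functions as illustrative rather than as the benchmark's own code.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def precision_at_1(ranked_ids, relevant_ids):
    """1.0 if the top-ranked result is relevant, otherwise 0.0."""
    return 1.0 if ranked_ids and ranked_ids[0] in set(relevant_ids) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Reproduce the text-to-image rows of the table from the reported scores.
for metric, ours, theirs in [
    ("AvgRecall", 0.231, 0.163),
    ("Recall@1", 0.121, 0.077),
    ("MRR", 0.239, 0.165),
]:
    print(f"{metric}: +{relative_improvement(ours, theirs):.0f}%")
```

The improvement column is a relative gain, not a difference in percentage points: Recall@1 rises from 0.077 to 0.121, which is 4.4 points in absolute terms.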
best for
- Fashion product search and recommendation in e-commerce
- Visual similarity search for apparel and accessories
- Text-to-image and category-to-product matching for retail catalogs
- Multi-modal retrieval combining images, colors, materials, and keywords
FAQ
Fashion SigLIP is best suited to zero-shot fashion product retrieval, covering text-to-image, image-to-image, and category-based search.
Averaged across seven public fashion benchmarks, Fashion SigLIP improves text-to-image retrieval over FashionCLIP2.0 by up to 57% (Recall@1; MRR improves by 45%), with roughly 10% faster inference.
Fashion SigLIP is released under the Apache 2.0 license, allowing free use, modification, and distribution.
Input: images (PIL or tensor) and/or text strings. Output: normalized image and text embeddings (vectors). The model can be loaded through Hugging Face AutoModel or OpenCLIP.
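Once embeddings are available, retrieval reduces to a similarity ranking. The sketch below assumes the embeddings are normalized to unit L2 length, so that the dot product equals cosine similarity. The three-dimensional vectors and product IDs are made up for illustration; real embeddings come from the model and have far more dimensions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_similarity(query_emb, catalog_embs):
    """Return (product_id, score) pairs sorted from most to least similar.

    For unit-length vectors the dot product is the cosine similarity.
    """
    scores = {
        product_id: sum(q * c for q, c in zip(query_emb, emb))
        for product_id, emb in catalog_embs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins for a text query embedding and image embeddings.
query = l2_normalize([1.0, 0.2, 0.0])
catalog = {
    "red-dress": l2_normalize([0.9, 0.1, 0.0]),
    "blue-jeans": l2_normalize([0.0, 1.0, 0.1]),
    "leather-boots": l2_normalize([0.1, 0.0, 1.0]),
}

for product_id, score in rank_by_similarity(query, catalog):
    print(f"{product_id}: {score:.3f}")
```

The same function works for image-to-image search: pass an image embedding as the query instead of a text embedding.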
Use the gigarouter OpenAI-compatible endpoint with your API key, sending text and image inputs in the request to get embeddings or similarity scores.
We're benchmarking and onboarding Fashion SigLIP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.