Fashion SigLIP

Marqo/marqo-fashionSigLIP

published Aug 2024 · updated Feb 2026

Fashion SigLIP is an image-text embedding model fine-tuned for fashion search. It supports zero-shot retrieval of fashion products from image or text queries.

est. price: ~$0.094 / 1k images (estimated, set at launch)
API providers: 0
downloads / mo: 642.8K
license: apache-2.0

specs

Task: Zero-shot image-text retrieval / multimodal embedding
Architecture: ViT-B-16-SigLIP fine-tuned with Generalized Contrastive Learning (GCL)
Parameters: 150M
License: Apache 2.0

about this model

Marqo/marqo-fashionSigLIP is a multimodal embedding model for zero-shot image and text retrieval, fine-tuned from ViT-B-16-SigLIP (webli) using Generalized Contrastive Learning (GCL). It is a 150M-parameter model released under the Apache 2.0 license.

The model was trained on over 1 million fashion products with rich metadata, using a 7-component loss function that optimizes for long descriptions, product titles, colors, materials, categories, details, and keywords. This enables highly relevant search results on fashion products without requiring task-specific fine-tuning.
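
The loss formula and its weights are not given here, so the following is only a minimal sketch of how a multi-component contrastive objective could be combined as a weighted sum over the seven text fields. The per-field term is the pairwise sigmoid loss from SigLIP; the field names, the equal weights, and the temperature and bias values are all assumptions for illustration, not Marqo's training code.

```python
import math

# Hypothetical field names, mirroring the seven components listed above.
FIELDS = ["long_description", "title", "color", "material",
          "category", "details", "keywords"]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """SigLIP-style loss: every image-text pair is a binary decision.

    Pairs with the same index are positives (+1); all others are negatives (-1).
    Embeddings are assumed to be L2-normalized already.
    """
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(image_embs[i], text_embs[j]) + bias
            total -= log_sigmoid(label * logit)
    return total / n


def multi_field_loss(image_embs, text_embs_by_field, weights):
    """Weighted sum of one contrastive loss per text field (assumed form)."""
    return sum(
        weight * pairwise_sigmoid_loss(image_embs, text_embs_by_field[field])
        for field, weight in weights.items()
    )


# Toy batch of two products with 2-d embeddings (made-up values).
images = [[1.0, 0.0], [0.0, 1.0]]
texts = {field: images for field in FIELDS}          # perfectly aligned texts
weights = {field: 1.0 / len(FIELDS) for field in FIELDS}

aligned = multi_field_loss(images, texts, weights)
swapped = pairwise_sigmoid_loss(images, images[::-1])  # texts paired with the wrong image
print(f"aligned loss: {aligned:.4f}, swapped loss: {swapped:.4f}")
```

In this toy batch the aligned pairing yields a lower loss than the swapped pairing, which is the behaviour a contrastive objective is meant to reward.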

Benchmark Performance

Evaluated across 7 public multimodal fashion datasets (Atlas, DeepFashion In-shop, DeepFashion Multimodal, Fashion200k, KAGL, Polyvore, and iMaterialist), Marqo-FashionSigLIP achieves the following averaged results:

Task                     Metric     Marqo-FashionSigLIP  FashionCLIP2.0  Improvement
Text-To-Image            AvgRecall  0.231                0.163           +42%
Text-To-Image            Recall@1   0.121                0.077           +57%
Text-To-Image            MRR        0.239                0.165           +45%
Category-To-Product      P@1        0.758                0.681           +11%
Category-To-Product      MRR        0.812                0.741           +10%
Sub-Category-To-Product  P@1        0.767                0.676           +13%
Sub-Category-To-Product  MRR        0.811                0.733           +11%
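
For readers unfamiliar with the metrics in the table above, here is a minimal sketch using their standard textbook definitions. The exact variants used in the evaluation (for example, which cutoffs AvgRecall averages over) are not stated here, so treat the function names and toy data as illustrative. The `relative_improvement` helper assumes the Improvement column is the relative change over FashionCLIP2.0, which is consistent with the reported numbers.

```python
from typing import Hashable, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Share of the relevant items that appear among the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Share of the top-k results that are relevant (P@1 when k == 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def relative_improvement(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, e.g. 0.57 for +57%."""
    return (new - baseline) / baseline


# Toy example: two queries, each with a ranked result list and a relevant set.
queries = [
    (["a", "b", "c"], {"b"}),  # first relevant hit at rank 2
    (["d", "e", "f"], {"d"}),  # first relevant hit at rank 1
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
recall_1 = sum(recall_at_k(r, rel, 1) for r, rel in queries) / len(queries)
p_at_1 = sum(precision_at_k(r, rel, 1) for r, rel in queries) / len(queries)
print(mrr, recall_1, p_at_1)

# Text-To-Image Recall@1 row from the table: 0.121 vs 0.077.
print(round(relative_improvement(0.121, 0.077) * 100))
```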

Inference is approximately 10% faster than FashionCLIP2.0 and OpenFashionCLIP for combined text and image processing. A successor model, Marqo-FashionSigLIP-2, is available with a further 78% improvement in MRR and recall.

FAQ

What is Fashion SigLIP best for?

It is best suited to zero-shot fashion product retrieval, covering text-to-image, image-to-image, and category-based search without task-specific fine-tuning.

How does Fashion SigLIP compare to FashionCLIP?

Averaged across seven public fashion benchmarks, it improves on FashionCLIP2.0 by up to 57% (text-to-image Recall@1; text-to-image MRR improves by 45%), with roughly 10% faster inference.

What license is the model released under?

Fashion SigLIP is released under the Apache 2.0 license, allowing free use, modification, and distribution.

What are the input and output formats for this model?

Input: images (PIL or tensor) and/or text strings. Output: normalized image and text embedding vectors. The model can be loaded through Hugging Face AutoModel or OpenCLIP.
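
Because the outputs are normalized vectors, retrieval reduces to ranking candidates by dot product (cosine similarity) with the query embedding. The sketch below illustrates that step only: the 3-dimensional vectors and product names are made up, and in practice the embeddings would come from the model itself.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (name, cosine similarity) pairs, most similar first."""
    q = l2_normalize(query)
    scored = []
    for name, embedding in candidates.items():
        c = l2_normalize(embedding)
        scored.append((name, sum(a * b for a, b in zip(q, c))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: a text query and three catalog images.
query_embedding = [1.0, 0.0, 0.0]
catalog = {
    "blue jeans": [0.0, 1.0, 0.0],
    "red dress": [0.9, 0.1, 0.0],
    "red skirt": [0.7, 0.7, 0.0],
}

for name, score in rank_by_similarity(query_embedding, catalog):
    print(f"{name}: {score:.3f}")
```

Normalizing inside the function is redundant when the embeddings are already unit length, but it keeps the sketch safe for unnormalized inputs.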

How can I call Fashion SigLIP via the gigarouter API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with your API key, sending text and image inputs in the request to get embeddings or similarity scores.

not yet live

We're benchmarking and onboarding Fashion SigLIP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
