Fashion CLIP

patrickjohncyh/fashion-clip

published Feb 2023 · updated Sep 2024

Fashion CLIP is a CLIP ViT-B/32 model fine-tuned on a large fashion dataset to produce general product representations for fashion concepts, usable for zero-shot image classification.

est. price: ~$0.047 / 1k images (estimated, set at launch)

downloads / mo: 2.9M

license: MIT

specs

Task	Zero-Shot Image Classification
Architecture	ViT-B/32
Parameters	151M
License	MIT

about this model

FashionCLIP is a zero-shot image classification model that adapts the CLIP architecture (ViT-B/32) to the fashion domain by fine-tuning on a proprietary dataset of 800K fashion products from Farfetch. Gigarouter is onboarding the model as a hosted, OpenAI-compatible API (not yet live), so that developers can classify and retrieve fashion images using natural-language queries without task-specific training.
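A minimal local sketch of the zero-shot classification described above. It assumes the public weights under the repository id `patrickjohncyh/fashion-clip` load through the Hugging Face `transformers` CLIP classes, and that `torch`, `transformers`, and `Pillow` are installed. The helper names `build_prompts` and `classify`, the prompt template, and the example labels are illustrative; they are not part of the model or of the hosted API.

```python
from typing import Dict, List

MODEL_ID = "patrickjohncyh/fashion-clip"  # repository id shown at the top of this page


def build_prompts(labels: List[str], template: str = "a photo of a {}") -> List[str]:
    """Turn bare class names into short natural-language prompts (illustrative template)."""
    return [template.format(label) for label in labels]


def classify(image_path: str, labels: List[str]) -> Dict[str, float]:
    """Score one image against candidate labels and return label -> probability."""
    # Heavy dependencies are imported lazily so the helpers above stay importable.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(MODEL_ID)
    processor = CLIPProcessor.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=build_prompts(labels),
        images=image,
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (1, num_labels); softmax turns it into probabilities.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return {label: float(p) for label, p in zip(labels, probs)}
```

A call such as `classify("dress.jpg", ["dress", "sneakers", "handbag"])` (with a hypothetical local file) would return one probability per label. No labelled training data is needed; only the candidate label list changes between tasks.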

Key Strengths

Specialized for fashion: trained on standard product images (centered, white background) paired with text formed by concatenating each product's highlight and short description.
Improved zero-shot performance: fine-tuned from the laion/CLIP-ViT-B-32-laion2B-s34B-b79K checkpoint, which was itself pre-trained on 5× more data than the original OpenAI CLIP.
Published results: reported in Scientific Reports (Nature, 2022), showing that domain-specific fine-tuning yields product representations that transfer across multiple fashion benchmarks.

Benchmark Results (Weighted Macro F1 Score)

Model	FMNIST	KAGL	DEEP
OpenAI CLIP	0.66	0.63	0.45
FashionCLIP (1.0)	0.74	0.67	0.48
Laion CLIP	0.78	0.71	0.58
FashionCLIP 2.0	0.83	0.73	0.62

FashionCLIP 2.0 achieves the highest scores on all three benchmarks. Relative to FashionCLIP 1.0, the largest gain is on the diversified e-commerce dataset DEEP (+0.14, from 0.48 to 0.62), followed by FMNIST (+0.09) and KAGL (+0.06).
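The per-benchmark gains can be recomputed directly from the table above. The short script below copies the reported scores and takes differences; the `gains` helper is illustrative only.

```python
# Weighted macro F1 scores copied from the benchmark table.
scores = {
    "OpenAI CLIP":       {"FMNIST": 0.66, "KAGL": 0.63, "DEEP": 0.45},
    "FashionCLIP (1.0)": {"FMNIST": 0.74, "KAGL": 0.67, "DEEP": 0.48},
    "Laion CLIP":        {"FMNIST": 0.78, "KAGL": 0.71, "DEEP": 0.58},
    "FashionCLIP 2.0":   {"FMNIST": 0.83, "KAGL": 0.73, "DEEP": 0.62},
}


def gains(new: str, baseline: str) -> dict:
    """Absolute F1 difference per benchmark, rounded to two decimals."""
    return {
        bench: round(scores[new][bench] - scores[baseline][bench], 2)
        for bench in scores[new]
    }


gain_vs_v1 = gains("FashionCLIP 2.0", "FashionCLIP (1.0)")
gain_vs_openai = gains("FashionCLIP 2.0", "OpenAI CLIP")
gain_vs_laion = gains("FashionCLIP 2.0", "Laion CLIP")

print(gain_vs_v1)      # {'FMNIST': 0.09, 'KAGL': 0.06, 'DEEP': 0.14}
print(gain_vs_openai)  # {'FMNIST': 0.17, 'KAGL': 0.1, 'DEEP': 0.17}
print(gain_vs_laion)   # {'FMNIST': 0.05, 'KAGL': 0.02, 'DEEP': 0.04}
```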

Model Details

Architecture: ViT-B/32 image encoder + masked self-attention text encoder (151M total parameters).
Training: contrastive loss on (image, text) pairs from Farfetch; the model was updated in March 2023 to fine-tune from the LAION CLIP checkpoint.
License: MIT.
Formats: ONNX, Safetensors, PyTorch.
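The contrastive training objective mentioned above can be sketched in NumPy. This page only says "contrastive loss", so the symmetric cross-entropy (InfoNCE-style) form commonly used for CLIP-style training is an assumption here, as is the temperature of 0.07. The embeddings are random stand-ins, not outputs of the real encoders.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over a batch of matched (image, text) pairs.

    Row i of image_emb is assumed to match row i of text_emb.
    """
    # L2-normalise so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: the same, but normalising over columns.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((loss_i2t + loss_t2i) / 2)


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))  # toy embeddings

loss_matched = clip_contrastive_loss(emb, emb)         # every pair aligned
loss_shuffled = clip_contrastive_loss(emb, emb[::-1])  # every pair misaligned
```

Under this objective, correctly paired embeddings give a lower loss than mismatched ones, which is what pulls matching product images and descriptions together during fine-tuning.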

Limitations

The model inherits biases from both CLIP and the Farfetch dataset. It performs best with longer, descriptive queries and standard product images (centered, white background, no humans). It may associate clothing attributes with gender due to training data phrasing.

For further reading, see the Scientific Reports paper (Nature, 2022).

best for

· Fashion product categorization (e.g., classifying clothing type)
· Fashion image search and retrieval by natural-language description
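Retrieval by natural-language description reduces to ranking image embeddings by cosine similarity to the query's text embedding. The sketch below shows only that ranking step; the three-dimensional vectors are toy stand-ins for real encoder outputs, and `rank_images` is a hypothetical helper name.

```python
import numpy as np


def rank_images(query_emb: np.ndarray, image_embs: np.ndarray, top_k: int = 3):
    """Return (catalog index, cosine similarity) pairs, best match first."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q                      # one cosine similarity per catalog image
    order = np.argsort(-sims)[:top_k]    # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]


# Toy catalog of three "image embeddings" and one "text query embedding".
catalog = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query = np.array([0.9, 0.1, 0.0])

results = rank_images(query, catalog, top_k=2)
print([index for index, _ in results])  # [0, 2]
```

In practice the catalog embeddings would be computed once and cached, so each search costs one text-encoder pass plus a matrix-vector product.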

FAQ

What is Fashion CLIP best used for?

Fashion CLIP is best for zero-shot fashion product categorization and image retrieval using natural language queries.

What architecture does Fashion CLIP use?

It uses a ViT-B/32 vision encoder and a masked self-attention text encoder, fine-tuned from the LAION CLIP checkpoint.

How many parameters does Fashion CLIP have?

Fashion CLIP has approximately 151 million parameters.

What license is Fashion CLIP released under?

It is released under the MIT license.

How do I call Fashion CLIP via the Gigarouter API?

Once the model is live, use the Gigarouter OpenAI-compatible endpoint with your API key; the model accepts images and text prompts for zero-shot classification or similarity scoring.

not yet live

We're benchmarking and onboarding Fashion CLIP as a hosted, OpenAI-compatible API.

related zero-shot image models

clip-vit-base-patch32	22.3M dl/mo
clip-vit-large-patch14	12.4M dl/mo
CLIP-ViT-B-32-laion2B-s34B-b79K	4M dl/mo
clip-vit-large-patch14-336	3.4M dl/mo
PickScore_v1	3.2M dl/mo
siglip-so400m-patch14-384	1.8M dl/mo