Fashion CLIP
patrickjohncyh/fashion-clip
published Feb 2023 · updated Sep 2024
Fashion CLIP is a CLIP ViT-B/32 model fine-tuned on a large fashion dataset to produce general product representations for fashion concepts, suited to zero-shot image classification.
specs
| Task | Zero-Shot Image Classification |
| Architecture | ViT-B/32 |
| Parameters | 151M |
| License | MIT |
about this model
FashionCLIP is a zero-shot image classification model that adapts the CLIP architecture (ViT-B/32) for the fashion domain through fine-tuning on a proprietary dataset of 800K fashion products from Farfetch. The model is hosted by Gigarouter as an OpenAI-compatible API, enabling developers to classify and retrieve fashion images using natural language queries without task-specific training.
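Zero-shot classification with a CLIP-style model amounts to comparing one image embedding against the embeddings of several candidate text prompts and normalizing the similarities into probabilities. The sketch below illustrates only that scoring step. The 3-dimensional vectors are toy placeholders standing in for real encoder outputs, the function names are hypothetical, and the logit scale of 100 is an assumption rather than a value read from this model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, label_embs, logit_scale=100.0):
    """Return {label: probability} via a softmax over scaled cosine similarities.

    logit_scale=100.0 is an assumed temperature, not a value taken from the model.
    """
    logits = {label: logit_scale * cosine(image_emb, emb)
              for label, emb in label_embs.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Toy embeddings (placeholders): in practice these would come from the
# text encoder (one per prompt) and the image encoder (one per image).
label_embs = {
    "a photo of a red dress": [0.9, 0.1, 0.0],
    "a photo of leather boots": [0.0, 0.2, 0.9],
    "a photo of a denim jacket": [0.1, 0.9, 0.2],
}
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_emb, label_embs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 4))
```

Because the labels are ordinary text prompts, changing the label set requires no retraining, only new text embeddings.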
Key Strengths
- Specialized for fashion: trained on product images (centered, white background) paired with text formed by concatenating each product's highlight and short description.
- Improved zero-shot performance: fine-tuned from the laion/CLIP-ViT-B-32-laion2B-s34B-b79K checkpoint, which itself benefits from 5× more pre-training data than the original OpenAI CLIP.
- Peer-reviewed results: published in Scientific Reports (Nature, 2022), showing that domain-specific fine-tuning yields product representations that transfer across multiple fashion benchmarks.
Benchmark Results (Weighted Macro F1 Score)
| Model | FMNIST | KAGL | DEEP |
|---|---|---|---|
| OpenAI CLIP | 0.66 | 0.63 | 0.45 |
| FashionCLIP (1.0) | 0.74 | 0.67 | 0.48 |
| Laion CLIP | 0.78 | 0.71 | 0.58 |
| FashionCLIP 2.0 | 0.83 | 0.73 | 0.62 |
FashionCLIP 2.0 achieves the highest scores across all three benchmarks, with the largest gain on the diversified e-commerce dataset DEEP (+0.14 over FashionCLIP 1.0).
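The per-benchmark differences can be recomputed directly from the table above. The following is a small sketch of that arithmetic. The `scores` dictionary simply transcribes the table, and the helper name `gains` is illustrative.

```python
# Weighted macro F1 scores transcribed from the benchmark table.
scores = {
    "OpenAI CLIP":       {"FMNIST": 0.66, "KAGL": 0.63, "DEEP": 0.45},
    "FashionCLIP (1.0)": {"FMNIST": 0.74, "KAGL": 0.67, "DEEP": 0.48},
    "Laion CLIP":        {"FMNIST": 0.78, "KAGL": 0.71, "DEEP": 0.58},
    "FashionCLIP 2.0":   {"FMNIST": 0.83, "KAGL": 0.73, "DEEP": 0.62},
}

def gains(model, baseline):
    """Per-benchmark score difference (model minus baseline), rounded to 2 places."""
    return {bench: round(scores[model][bench] - scores[baseline][bench], 2)
            for bench in scores[model]}

gain_vs_v1 = gains("FashionCLIP 2.0", "FashionCLIP (1.0)")
gain_vs_laion = gains("FashionCLIP 2.0", "Laion CLIP")

# Benchmark on which FashionCLIP 2.0 improves most over FashionCLIP 1.0.
largest = max(gain_vs_v1, key=gain_vs_v1.get)
print(gain_vs_v1, gain_vs_laion, largest)
```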
Model Details
- Architecture: ViT-B/32 image encoder + masked self-attention text encoder (151M total parameters).
- Training: contrastive loss on (image, text) pairs from Farfetch; updated in March 2023 to start from the LAION CLIP checkpoint.
- License: MIT.
- Formats: ONNX, Safetensors, PyTorch.
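The contrastive objective mentioned under Training can be illustrated with a symmetric cross-entropy over an image-text similarity matrix, as used in CLIP-style training in general. This is a minimal pure-Python sketch under assumptions, not the authors' training code: the function names are hypothetical, the logit scale of 100 is assumed, and the 2-dimensional vectors are toy data.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def log_softmax_at(row, idx):
    """Log-probability of entry idx under a softmax over row (stable log-sum-exp)."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - lse

def clip_contrastive_loss(image_embs, text_embs, logit_scale=100.0):
    """Symmetric contrastive loss; the pair at index k is the positive for row k."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[logit_scale * sum(a * b for a, b in zip(i, t)) for t in txts]
              for i in imgs]
    # Image-to-text: each image should pick out its own text.
    loss_i2t = -sum(log_softmax_at(logits[k], k) for k in range(n)) / n
    # Text-to-image: each text should pick out its own image (use columns).
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax_at(cols[k], k) for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch: matched pairs give a loss near zero, swapped pairs a large loss.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

Minimizing this loss pulls each product image toward its own description and away from the other descriptions in the batch, which is what makes text prompts usable as class labels afterwards.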
Limitations
The model inherits biases from both CLIP and the Farfetch dataset. It performs best with longer, descriptive queries and standard product images (centered, white background, no humans). It may associate clothing attributes with gender due to training data phrasing.
For further reading, see the Scientific Reports paper cited above.
best for
- Fashion product categorization (e.g., classify clothing type)
- Fashion image search and retrieval by natural language description
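Retrieval by natural language description reduces to ranking precomputed image embeddings by their similarity to a single query embedding. The sketch below shows that ranking step only. The catalog IDs and 3-dimensional vectors are made-up placeholders, and `search` is a hypothetical helper rather than part of any published API.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_emb, catalog, k=2):
    """Return the k (product_id, score) pairs most similar to the query, best first."""
    scored = ((pid, cosine(query_emb, emb)) for pid, emb in catalog.items())
    return heapq.nlargest(k, scored, key=lambda item: item[1])

# Placeholder image embeddings, assumed to be computed once per catalog item.
catalog = {
    "sku-001": [0.9, 0.1, 0.0],
    "sku-002": [0.0, 0.2, 0.9],
    "sku-003": [0.7, 0.6, 0.1],
}
# Placeholder embedding of a text query such as "a long red evening dress".
query_emb = [1.0, 0.2, 0.0]

results = search(query_emb, catalog, k=2)
for pid, score in results:
    print(pid, round(score, 3))
```

For large catalogs the linear scan would typically be replaced by an approximate nearest-neighbor index, but the scoring itself stays the same.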
FAQ
Fashion CLIP is best for zero-shot fashion product categorization and image retrieval using natural language queries.
It uses a ViT-B/32 vision encoder and a masked self-attention text encoder, fine-tuned from the LAION CLIP checkpoint.
Fashion CLIP has approximately 151 million parameters.
It is released under the MIT license.
Use the Gigarouter OpenAI-compatible endpoint with your API key; the model accepts images and text prompts for zero-shot classification or similarity scoring.
We're benchmarking and onboarding Fashion CLIP as a hosted, OpenAI-compatible API.