Jina CLIP V1
jinaai/jina-clip-v1
published May 2024 · updated Apr 2026
Jina CLIP V1 is a multimodal (text-image) embedding model that excels in both text-to-text and text-to-image retrieval.
specs
| Spec | Value |
|---|---|
| Task | Multimodal Embedding (Text-Image) |
| Architecture | CLIP-based dual encoder |
| License | Apache-2.0 |
about this model
jinaai/jina-clip-v1 is an English multimodal text-image embedding model that achieves state-of-the-art performance on both text-image retrieval and text-text retrieval tasks within a single model, bridging the gap between dedicated text embedding models and CLIP-style cross-modal models.
Dual-Modal Capabilities
Traditional text embedding models excel at text-to-text retrieval but cannot handle cross-modal queries. Standard CLIP models effectively align image and text embeddings but are not optimized for text-to-text retrieval. jina-clip-v1 matches the text retrieval efficiency of jina-embeddings-v2-base-en while setting new benchmarks for cross-modal retrieval, enabling seamless text-to-text and text-to-image search in one model. This makes it suitable for multimodal retrieval-augmented generation (MuRAG) applications.
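To illustrate what a single shared embedding space allows, here is a minimal sketch that ranks text and image candidates together against one text query using cosine similarity. The three-dimensional vectors, the candidate names, and the helper functions `cosine` and `rank` are made up for illustration; real embeddings would come from the model and have far more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, candidates):
    """Return (name, score) pairs sorted by descending cosine similarity."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (assumed values, not model output).
query = [0.9, 0.1, 0.0]  # stands in for the embedding of a text query about cats
candidates = {
    "text: cats are small felines": [0.8, 0.2, 0.1],
    "image: cat.jpg": [0.7, 0.3, 0.0],
    "image: truck.jpg": [0.0, 0.2, 0.9],
}

ranking = rank(query, candidates)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```

Because text and image vectors live in the same space, one similarity function serves both text-to-text and text-to-image search; no second index or second model is needed in this sketch.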
Benchmark Performance
Text-Image Retrieval (Flickr30k and MSCOCO datasets; R@K is recall at rank K, higher is better):
| Dataset | Task | jina-clip | ViT-B-16 | ViT-B-32 |
|---|---|---|---|---|
| Flickr30k | Image Retr. R@1 | 0.6748 | 0.6216 | 0.597 |
| Flickr30k | Image Retr. R@5 | 0.8902 | 0.8572 | 0.8398 |
| Flickr30k | Text Retr. R@1 | 0.811 | 0.822 | 0.781 |
| Flickr30k | Text Retr. R@5 | 0.965 | 0.966 | 0.938 |
| MSCOCO | Image Retr. R@1 | 0.4111 | 0.3309 | 0.342 |
| MSCOCO | Image Retr. R@5 | 0.6644 | 0.5842 | 0.6001 |
| MSCOCO | Text Retr. R@1 | 0.5544 | 0.5242 | 0.5234 |
| MSCOCO | Text Retr. R@5 | 0.7904 | 0.767 | 0.7634 |
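The R@1 and R@5 columns report recall at rank K. As a rough sketch of how such a number can be computed, the function below uses the common definition: the fraction of queries with at least one relevant item among the top K results. The exact evaluation protocol behind the table is not stated here, so treat this as an assumption; the query IDs, image IDs, and the `recall_at_k` helper are hypothetical.

```python
def recall_at_k(rankings, relevant, k):
    """Fraction of queries with at least one relevant item in the top k results.

    rankings: dict mapping query id -> list of retrieved ids, best first.
    relevant: dict mapping query id -> set of ids judged relevant.
    """
    hits = 0
    for query_id, ranked_ids in rankings.items():
        if any(item in relevant[query_id] for item in ranked_ids[:k]):
            hits += 1
    return hits / len(rankings)

# Made-up retrieval results for four queries.
rankings = {
    "q1": ["img3", "img1", "img7"],
    "q2": ["img2", "img9", "img4"],
    "q3": ["img8", "img6", "img5"],
    "q4": ["img4", "img2", "img1"],
}
relevant = {"q1": {"img1"}, "q2": {"img2"}, "q3": {"img5"}, "q4": {"img0"}}

r_at_1 = recall_at_k(rankings, relevant, 1)  # only q2 has its match at rank 1
r_at_3 = recall_at_k(rankings, relevant, 3)  # q1, q2 and q3 have a match in the top 3
print(r_at_1, r_at_3)
```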
Text-Text Retrieval: on the STS and BEIR benchmarks, jina-clip performs comparably to or better than jina-embeddings-v2-base-en, achieving a Spearman correlation of 0.8493 on STSBenchmark and an NDCG@10 of 0.7161 on TRECCOVID.
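For readers unfamiliar with the two text metrics, the sketch below computes them on toy data. It assumes the linear-gain form of NDCG and a Spearman formula that is only valid when there are no tied values; the benchmark tooling may differ in such details. All inputs and the helper names `dcg`, `ndcg_at_k`, and `spearman` are illustrative.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_gains, all_gains, k=10):
    """NDCG@k for one query.

    ranked_gains: relevance grades in the order the system returned documents.
    all_gains: relevance grades of every judged document for the query.
    """
    ideal_dcg = dcg(sorted(all_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

def spearman(xs, ys):
    """Spearman rank correlation, assuming no ties in either list."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Toy retrieval result: the best document first, the second-best in third place.
ndcg = ndcg_at_k(ranked_gains=[2, 0, 1], all_gains=[2, 1, 0], k=10)

# Toy STS-style data: human similarity ratings vs. model cosine similarities.
human_scores = [5.0, 3.2, 1.0, 4.1]
model_scores = [0.91, 0.60, 0.20, 0.55]
rho = spearman(human_scores, model_scores)
print(round(ndcg, 4), round(rho, 4))
```

NDCG rewards placing highly relevant documents near the top of a ranking, while Spearman correlation only asks whether the model orders sentence pairs the same way human annotators do.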
The model is built with PyTorch, supports ONNX and Safetensors, and is released under the Apache-2.0 license. Its paper was presented at the MFM-EAI workshop at ICML 2024. For further details, see the full paper.
best for
- Multimodal retrieval-augmented generation (MuRAG)
- Text-to-image search
- Cross-modal similarity search
FAQ
Jina CLIP V1 is best suited to text-to-text and text-to-image retrieval, enabling multimodal search with a single model.
It matches the text retrieval efficiency of jina-embeddings-v2 while adding cross-modal capabilities.
It is licensed under Apache-2.0, as shown on the Hugging Face model page.
To call it, use the gigarouter OpenAI-compatible endpoint with an API key and send text or image inputs for embedding.
It accepts text strings and, for images, image URLs, PIL images, local filenames, and data URIs.
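As a rough sketch of what a request to an OpenAI-compatible embeddings endpoint might look like, the code below builds (but does not send) a POST request carrying one text input and one image encoded as a data URI. Several things here are assumptions rather than documented behavior: the base URL is a placeholder, the hosted model identifier may differ from the Hugging Face name, and passing images as plain strings in `input` is a guess at the schema. Only the general shape (`model` and `input` fields, bearer-token header) follows the OpenAI embeddings convention.

```python
import base64
import json
import urllib.request

# Hypothetical placeholder; replace with the real endpoint once it is published.
BASE_URL = "https://example.invalid/v1"

def to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

def build_embedding_request(inputs, api_key, base_url=BASE_URL):
    """Build an OpenAI-style embeddings request without sending it."""
    payload = {
        "model": "jinaai/jina-clip-v1",  # assumed model id for the hosted API
        "input": inputs,                 # assumed: text and image strings share this field
    }
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# The first 8 bytes of any PNG file, used here as stand-in image data.
fake_png = b"\x89PNG\r\n\x1a\n"
request = build_embedding_request(
    ["a photo of a cat", to_data_uri(fake_png)],
    api_key="YOUR_API_KEY",
)
body = json.loads(request.data)
print(request.get_method(), request.full_url)
print(body["input"][1][:22])
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and reading the JSON response; that step is left out because the endpoint and response format are not specified here.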
We're benchmarking and onboarding Jina CLIP V1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.