
Jina CLIP V1

jinaai/jina-clip-v1

published May 2024 · updated Apr 2026

Jina CLIP V1 is a multimodal (text-image) embedding model that excels in both text-to-text and text-to-image retrieval.

est. price

~$0.008

/ 1M tokens · estimated, set at launch

API providers

downloads / mo

61K

license

apache-2.0

specs

Task	Multimodal Embedding (Text-Image)
Architecture	CLIP-based dual encoder
License	Apache-2.0

about this model

jinaai/jina-clip-v1 is an English multimodal text-image embedding model that achieves state-of-the-art performance on both text-image retrieval and text-text retrieval tasks within a single model, bridging the gap between dedicated text embedding models and CLIP-style cross-modal models.

Dual-Modal Capabilities

Traditional text embedding models excel at text-to-text retrieval but cannot handle cross-modal queries. Standard CLIP models align image and text embeddings effectively but are not optimized for text-to-text retrieval. jina-clip-v1 is reported to match the text retrieval performance of jina-embeddings-v2-base-en while improving on standard CLIP baselines for cross-modal retrieval, so a single model can serve both text-to-text and text-to-image search. This makes it suitable for multimodal retrieval-augmented generation (MuRAG) applications.
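As a rough illustration of what a shared embedding space allows, the sketch below ranks a mix of text and image candidates against one text query by cosine similarity. The vectors are toy values, not real model output. The `embed_locally` helper is an assumption based on the Hugging Face `transformers` loading path for this model (`trust_remote_code=True`, `encode_text`, `encode_image`); it is defined but not called here, since it would download the model weights.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: Sequence[float],
         candidates: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def embed_locally(texts, images):
    """Assumed local usage; requires `transformers` and a model download."""
    from transformers import AutoModel  # imported lazily on purpose

    model = AutoModel.from_pretrained("jinaai/jina-clip-v1",
                                      trust_remote_code=True)
    return model.encode_text(texts), model.encode_image(images)


# Toy 3-dimensional vectors standing in for real embeddings.
query = [0.9, 0.1, 0.0]
candidates = {
    "caption: a dog on a beach": [0.8, 0.2, 0.1],
    "image: dog.jpg": [0.7, 0.1, 0.3],
    "caption: quarterly tax report": [0.0, 0.2, 0.9],
}
ranking = rank(query, candidates)
for name, score in ranking:
    print(f"{score:.3f}  {name}")
```

Because text and image vectors live in the same space, the ranking step does not need to know which modality a candidate came from.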

Benchmark Performance

Text-Image Retrieval (Flickr30k and MSCOCO datasets):

Dataset	Task	jina-clip	ViT-B-16	ViT-B-32
Flickr30k	Image Retr. R@1	0.6748	0.6216	0.597
Flickr30k	Image Retr. R@5	0.8902	0.8572	0.8398
Flickr30k	Text Retr. R@1	0.811	0.822	0.781
Flickr30k	Text Retr. R@5	0.965	0.966	0.938
MSCOCO	Image Retr. R@1	0.4111	0.3309	0.342
MSCOCO	Image Retr. R@5	0.6644	0.5842	0.6001
MSCOCO	Text Retr. R@1	0.5544	0.5242	0.5234
MSCOCO	Text Retr. R@5	0.7904	0.767	0.7634
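
For readers unfamiliar with the R@1 and R@5 columns, here is a minimal sketch of Recall@K under the common convention that a query counts as a hit when at least one relevant item appears in its top K results. The page does not state the exact evaluation protocol behind the table, so treat this as an assumption; the item IDs and rankings are made up.

```python
from typing import List, Sequence, Set


def recall_at_k(rankings: Sequence[Sequence[str]],
                relevant: Sequence[Set[str]],
                k: int) -> float:
    """Fraction of queries with at least one relevant item in the top k."""
    if not rankings:
        raise ValueError("need at least one query")
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    hits = sum(
        1
        for ranked, rel in zip(rankings, relevant)
        if any(item in rel for item in ranked[:k])
    )
    return hits / len(rankings)


# Hypothetical results for four text queries over an image collection.
rankings: List[List[str]] = [
    ["img_3", "img_1", "img_7"],  # relevant img_1 is at rank 2
    ["img_2", "img_5", "img_9"],  # relevant img_2 is at rank 1
    ["img_4", "img_6", "img_8"],  # relevant img_0 was not retrieved
    ["img_5", "img_9", "img_2"],  # relevant img_5 is at rank 1
]
relevant = [{"img_1"}, {"img_2"}, {"img_0"}, {"img_5"}]

r_at_1 = recall_at_k(rankings, relevant, 1)
r_at_3 = recall_at_k(rankings, relevant, 3)
print(r_at_1, r_at_3)
```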

Text-Text Retrieval (STS and BEIR benchmarks): jina-clip shows performance comparable or superior to jina-embeddings-v2-base-en, with a Spearman correlation of 0.8493 on STSBenchmark and an NDCG@10 of 0.7161 on TRECCOVID.
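
To make the NDCG@10 figure concrete, the following is a small sketch of NDCG@k using graded relevance as the gain and a log2 rank discount. This is one common formulation; the exact variant used for the reported score is not stated here, and the document IDs and relevance labels are invented.

```python
import math
from typing import Dict, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain; rank 1 has discount log2(2) = 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids: Sequence[str],
              relevance: Dict[str, float],
              k: int = 10) -> float:
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments: d1 is highly relevant, d2 partially relevant.
relevance = {"d1": 2.0, "d2": 1.0}

perfect = ndcg_at_k(["d1", "d2", "d3"], relevance)
shifted = ndcg_at_k(["d3", "d1", "d2"], relevance)
print(round(perfect, 4), round(shifted, 4))
```

Pushing both relevant documents down by one position lowers the score well below 1, which is why the metric rewards placing the best documents first.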

The model is built with PyTorch, supports ONNX and Safetensors, and is released under the Apache-2.0 license. Its paper was presented at the MFM-EAI workshop at ICML 2024. For further details, see the full paper.

best for

·Multimodal retrieval-augmented generation (MuRAG)
·Text-to-image search
·Cross-modal similarity search

FAQ

What is the model best for?

It is best for both text-to-text and text-to-image retrieval, enabling single-model multimodal search.

How does it compare to jina-embeddings-v2?

It matches the text retrieval performance of jina-embeddings-v2-base-en while adding cross-modal capabilities.

What is the license?

Apache-2.0, as shown on the Hugging Face model page.

How do I call it via the API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with an API key to send text or image inputs for embedding.
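
A minimal sketch of what such a call could look like for text input, assuming the endpoint follows the standard OpenAI embeddings request shape (`POST /embeddings` with `model` and `input`). The base URL below is a placeholder, the model ID on the hosted API is an assumption, and the request format for image inputs is not specified on this page. The request is only built, not sent, since the hosted model is not yet live.

```python
import json
import urllib.request
from typing import List, Sequence

# Placeholder only: replace with the real base URL from the provider docs.
BASE_URL = "https://api.example.invalid/v1"


def build_embedding_request(texts: Sequence[str],
                            api_key: str,
                            base_url: str = BASE_URL,
                            model: str = "jinaai/jina-clip-v1",
                            ) -> urllib.request.Request:
    """Build an OpenAI-style embeddings request without sending it."""
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_embeddings(response_body: str) -> List[List[float]]:
    """Extract vectors from an OpenAI-style response, ordered by index."""
    data = json.loads(response_body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]


request = build_embedding_request(["a photo of a dog"], api_key="YOUR_API_KEY")
# To send it against a real endpoint:
#   with urllib.request.urlopen(request) as resp:
#       vectors = parse_embeddings(resp.read().decode("utf-8"))
```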

What input formats does it support?

It accepts text strings for text input and, for images, image URLs, PIL images, local filenames, and data URIs.
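
Of these formats, the data URI is the least self-explanatory, so here is a short sketch of building one from raw image bytes with the Python standard library. The `to_data_uri` helper and the file name are illustrative, not part of any model or API; the bytes used are only the 8-byte PNG signature standing in for a real image file.

```python
import base64
import mimetypes


def to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Encode image bytes as a base64 data URI, guessing the MIME type
    from the file extension."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file name: {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Stand-in bytes: the PNG file signature, not a complete image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(PNG_SIGNATURE, "pixel.png")
print(uri)
```

In practice the bytes would come from reading a real file, for example `open(path, "rb").read()`.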

not yet live

We're benchmarking and onboarding Jina CLIP V1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
