
Jina CLIP V1

jinaai/jina-clip-v1

published May 2024 · updated Apr 2026

Jina CLIP V1 is a multimodal (text-image) embedding model that excels in both text-to-text and text-to-image retrieval.

est. price: ~$0.008 / 1M tokens (estimated, set at launch)
API providers: 0
downloads / mo: 61K
license: apache-2.0

specs

Task: Multimodal Embedding (Text-Image)
Architecture: CLIP-based dual encoder
License: Apache-2.0

about this model

jinaai/jina-clip-v1 is an English multimodal text-image embedding model that achieves state-of-the-art performance on both text-image retrieval and text-text retrieval tasks within a single model, bridging the gap between dedicated text embedding models and CLIP-style cross-modal models.

Dual-Modal Capabilities

Traditional text embedding models excel at text-to-text retrieval but cannot handle cross-modal queries. Standard CLIP models effectively align image and text embeddings but are not optimized for text-to-text retrieval. jina-clip-v1 matches the text retrieval efficiency of jina-embeddings-v2-base-en while setting new benchmarks for cross-modal retrieval, enabling seamless text-to-text and text-to-image search in one model. This makes it suitable for multimodal retrieval-augmented generation (MuRAG) applications.
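As a rough illustration of what single-model search means in practice: once text and images are embedded into the same vector space, one similarity function can rank both kinds of items against a text query. The sketch below uses made-up 3-dimensional vectors in place of real model output, and the helper names (`cosine_similarity`, `rank_by_similarity`) are hypothetical, not part of any Jina or gigarouter API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Return (item_id, score) pairs, most similar first.

    `candidates` maps an item id to its embedding. Items can be text or
    images: in a shared embedding space both are just vectors.
    """
    scored = [
        (item_id, cosine_similarity(query_vec, vec))
        for item_id, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings (values are invented).
index = {
    "doc:beach-guide": [0.9, 0.1, 0.0],
    "img:beach.jpg": [0.8, 0.2, 0.1],
    "img:city.jpg": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # pretend embedding of the text "sunny beach"

ranking = rank_by_similarity(query, index)
for item_id, score in ranking:
    print(f"{item_id}: {score:.3f}")
```

With real embeddings the same ranking step would apply unchanged; only the source of the vectors differs.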

Benchmark Performance

Text-Image Retrieval (Flickr30k and MSCOCO datasets):

| Dataset | Task | jina-clip | ViT-B-16 | ViT-B-32 |
|---|---|---|---|---|
| Flickr30k | Image Retr. R@1 | 0.6748 | 0.6216 | 0.597 |
| Flickr30k | Image Retr. R@5 | 0.8902 | 0.8572 | 0.8398 |
| Flickr30k | Text Retr. R@1 | 0.811 | 0.822 | 0.781 |
| Flickr30k | Text Retr. R@5 | 0.965 | 0.966 | 0.938 |
| MSCOCO | Image Retr. R@1 | 0.4111 | 0.3309 | 0.342 |
| MSCOCO | Image Retr. R@5 | 0.6644 | 0.5842 | 0.6001 |
| MSCOCO | Text Retr. R@1 | 0.5544 | 0.5242 | 0.5234 |
| MSCOCO | Text Retr. R@5 | 0.7904 | 0.767 | 0.7634 |
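For readers unfamiliar with the R@1 and R@5 columns, the sketch below shows one common definition of Recall@K: the fraction of queries for which at least one relevant item appears in the top K results. This is an assumption about the metric, not a reproduction of the evaluation script behind the numbers above, and the captions, image ids, and `recall_at_k` helper are invented for illustration.

```python
def recall_at_k(rankings, relevant, k):
    """Fraction of queries with at least one relevant item in the top k.

    rankings: query id -> list of retrieved item ids, best first.
    relevant: query id -> set of item ids judged relevant to that query.
    """
    hits = 0
    for query_id, ranked_ids in rankings.items():
        if any(item in relevant[query_id] for item in ranked_ids[:k]):
            hits += 1
    return hits / len(rankings)


# Invented example: four caption queries ranked against three images.
rankings = {
    "caption-1": ["img-a", "img-b", "img-c"],
    "caption-2": ["img-c", "img-b", "img-a"],
    "caption-3": ["img-b", "img-c", "img-a"],
    "caption-4": ["img-a", "img-c", "img-b"],
}
relevant = {
    "caption-1": {"img-a"},
    "caption-2": {"img-b"},
    "caption-3": {"img-a"},
    "caption-4": {"img-a"},
}

r_at_1 = recall_at_k(rankings, relevant, 1)
r_at_2 = recall_at_k(rankings, relevant, 2)
r_at_3 = recall_at_k(rankings, relevant, 3)
print(r_at_1, r_at_2, r_at_3)
```

Recall@K can only stay the same or rise as K grows, which is why each R@5 row in the table is higher than its R@1 counterpart.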

Text-Text Retrieval: on the STS and BEIR benchmarks, jina-clip-v1 performs comparably to or better than jina-embeddings-v2-base-en, achieving a Spearman correlation of 0.8493 on STSBenchmark and an NDCG@10 of 0.7161 on TRECCOVID.
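The two metrics quoted here can be sketched in a few lines. The version below is a simplified illustration, not the benchmark tooling: the NDCG helper uses linear gain and assumes every relevant document appears somewhere in the ranked list, and the Spearman helper assumes there are no tied values. Function names and sample data are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(
        rel / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering.

    Simplification: the ideal ordering is built from the same list, so
    relevant documents missing from the ranking are not penalized.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def spearman_no_ties(xs, ys):
    """Spearman rank correlation via the closed form valid without ties."""
    n = len(xs)
    rank_x = {value: rank for rank, value in enumerate(sorted(xs))}
    rank_y = {value: rank for rank, value in enumerate(sorted(ys))}
    d_squared = sum((rank_x[x] - rank_y[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Graded relevance of a ranked result list, best-ranked result first.
perfect = ndcg_at_k([3, 2, 1, 0], k=10)
swapped = ndcg_at_k([0, 1], k=10)

# Model similarity scores versus human similarity ratings (invented).
model_scores = [0.91, 0.40, 0.75, 0.12]
human_scores = [4.8, 2.0, 3.9, 0.5]
rho = spearman_no_ties(model_scores, human_scores)
print(perfect, swapped, rho)
```

NDCG rewards placing the most relevant documents first, while Spearman correlation only checks that the model orders sentence pairs the same way human raters do.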

The model is built with PyTorch, supports ONNX and Safetensors, and is released under the Apache-2.0 license. Its paper was presented at the MFM-EAI workshop at ICML 2024. For further details, see the full paper.

FAQ

What is the model best for?

It is best for both text-to-text and text-to-image retrieval, enabling single-model multimodal search.

How does it compare to jina-embeddings-v2?

It matches the text retrieval efficiency of jina-embeddings-v2 while adding cross-modal capabilities.

What is the license?

Apache-2.0, as shown on the Hugging Face model page.

How do I call it via the API?

The hosted API is not yet live. Once it launches, use the gigarouter OpenAI-compatible endpoint with an API key to send text or image inputs for embedding.
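A minimal sketch of what such a request might look like, assuming the endpoint follows the usual OpenAI embeddings shape (a POST to `/embeddings` with `model` and `input` fields and a bearer token). The base URL below is a placeholder, the model id is assumed to match the identifier shown on this page, and the payload format for image inputs is not documented here, so only text is shown. The request is built but not sent.

```python
import json
import urllib.request

# Placeholder base URL: the real endpoint is not given on this page.
BASE_URL = "https://api.gigarouter.example/v1"
# Assumed to match the identifier shown on this page.
MODEL_ID = "jinaai/jina-clip-v1"


def build_embeddings_request(texts, api_key):
    """Build (but do not send) an OpenAI-style embeddings request."""
    payload = {"model": MODEL_ID, "input": texts}
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embeddings_request(["a photo of a beach"], api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)

# Sending is left commented out because the placeholder URL does not exist.
# In the OpenAI response shape, vectors are found under data[i].embedding:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     vector = body["data"][0]["embedding"]
```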

What input formats does it support?

It accepts text strings as text input, and image URLs, PIL images, local filenames, or data URIs as image input.
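Of these formats, the data URI is the one most likely to need a helper, since it lets raw image bytes travel as a plain string. The sketch below builds one with the standard library; the `to_data_uri` name is hypothetical, and the bytes used are only the 8-byte PNG signature rather than a complete image.

```python
import base64


def to_data_uri(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URI string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Stand-in payload: the PNG file signature, not a full image.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)
print(uri)
```

In real use the bytes would come from reading an image file in binary mode, with `mime_type` set to match the file type.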

not yet live

We're benchmarking and onboarding Jina CLIP V1 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
