E5-V

royokong/e5-v

published Jul 2024 · updated Apr 2026

E5-V is a multimodal embedding model that adapts a multimodal large language model to produce universal text and image embeddings.

est. price

~$0.008

/ 1M tokens · estimated, set at launch

downloads / mo

58.2K

specs

Task	Multimodal Embeddings
Architecture	LLaVA-Next (Llama3-LLaVA-Next-8B)
Parameters	8B
Embedding Dimension	4096

about this model

E5-V is a multimodal embedding model that adapts a multimodal large language model (MLLM) to produce universal vector representations for both text and image inputs. Fine-tuned from lmms-lab/llama3-llava-next-8b, it outputs normalized 4096-dimensional embeddings suitable for cross-modal retrieval, similarity scoring, and clustering.

Key capabilities

E5-V bridges the modality gap through prompt-guided encoding. Text inputs are automatically wrapped with the instruction “Summary above sentence in one word:” and image inputs with “Summary above image in one word:”. The model takes the final token's vector from the last hidden state as the representation, and the resulting embeddings are L2-normalized.
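The encoding steps described above can be sketched in a few lines. This is a minimal illustration, not the model's actual code: the helper names (`wrap_text`, `wrap_image`, `pool_last_token`), the `<image>` placeholder, and the placement of the input before the instruction are assumptions, and the hidden states are a toy list of per-token vectors rather than real model output.

```python
import math

TEXT_INSTRUCTION = "Summary above sentence in one word:"
IMAGE_INSTRUCTION = "Summary above image in one word:"


def wrap_text(sentence: str) -> str:
    # Assumed layout: the sentence first, then the instruction that refers to it.
    return f"{sentence}\n{TEXT_INSTRUCTION}"


def wrap_image(image_placeholder: str = "<image>") -> str:
    # "<image>" is a hypothetical stand-in for where the image features would go.
    return f"{image_placeholder}\n{IMAGE_INSTRUCTION}"


def pool_last_token(hidden_states: list[list[float]]) -> list[float]:
    """Take the final token's vector (position -1) and L2-normalize it."""
    last = hidden_states[-1]
    norm = math.sqrt(sum(x * x for x in last))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in last]


# Toy last-layer hidden states: 3 tokens, 2 dimensions each (real ones have 4096).
toy_hidden = [[1.0, 0.0], [0.5, 0.5], [3.0, 4.0]]
embedding = pool_last_token(toy_hidden)

print(wrap_text("A dog sitting in the grass."))
print(embedding)
```

Only the last row of `toy_hidden` contributes to the result, which mirrors the idea that the final token is prompted to summarize everything before it.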

Single-modality training advantage

Unlike conventional approaches that require expensive image-text pair training, E5-V is trained exclusively on text pairs. This single-modality strategy reduces training costs by approximately 95% while often surpassing the performance of multimodal-trained models. The approach eliminates the need for costly multimodal data collection.

Benchmark results

E5-V matches or surpasses state-of-the-art performance across four task types (text-image retrieval, composed image retrieval, image-image retrieval, and sentence embeddings) despite being trained on text only. Evaluation scripts cover the COCO, Flickr30k, FashionIQ, CIRR, and STS benchmarks.

best for

·Text-image similarity search and retrieval
·Composed image retrieval with text queries
·Image-to-image retrieval
·Sentence embedding for clustering or classification
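For the retrieval use cases listed above, ranking reduces to a dot product: because the embeddings are L2-normalized, the dot product of two vectors equals their cosine similarity. A minimal sketch under that assumption follows. The `rank_by_similarity` helper, the file names, and the 3-dimensional toy vectors are illustrative stand-ins for real 4096-dimensional E5-V embeddings.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(
    query: list[float], candidates: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, dot(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy unit-length vectors standing in for a text query and three image embeddings.
query_embedding = [1.0, 0.0, 0.0]
image_embeddings = {
    "car.jpg": [0.0, 0.0, 1.0],
    "cat.jpg": [0.6, 0.8, 0.0],
    "dog.jpg": [0.8, 0.6, 0.0],
}

ranking = rank_by_similarity(query_embedding, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The same function works for text-to-image, image-to-image, or sentence-to-sentence search, since all inputs share one embedding space.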

FAQ

What input types does E5-V support?

E5-V accepts both text sentences and image URLs as input, producing a 4096-dimensional embedding for each.

How is the embedding extracted from the model?

The embedding is taken from the last hidden state of the final token (position -1).

What makes E5-V different from other multimodal embedding models?

E5-V is trained exclusively on text pairs (single-modality training), which reduces training costs by ~95% while matching or surpassing state-of-the-art results on four task types.

How can I call this model via the gigarouter API?

Once E5-V is live, use the standard OpenAI-compatible endpoint with your gigarouter API key: send text or image inputs in the request and receive normalized embeddings in the response.
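As a rough sketch of what such a call could look like, the snippet below builds (but does not send) a text-only request in the shape OpenAI-compatible embeddings endpoints generally expect: a POST to `/embeddings` with `model` and `input` fields and a Bearer token. Everything specific here is an assumption: the base URL is a placeholder, the `GIGAROUTER_API_KEY` environment variable is hypothetical, the model id simply reuses the identifier shown on this page, and the format for image inputs is not specified here, so it is left out.

```python
import json
import os
import urllib.request

# Hypothetical placeholder; substitute the real base URL from the provider docs.
BASE_URL = "https://api.example.invalid/v1"


def build_embedding_request(
    texts: list[str], api_key: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build an OpenAI-style embeddings request without sending it."""
    # Assumed model id, taken from the identifier shown on this page.
    payload = {"model": "royokong/e5-v", "input": texts}
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embedding_request(
    ["A dog sitting in the grass."],
    # Hypothetical environment variable name, with a dummy fallback.
    api_key=os.environ.get("GIGAROUTER_API_KEY", "sk-placeholder"),
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it against a real endpoint: urllib.request.urlopen(request)
```

An OpenAI-compatible response would typically return each vector under `data[i].embedding`, in the same order as the inputs.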

not yet live

We're benchmarking and onboarding E5-V as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related embeddings models

nomic-embed-text-v1.5