E5-V
royokong/e5-v
published Jul 2024 · updated Apr 2026
E5-V is a multimodal embedding model that adapts large multimodal language models to produce universal text and image embeddings.
specs
| Task | Multimodal Embeddings |
| Architecture | LLaVA-Next (Llama3-LLaVA-Next-8B) |
| Parameters | 8B |
| Embedding Dimension | 4096 |
about this model
E5-V is a multimodal embedding model that adapts a multimodal large language model (MLLM) to produce universal vector representations for both text and image inputs. Fine-tuned from lmms-lab/llama3-llava-next-8b, it outputs normalized 4096-dimensional embeddings suitable for cross-modal retrieval, similarity scoring, and clustering.
Key capabilities
E5-V bridges the modality gap through prompt-guided encoding. Text inputs are automatically wrapped with the instruction “Summary above sentence in one word:” and image inputs with “Summary above image in one word:”. The embedding is the hidden state of the final token in the last layer, which is then L2-normalized.
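The encoding steps described above can be sketched in plain Python. This is an illustrative sketch, not the model's actual implementation: the instruction strings are quoted from the description, while the helper names (`wrap_text`, `last_token_embedding`), the newline-separated prompt layout, and the toy 2-dimensional hidden states are assumptions made for the example. In practice the hidden states would come from the MLLM's forward pass and each vector would have 4096 entries.

```python
import math

# Instruction strings as quoted in the model description.
TEXT_INSTRUCTION = "Summary above sentence in one word:"
IMAGE_INSTRUCTION = "Summary above image in one word:"


def wrap_text(sentence: str) -> str:
    """Place the sentence above the instruction (assumed layout, for illustration)."""
    return f"{sentence}\n{TEXT_INSTRUCTION}"


def last_token_embedding(hidden_states):
    """Take the final token's vector from the last layer and L2-normalize it.

    hidden_states: list of per-token vectors (lists of floats) for one sequence.
    """
    last = hidden_states[-1]  # position -1: the final token
    norm = math.sqrt(sum(x * x for x in last))
    if norm == 0.0:
        return list(last)  # avoid division by zero for an all-zero vector
    return [x / norm for x in last]


# Toy example: two tokens with 2-dimensional hidden states.
prompt = wrap_text("A dog sitting in the grass.")
embedding = last_token_embedding([[1.0, 0.0], [3.0, 4.0]])
```

Because the result has unit length, the dot product of two such embeddings equals their cosine similarity.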
Single-modality training advantage
Unlike conventional approaches that require expensive training on image-text pairs, E5-V is trained exclusively on text pairs. This single-modality strategy reduces training costs by approximately 95% while often surpassing the performance of multimodal-trained models, and it removes the need for costly multimodal data collection.
Benchmark results
E5-V matches or surpasses state-of-the-art performance across four task types (text-image retrieval, composed image retrieval, image-image retrieval, and sentence embeddings) despite being trained on text only. Evaluation scripts cover the COCO, Flickr30k, FashionIQ, CIRR, and STS benchmarks.
best for
- Text-image similarity search and retrieval
- Composed image retrieval with text queries
- Image-to-image retrieval
- Sentence embedding for clustering or classification
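As a rough illustration of the similarity-search use case listed above, the sketch below ranks candidate embeddings against a query by dot product, which equals cosine similarity when the embeddings are L2-normalized. The function names and the toy 2-dimensional vectors are hypothetical stand-ins for real 4096-dimensional E5-V outputs.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, dot(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy unit vectors standing in for a text query and three image embeddings.
text_query = [1.0, 0.0]
image_embeddings = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
ranking = rank_by_similarity(text_query, image_embeddings)
```

The same ranking function applies to image-to-image or text-to-text retrieval, since both modalities share one embedding space.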
FAQ
E5-V accepts both text sentences and image URLs as input, producing a 4096-dimensional embedding for each.
The embedding is taken from the last hidden state of the final token (position -1).
E5-V is trained exclusively on text pairs (single-modality training), which reduces training costs by about 95% while matching or surpassing state-of-the-art results on four task types.
Use the standard OpenAI-compatible endpoint with your gigarouter API key, send text or image inputs in the request, and receive normalized embeddings.
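A minimal sketch of such a request, assuming the endpoint follows the OpenAI embeddings format (a POST to `/embeddings` with `model` and `input` fields and a Bearer token). The base URL and API key below are placeholders, and using `royokong/e5-v` as the model identifier is an assumption based on the repository name. How image inputs are encoded in the request is not specified here, so the example sends text only and builds the request without sending it.

```python
import json
import urllib.request

# Placeholder values: replace with the real base URL and your own API key.
BASE_URL = "https://api.example.com/v1"
API_KEY = "YOUR_API_KEY"


def build_embedding_request(texts, model="royokong/e5-v"):
    """Build (but do not send) a POST request in the OpenAI embeddings format."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embedding_request(["A dog sitting in the grass."])
# To send: urllib.request.urlopen(request) and parse the JSON response.
# In the OpenAI format, each vector is returned under data[i]["embedding"].
```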
We're benchmarking and onboarding E5-V as a hosted, OpenAI-compatible API.