E5 Omni 7B
Haon-Chen/e5-omni-7B
published Jan 2026 · updated Apr 2026
E5 Omni 7B is a visual-document-retrieval model that produces unified embeddings for text, images, audio, and video, enabling cross-modal retrieval.
specs
| Task | Visual Document Retrieval |
| Architecture | Qwen2.5-Omni-7B |
| Parameters | 7B |
about this model
e5-omni-7B is an omni-modal embedding model built on Qwen2.5-Omni-7B that produces a unified embedding space for text, images, audio, and video, enabling cross-modal retrieval with a single model.
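Because every modality lands in the same vector space, cross-modal retrieval reduces to ranking items by similarity to the query vector. The following is a minimal sketch of that idea, assuming cosine similarity as the scoring function; the three-dimensional vectors and the item names are made-up stand-ins for real embeddings, and `cosine` and `rank` are hypothetical helpers, not part of any model API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, corpus):
    """Return corpus keys sorted by descending cosine similarity to the query."""
    return sorted(corpus, key=lambda k: cosine(query_vec, corpus[k]), reverse=True)


# Toy 3-d vectors standing in for real embeddings (values are made up).
corpus = {
    "image:chart.png": [0.9, 0.1, 0.0],
    "audio:speech.wav": [0.1, 0.9, 0.1],
    "video:tutorial.mp4": [0.2, 0.2, 0.9],
}
text_query = [0.8, 0.2, 0.1]  # pretend embedding of a text query

ranking = rank(text_query, corpus)
```

With these toy values the image entry ranks first, since its vector points in nearly the same direction as the query. In practice the corpus vectors would come from the model, and a vector index would replace the linear scan.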
Key Strengths
Unlike models that rely on implicit alignment from vision-language backbones, e5-omni applies three explicit alignment techniques:
- modality-aware temperature calibration to align similarity scales across modalities
- a controllable negative curriculum with debiasing that focuses on confusing negatives while reducing false negative impact
- batch whitening with covariance regularization to match cross-modal geometry in the shared embedding space
These components address common issues in omni-modal embeddings: inconsistent score sharpness, imbalanced in-batch negative hardness, and mismatched first- and second-order statistics across modalities.
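To make the "score sharpness" issue concrete, here is a minimal sketch of an InfoNCE-style contrastive loss evaluated with a different temperature per modality. It is an illustration under stated assumptions, not the paper's implementation: the `TEMPERATURE` table, its values, and the `info_nce` helper are hypothetical, and the similarities are made up.

```python
import math

# Hypothetical per-modality temperatures; illustrative values only.
TEMPERATURE = {"image": 0.05, "audio": 0.10}


def info_nce(sims, positive_index, temperature):
    """Cross-entropy of the positive among all candidates after scaling by 1/temperature."""
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[positive_index]


# Made-up cosine similarities for one query; index 0 is the positive item.
sims = [0.8, 0.6, 0.1]

loss_sharp = info_nce(sims, 0, TEMPERATURE["image"])  # lower temperature
loss_soft = info_nce(sims, 0, TEMPERATURE["audio"])   # higher temperature
```

The same raw similarities yield a smaller loss under the lower temperature, because dividing by a smaller value widens the gap between the positive and the negatives. A single shared temperature therefore treats modalities with different similarity ranges unevenly, which is the kind of mismatch a per-modality temperature is meant to correct.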
Benchmark Performance
The model achieves strong results on the MMEB-V2 and AudioCaps benchmarks.
Full experimental results and comparisons to bi-modal and omni-modal baselines are documented in the associated paper (arXiv:2601.03666). The explicit alignment recipe also transfers to other VLM backbones.
best for
- Image document retrieval (charts, PDFs, screenshots)
- Video retrieval (tutorials, clips)
- Audio retrieval (music, speech)
- Multilingual text retrieval
FAQ
E5 Omni 7B accepts text, image, audio, and video inputs.
To call the model, use the gigarouter OpenAI-compatible endpoint with an API key; see the gigarouter documentation for details.
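As a rough sketch of what a call to an OpenAI-compatible embeddings endpoint looks like, the snippet below builds a request without sending it. The base URL is a placeholder, the model ID is assumed to match the repository name, and `build_embedding_request` is a hypothetical helper; the actual values, and whether non-text inputs are accepted through this route, should be taken from the gigarouter documentation.

```python
import json
import urllib.request

# Placeholder values: take the real base URL and model ID from the gigarouter docs.
BASE_URL = "https://api.example.invalid/v1"  # hypothetical, not a real endpoint
MODEL_ID = "Haon-Chen/e5-omni-7B"            # assumed to match the repo name


def build_embedding_request(text, api_key):
    """Build (but do not send) a POST request in the OpenAI embeddings format."""
    body = json.dumps({"model": MODEL_ID, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embedding_request("quarterly revenue chart", api_key="YOUR_API_KEY")
# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Building the request separately from sending it keeps the example runnable offline and makes the payload easy to inspect before any credentials are used.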
The model card does not specify a license.
We're benchmarking and onboarding E5 Omni 7B as a hosted, OpenAI-compatible API.