models / visual document retrieval · coming soon

E5-Omni 3B

Haon-Chen/e5-omni-3B

published Jan 2026 · updated Apr 2026

E5-Omni 3B is a visual-document-retrieval model that produces a single, unified embedding space for text, images, audio, and video, enabling accurate cross-modal retrieval.

status

coming soon

API providers

downloads / mo

license

mit

specs

Task	Omni-modal embedding and retrieval
Architecture	Based on Qwen2.5-Omni-3B
Parameters	3B
License	MIT

about this model

Haon-Chen/e5-omni-3B is an omni-modal embedding model that maps text, images, audio, and video into a unified embedding space, purpose-built for cross-modal retrieval tasks such as visual-document retrieval. It is built on Qwen2.5-Omni-3B and employs an explicit alignment recipe to overcome three common issues in omni-modal embeddings: modality-dependent similarity sharpness, in-batch negative hardness imbalance, and mismatched cross-modal statistics. The model combines modality-aware temperature calibration, a controllable negative curriculum with debiasing, and batch whitening with covariance regularization.

Benchmark Performance

On the MMEB-V2 and AudioCaps benchmarks, e5-omni-3B achieves consistent gains over strong bi-modal and omni-modal baselines.

Benchmark results on MMEB-V2 comparing e5-omni-3B against baselines

Benchmark results on AudioCaps comparing e5-omni-3B against baselines

Key Capabilities

Cross-modal retrieval across text, images, audio, and video from a single model.
Produces normalized embeddings using the last hidden state of the final token for direct cosine similarity comparison.
Supports multilingual text retrieval (demonstrated with Chinese queries).
Document embedding can combine multiple modalities (e.g., text + video, image + audio) via a single dict input.

Additional Details

Released under the MIT license.
Associated paper: e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings (arXiv, Jan 2026).
A 7B variant (e5-omni-7B) is also available.

best for

·Cross-modal document retrieval (e.g., searching images or videos with text queries)
·Multilingual text retrieval and embedding
·Multimodal document search combining text, image, audio, and video inputs

FAQ

What modalities does E5-Omni 3B support?

It supports text, image, audio, and video inputs, producing a unified embedding space for all.

How does E5-Omni 3B compare to bi-modal models?

It explicitly aligns modalities using temperature calibration, negative curriculum, and batch whitening, achieving consistent gains on MMEB-V2 and AudioCaps benchmarks.

What is the license for E5-Omni 3B?

It is released under the MIT license.

How can I call E5-Omni 3B via the API?

Use the gigarouter OpenAI-compatible endpoint with an API key to send queries and documents for embedding.

What is the input format for multimodal documents?

Pass a dict with keys like "text", "image", "audio", or "video" to encode combined modalities.

not yet live

We're benchmarking and onboarding E5-Omni 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.