gigarouter

E5-Omni 3B

Haon-Chen/e5-omni-3B

published Jan 2026 · updated Apr 2026

E5-Omni 3B is an omni-modal embedding model that maps text, images, audio, and video into a single embedding space for cross-modal retrieval, including visual-document retrieval.

status: coming soon
API providers: 0
downloads / mo: 76
license: MIT

specs

Task: Omni-modal embedding and retrieval
Architecture: Based on Qwen2.5-Omni-3B
Parameters: 3B
License: MIT

about this model

Haon-Chen/e5-omni-3B is an omni-modal embedding model that maps text, images, audio, and video into a unified embedding space, purpose-built for cross-modal retrieval tasks such as visual-document retrieval. It is built on Qwen2.5-Omni-3B and employs an explicit alignment recipe to overcome three common issues in omni-modal embeddings: modality-dependent similarity sharpness, in-batch negative hardness imbalance, and mismatched cross-modal statistics. The model combines modality-aware temperature calibration, a controllable negative curriculum with debiasing, and batch whitening with covariance regularization.
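To make "modality-dependent similarity sharpness" concrete, here is a minimal sketch of how a per-modality temperature changes a contrastive (InfoNCE-style) loss. It is not the model's training code: the temperature values, the `TEMPERATURES` table, and the `info_nce_loss` helper are illustrative assumptions, and the actual calibration recipe is not described on this page.

```python
import math

# Hypothetical per-modality temperatures (assumed values, not from the source).
TEMPERATURES = {"text": 0.05, "image": 0.07, "audio": 0.10, "video": 0.07}


def info_nce_loss(similarities, positive_index, temperature):
    """Cross-entropy over similarity scores scaled by 1 / temperature.

    similarities: cosine similarities between one query and each candidate
    positive_index: position of the matching candidate in `similarities`
    """
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_sum_exp - logits[positive_index]


# Same raw similarities, scored under two different temperatures.
similarities = [0.8, 0.3, 0.1]  # the first candidate is the positive
loss_text = info_nce_loss(similarities, 0, TEMPERATURES["text"])
loss_audio = info_nce_loss(similarities, 0, TEMPERATURES["audio"])
```

With identical raw similarities, the lower temperature produces a sharper softmax and therefore a smaller loss for a correctly ranked positive. This is why a single shared temperature can treat modalities with different similarity ranges unevenly.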

Benchmark Performance

On the MMEB-V2 and AudioCaps benchmarks, e5-omni-3B achieves consistent gains over strong bi-modal and omni-modal baselines.

[Figure: Benchmark results on MMEB-V2 comparing e5-omni-3B against baselines]
[Figure: Benchmark results on AudioCaps comparing e5-omni-3B against baselines]

Key Capabilities

  • Cross-modal retrieval across text, images, audio, and video from a single model.
  • Produces normalized embeddings from the last hidden state of the final token, so they can be compared directly with cosine similarity.
  • Supports multilingual text retrieval (demonstrated with Chinese queries).
  • Document embedding can combine multiple modalities (e.g., text + video, image + audio) via a single dict input.
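The pooling and comparison steps named in the list above can be sketched in plain Python. This is a toy illustration, not the model's own code: it assumes right-padded sequences, assumes the normalization is L2, and uses hand-written hidden states in place of real model outputs. The helper names are hypothetical.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last real (non-padding) token.

    hidden_states: list of per-token vectors, shape (seq_len, dim)
    attention_mask: list of 0/1 flags, 1 for real tokens (right padding assumed)
    """
    last_index = sum(attention_mask) - 1
    if last_index < 0:
        raise ValueError("sequence contains no real tokens")
    return hidden_states[last_index]


def l2_normalize(vector):
    """Scale a vector to unit length (L2 norm is an assumption here)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(unit_a, unit_b):
    """For unit-length vectors, the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(unit_a, unit_b))


# Toy hidden states; the query has one padding position at the end.
query_states = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
doc_states = [[0.5, 0.5], [6.0, 8.0]]
doc_mask = [1, 1]

query_emb = l2_normalize(last_token_pool(query_states, query_mask))
doc_emb = l2_normalize(last_token_pool(doc_states, doc_mask))
score = cosine_similarity(query_emb, doc_emb)
```

Because both embeddings are unit length, ranking documents by dot product and ranking them by cosine similarity give the same order.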

FAQ

What modalities does E5-Omni 3B support?

It supports text, image, audio, and video inputs and maps all of them into one unified embedding space.

How does E5-Omni 3B compare to bi-modal models?

It explicitly aligns modalities using modality-aware temperature calibration, a controllable negative curriculum, and batch whitening, and it achieves consistent gains over strong bi-modal and omni-modal baselines on the MMEB-V2 and AudioCaps benchmarks.

What is the license for E5-Omni 3B?

It is released under the MIT license.

How can I call E5-Omni 3B via the API?

The model is not yet live on gigarouter. Once it launches, it will be available through the gigarouter OpenAI-compatible endpoint: authenticate with an API key and send queries and documents for embedding.

What is the input format for multimodal documents?

Pass a dict with keys like "text", "image", "audio", or "video" to encode combined modalities.
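As a rough sketch of that input shape, the snippet below builds a combined text + video document as a plain dict and rejects unknown keys. The `make_document` helper and the file path are hypothetical, and whether each value should be a path, a URL, or raw bytes is an assumption; this page does not specify it.

```python
# Modality keys named in the FAQ answer above.
ALLOWED_KEYS = {"text", "image", "audio", "video"}


def make_document(**parts):
    """Build a multimodal document dict, rejecting unsupported keys.

    Hypothetical helper: it only checks the key names, not the value types.
    """
    if not parts:
        raise ValueError("a document needs at least one modality")
    unknown = set(parts) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unsupported modality keys: {sorted(unknown)}")
    return dict(parts)


# A document that combines a caption with a video clip (placeholder path).
doc = make_document(text="A cat plays the piano.", video="clips/cat_piano.mp4")
```

Validating keys up front catches typos such as "vidio" before a request is sent.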

not yet live

We're benchmarking and onboarding E5-Omni 3B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
