NV-Embed v2
nvidia/NV-Embed-v2
published Aug 2024 · updated Jul 2025
NV-Embed v2 is a generalist embedding model that ranked No. 1 on the MTEB leaderboard as of August 30, 2024, with a score of 72.31 across 56 text embedding tasks.
specs
| Spec | Value |
| --- | --- |
| Task | Text Embedding |
| Architecture | Decoder-only LLM (Mistral-7B-v0.1) with Latent-Attention pooling |
| Parameters | 7B |
| License | CC-BY-NC-4.0 (non-commercial) |
about this model
NV-Embed-v2 is an embedding model that produces dense vector representations of text, optimized for retrieval, semantic search, and other natural language tasks. It is built on the Mistral-7B-v0.1 decoder-only LLM and uses a latent-attention pooling layer to produce pooled embeddings, an approach that consistently outperforms mean pooling and last-token pooling. The model removes the causal attention mask during contrastive training to improve representation learning. Training follows a two-stage instruction-tuning method: the first stage applies contrastive training on retrieval datasets with in-batch negatives and curated hard negatives; the second stage blends non-retrieval tasks into instruction tuning, which improves both retrieval and non-retrieval accuracy.
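To make the pooling step concrete, here is a minimal single-head sketch in plain Python. It assumes the formulation described in the NV-Embed paper, where each token's last-layer hidden state acts as a query over a trainable latent array that supplies both keys and values, and the attended outputs are mean-pooled into one vector. The function name `latent_attention_pool`, the scaling by the square root of the dimension, and the toy sizes are illustrative assumptions. The sketch also omits the multi-head split and the MLP that the paper applies before pooling, so it should not be read as the model's actual implementation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def latent_attention_pool(hidden_states, latents):
    """Pool token hidden states (seq_len x d) into a single d-dim vector.

    hidden_states: one d-dim vector per token (used as attention queries).
    latents: r trainable d-dim vectors (used as both keys and values).
    """
    d = len(latents[0])
    scale = 1.0 / math.sqrt(d)  # assumption: standard scaled dot-product attention
    attended = []
    for h in hidden_states:
        # Attention weights of this token over the latent array.
        weights = softmax([dot(h, z) * scale for z in latents])
        # Weighted sum of latent values replaces the token representation.
        attended.append(
            [sum(w * z[j] for w, z in zip(weights, latents)) for j in range(d)]
        )
    # Mean-pool the attended token representations into one embedding.
    n = len(attended)
    return [sum(tok[j] for tok in attended) / n for j in range(d)]


# Toy usage: 5 tokens, 4 latents, dimension 8 (the real model uses 4096 dims).
random.seed(0)
toy_hidden = [[random.gauss(0, 1) for _ in range(8)] for _ in range(5)]
toy_latents = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
toy_embedding = latent_attention_pool(toy_hidden, toy_latents)
```

Because every output is a convex combination of the latent vectors, the pooled embedding lives in the span the latent array has learned, rather than depending on a single token position as last-token pooling does.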
Benchmark Performance
As of August 30, 2024, NV-Embed-v2 holds the No. 1 position on the Massive Text Embedding Benchmark (MTEB) leaderboard with an overall score of 72.31 across 56 tasks. It also ranks No. 1 in the retrieval sub-category with a score of 62.65 across 15 tasks, making it a strong foundation for retrieval-augmented generation (RAG) pipelines. Additionally, it achieved the highest scores in the Long Doc section and the second-highest scores in the QA section of the AIR Benchmark, which covers out-of-domain information retrieval topics beyond MTEB.
Technical Highlights
- Pooling: Latent-Attention (embedding dimension: 4096).
- Training: Hard-negative mining methods that use positive relevance scores to remove false negatives.
- Acceptance: The underlying research paper (NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models) was accepted at ICLR 2025 as a Spotlight paper.
- Compression: The paper provides analysis of model compression techniques (pruning, knowledge distillation, quantization) for generalist embedding models.
For detailed technical descriptions, refer to the paper: NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.
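The hard-negative mining idea listed above, using the positive's relevance score to discard likely false negatives, can be sketched as a simple filter. This is a hedged illustration rather than the paper's exact procedure: the function name `filter_hard_negatives`, the `max_ratio` threshold of 0.95, the `top_k` cutoff, and the example scores are all assumptions chosen for readability.

```python
def filter_hard_negatives(positive_score, candidates, max_ratio=0.95, top_k=4):
    """Select hard negatives while dropping likely false negatives.

    positive_score: relevance score of the labeled positive passage.
    candidates: list of (passage_id, score) pairs from a teacher retriever.
    max_ratio: hypothetical threshold; a candidate scoring at or above
        max_ratio * positive_score is treated as a probable false negative.
    top_k: hypothetical number of negatives to keep per query.
    """
    ceiling = positive_score * max_ratio
    # Candidates too close to (or above) the positive are probably relevant
    # passages that were never labeled, so they are removed.
    kept = [(pid, score) for pid, score in candidates if score < ceiling]
    # The highest-scoring survivors are the hardest remaining negatives.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]


# Toy usage: "a" and "d" score almost as high as the positive and are dropped.
example_negatives = filter_hard_negatives(
    positive_score=0.80,
    candidates=[("a", 0.79), ("b", 0.70), ("c", 0.40), ("d", 0.77)],
)
```

Tying the cutoff to the positive's score, instead of using a fixed absolute threshold, lets the filter adapt to queries whose scores are uniformly high or low.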
best for
- Dense retrieval for RAG systems
- Semantic textual similarity and clustering
- Long-document embedding and retrieval
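For the retrieval and similarity use cases above, embeddings are typically compared with cosine similarity. The following is a minimal sketch in plain Python; the helper names `l2_normalize` and `rank_passages` are hypothetical, and the 2-dimensional toy vectors stand in for the model's 4096-dimensional outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, cosine_similarity) pairs, best match first."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, passage in enumerate(passage_vecs):
        p = l2_normalize(passage)
        # The dot product of unit vectors equals their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, p))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy usage: passage 1 points the same way as the query, so it ranks first.
ranking = rank_passages([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
```

In a RAG pipeline, the top-ranked passages from a step like this would be passed to the generator as context.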
FAQ
What is the embedding dimension?
The embedding dimension is 4096.
What is the maximum sequence length?
The model supports a maximum sequence length of 32,768 tokens.
Can NV-Embed v2 be used commercially?
No. The model is licensed under CC-BY-NC-4.0 and cannot be used for commercial purposes. For commercial use, NVIDIA recommends NeMo Retriever NIMs.
How do I call the model?
Use the gigarouter OpenAI-compatible endpoint with your API key, sending the input text and the appropriate instruction prefix for the task.
How does NV-Embed v2 differ from other embedding models?
It uses a latent-attention pooling layer instead of mean or last-token pooling, removes the causal attention mask during contrastive training, and employs a two-stage instruction-tuning method with novel hard-negative mining.
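The request described in the FAQ (input text plus a task instruction prefix) might be assembled as follows. This sketch only builds the JSON body and sends nothing. The `Instruct: ...` / `Query: ...` prefix layout is assumed from the upstream NV-Embed-v2 model card, the `model` and `input` fields are assumed from the OpenAI embeddings request shape, and the helper names are hypothetical. The gigarouter endpoint URL and authentication details are not given here, so they are left out.

```python
import json

MODEL_ID = "nvidia/NV-Embed-v2"


def build_query_input(task_instruction, query):
    """Prefix a query with its task instruction (assumed prompt layout)."""
    return f"Instruct: {task_instruction}\nQuery: {query}"


def build_embeddings_request(texts, model=MODEL_ID):
    """Serialize an OpenAI-style embeddings request body (assumed shape)."""
    return json.dumps({"model": model, "input": texts})


# Queries carry the instruction prefix; passages are assumed to be sent as-is.
instruction = "Given a question, retrieve passages that answer the question"
query_text = build_query_input(instruction, "How does latent-attention pooling work?")
request_body = build_embeddings_request([query_text])
```

The resulting `request_body` string would be POSTed to the embeddings route with an `Authorization` header carrying the API key, once the hosted endpoint is available.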
We're benchmarking and onboarding NV-Embed v2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.