NV-Embed v2

nvidia/NV-Embed-v2

published Aug 2024 · updated Jul 2025

NV-Embed v2 is a generalist embedding model that ranked No. 1 on the MTEB leaderboard as of August 30, 2024, with a score of 72.31 across 56 text embedding tasks.

est. price	~$0.008 / 1M tokens (estimated, set at launch)

API providers

downloads / mo	24.5K

license	cc-by-nc-4.0

specs

Task	Text Embedding
Architecture	Decoder-only LLM (Mistral-7B-v0.1) with Latent-Attention pooling
Parameters	7B
License	CC-BY-NC-4.0 (non-commercial)

about this model

NV-Embed-v2 is an embedding model that produces dense vector representations of text, optimized for retrieval, semantic search, and other natural language tasks. It is built on the Mistral-7B-v0.1 decoder-only LLM and pools token states through a latent-attention layer, an approach the paper reports as consistently outperforming mean pooling and last-token pooling. The causal attention mask is removed during contrastive training to improve representation learning. Training follows a two-stage instruction-tuning method: the first stage applies contrastive training on retrieval datasets with in-batch negatives and curated hard negatives; the second stage blends non-retrieval tasks into instruction tuning, which improves both retrieval and non-retrieval accuracy.
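The first training stage described above can be illustrated with a generic InfoNCE-style contrastive loss. This is a minimal sketch of the general technique, not NVIDIA's training code: the function names, the cosine-similarity scoring, and the temperature of 0.05 are all assumptions chosen for illustration. Each query is paired with its own positive, every other positive in the batch serves as an in-batch negative, and optional hard negatives are appended per query.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def contrastive_loss(queries, positives, hard_negatives=None, temperature=0.05):
    """Mean InfoNCE loss over a batch (illustrative, not the paper's exact recipe).

    queries[i] is paired with positives[i]; the other positives in the batch
    act as in-batch negatives. hard_negatives[i], if given, is a list of extra
    negative vectors for query i. temperature=0.05 is an assumed value.
    """
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity to every positive in the batch, scaled by temperature.
        logits = [dot(qi, pj) / temperature for pj in p]
        if hard_negatives is not None:
            logits += [dot(qi, normalize(n)) / temperature for n in hard_negatives[i]]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(q)

# Toy batch: two queries whose positives point in the same direction.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(queries, positives)
swapped = contrastive_loss(queries, positives[::-1])
```

With the toy batch, the aligned pairing yields a loss near zero while the swapped pairing is heavily penalized, which is the behavior the contrastive objective is meant to produce.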

Benchmark Performance

As of August 30, 2024, NV-Embed-v2 held the No. 1 position on the Massive Text Embedding Benchmark (MTEB) leaderboard with an overall score of 72.31 across 56 tasks. It also ranked No. 1 in the retrieval sub-category with a score of 62.65 across 15 tasks, making it a strong foundation for retrieval-augmented generation (RAG) pipelines. On the AIR Benchmark, which covers out-of-domain information retrieval topics beyond MTEB, it achieved the highest scores in the Long Doc section and the second-highest scores in the QA section.

Technical Highlights

Pooling: Latent-Attention (embedding dimension: 4096).
Training: Hard-negative mining methods that use positive relevance scores to remove false negatives.
Acceptance: The underlying research paper (NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models) was accepted at ICLR 2025 as a Spotlight paper.
Compression: The paper provides analysis of model compression techniques (pruning, knowledge distillation, quantization) for generalist embedding models.
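The hard-negative mining idea in the Training highlight can be sketched as a filter that discards candidates scoring too close to the known positive, since those are likely unlabeled positives (false negatives). This is an illustrative sketch only: the function name, the `max_ratio=0.95` threshold, and `top_k=4` are assumptions, and the scores are presumed to come from some teacher retriever that is not shown here.

```python
def filter_hard_negatives(positive_score, candidates, max_ratio=0.95, top_k=4):
    """Select hard negatives while dropping likely false negatives.

    candidates: list of (passage_id, score) pairs from a teacher retriever.
    A candidate is kept only if its score is below max_ratio * positive_score;
    the top_k highest-scoring survivors are returned. max_ratio and top_k are
    assumed values for illustration.
    """
    threshold = max_ratio * positive_score
    kept = [(pid, score) for pid, score in candidates if score < threshold]
    # Hardest (highest-scoring) remaining negatives first.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]

# Toy example: "a" and "d" score almost as high as the positive, so they are dropped.
candidates = [("a", 0.79), ("b", 0.70), ("c", 0.40), ("d", 0.78)]
selected = filter_hard_negatives(positive_score=0.80, candidates=candidates)
```

Anchoring the threshold to the positive's score, rather than using a fixed cutoff, lets the filter adapt to queries whose relevance scores are uniformly high or low.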

For detailed technical descriptions, refer to the paper: NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models.

best for

·Dense retrieval for RAG systems
·Semantic textual similarity and clustering
·Long-document embedding and retrieval

FAQ

What is the embedding dimension of NV-Embed v2?

The embedding dimension is 4096.
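For capacity planning, the dimension translates directly into storage per vector. The arithmetic below is a rough sketch that assumes uncompressed float32 values (4 bytes each) and ignores any index overhead; the function name is hypothetical.

```python
EMBEDDING_DIM = 4096  # embedding dimension stated in the FAQ

def index_size_bytes(num_vectors, bytes_per_value=4, dim=EMBEDDING_DIM):
    """Raw storage for num_vectors embeddings, excluding index overhead.

    bytes_per_value=4 assumes float32; pass 2 for float16.
    """
    return num_vectors * dim * bytes_per_value

per_vector = index_size_bytes(1)             # 16,384 bytes (16 KiB)
one_million = index_size_bytes(1_000_000)    # 16,384,000,000 bytes (about 16.4 GB)
```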

What is the maximum sequence length for input?

The model supports a maximum sequence length of 32,768 tokens.

Can I use NV-Embed v2 for commercial purposes?

No, the model is licensed under CC-BY-NC-4.0 and cannot be used for commercial purposes. For commercial use, NVIDIA recommends NeMo Retriever NIMs.

How do I call NV-Embed v2 via the gigarouter API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with your API key, sending the input text with the appropriate instruction prefix for the task.
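A request to an OpenAI-compatible embeddings endpoint might be assembled as in the sketch below. Several things here are assumptions rather than documented values: `BASE_URL` is a placeholder (the real gigarouter URL is not given on this page), the model identifier is assumed to match the repository name, and the `Instruct: ...\nQuery: ...` prefix follows the convention shown in the upstream model card and should be verified before use. Only the request shape (`POST /embeddings` with `model` and `input`, bearer-token auth) is the standard OpenAI-compatible format.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid/v1"  # hypothetical placeholder, not the real endpoint
MODEL_ID = "nvidia/NV-Embed-v2"              # assumed to match the repository name

def with_instruction(task, query):
    """Prefix a query with a task instruction (assumed prefix format)."""
    return f"Instruct: {task}\nQuery: {query}"

def build_embedding_request(texts, api_key, base_url=BASE_URL, model=MODEL_ID):
    """Build, but do not send, an OpenAI-style embeddings request."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def embed(texts, api_key):
    """Send the request and return embeddings in input order."""
    request = build_embedding_request(texts, api_key)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    ordered = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]

# Queries carry the instruction prefix; passages are typically sent without one.
query = with_instruction(
    "Given a question, retrieve passages that answer the question",
    "What is latent-attention pooling?",
)
request = build_embedding_request([query], api_key="YOUR_API_KEY")
```

Building the request separately from sending it makes the payload easy to inspect while the hosted endpoint is not yet available.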

What makes NV-Embed v2 different from other embedding models?

It uses a latent-attention pooling layer instead of mean or last-token pooling, removes the causal attention mask during contrastive training, and employs a two-stage instruction-tuning method with novel hard-negative mining.
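The difference between the three pooling strategies can be seen in a toy sketch. The latent-attention version below is heavily simplified and rests on assumptions: token hidden states act as queries over a small latent array that serves as both keys and values, the result is mean-pooled, and the scores are scaled by the square root of the dimension. The multi-head structure and the MLP that the actual layer may include are omitted, and the latent values here are made up rather than learned.

```python
import math

def mean_pool(hidden):
    """Average the token states position-wise."""
    count = len(hidden)
    return [sum(column) / count for column in zip(*hidden)]

def last_token_pool(hidden):
    """Use the final token's state as the sequence embedding."""
    return list(hidden[-1])

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def latent_attention_pool(hidden, latents):
    """Simplified single-head latent attention followed by mean pooling.

    hidden: token states, used as queries.
    latents: latent array, used as both keys and values (learned in a real
    model; fixed toy values here).
    """
    dim = len(latents[0])
    attended = []
    for token in hidden:
        scores = [
            sum(t * l for t, l in zip(token, latent)) / math.sqrt(dim)
            for latent in latents
        ]
        weights = softmax(scores)
        # Each token becomes a weighted mix of the latent vectors.
        attended.append(
            [sum(w * latent[k] for w, latent in zip(weights, latents)) for k in range(dim)]
        )
    return mean_pool(attended)

hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token states
latents = [[1.0, 0.0], [0.0, 1.0]]             # two toy latent vectors
pooled_mean = mean_pool(hidden)
pooled_last = last_token_pool(hidden)
pooled_latent = latent_attention_pool(hidden, latents)
```

Mean pooling weights every token equally and last-token pooling depends on a single position, whereas the latent-attention variant re-expresses each token through a shared latent array before averaging.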

not yet live

We're benchmarking and onboarding NV-Embed v2 as a hosted, OpenAI-compatible API.

related embedding models


nomic-embed-text-v1.5