Llama Nemotron Embed 1B V2
nvidia/llama-nemotron-embed-1b-v2
published Oct 2025 · updated May 2026
Llama Nemotron Embed 1B V2 is a multilingual embedding model optimized for question-answering retrieval with support for long documents up to 8192 tokens and dynamic embedding sizes.
specs
| Spec | Value |
|---|---|
| Task | Embedding |
| Architecture | Fine-tuned Llama 3.2 1B Transformer |
| Parameters | 1B |
| Context Length | 8192 tokens |
| License | NVIDIA Open Model License |
about this model
llama-nemotron-embed-1b-v2 is an embedding model optimized for multilingual and cross-lingual question-answering retrieval. It supports documents up to 8192 tokens and dynamic embedding sizes (Matryoshka embeddings). It is a fine-tuned version of Llama 3.2 1B, used as a transformer encoder and trained in a bi-encoder setup with contrastive learning.
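To illustrate the bi-encoder contrastive setup, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives. Queries and passages are embedded independently, each query is scored against every passage in the batch, and cross-entropy pulls it toward its own paired passage. The page does not specify the exact loss, the temperature (0.05 here), or how negatives are mined, so treat those as assumptions; `cosine` and `info_nce_loss` are hypothetical helper names.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce_loss(queries: list[list[float]],
                  passages: list[list[float]],
                  temperature: float = 0.05) -> float:
    """In-batch-negative InfoNCE loss (a generic sketch, not NVIDIA's recipe).

    queries[i] is paired with passages[i] (its positive); every other
    passage in the batch acts as a negative for query i.
    The default temperature is an assumption; the page does not state it.
    """
    total = 0.0
    for i, q in enumerate(queries):
        # Temperature-scaled similarity logits against every passage in the batch.
        logits = [cosine(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over all passages.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching passage as the target class.
        total += log_denominator - logits[i]
    return total / len(queries)
```

Correctly aligned query/passage pairs give a loss near zero, while shuffling the passages so each query's positive points elsewhere drives the loss up, which is the signal that shapes the embedding space during training.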
Key Capabilities
- Multilingual support across 26 languages, including English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish.
- Context length of up to 8192 tokens, so long passages can be embedded whole or in far fewer chunks; documents longer than 8192 tokens still need to be split or truncated.
- Dynamic embedding size configurable to 384, 512, 768, 1024, or 2048, reducing storage footprint by up to 35× compared to fixed-size embeddings.
- Trained on a blend of public QA datasets with commercial-friendly licenses, using 12M samples for semi-supervised pre-training and 1M samples for fine-tuning.
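Matryoshka embeddings are trained so that a prefix of the full vector is itself a usable embedding. The following is a minimal client-side sketch. It assumes that the smaller sizes correspond to the first 384 to 1024 components of the 2048-dimensional vector, as in standard Matryoshka training, and that vectors are stored as float32. The sketch truncates, re-normalizes, and estimates raw storage. The helper names are hypothetical.

```python
import math

# Output sizes listed on this page.
SUPPORTED_DIMS = (384, 512, 768, 1024, 2048)
# Assumption: vectors are stored as 32-bit floats.
BYTES_PER_FLOAT32 = 4

def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and re-normalize to unit length.

    Assumes Matryoshka-style nesting: a prefix of the full vector is a valid
    lower-dimensional embedding.
    """
    if dim not in SUPPORTED_DIMS:
        raise ValueError(f"dim must be one of {SUPPORTED_DIMS}, got {dim}")
    if len(vec) < dim:
        raise ValueError("vector is shorter than the requested dimension")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    # Re-normalize so cosine similarity and dot product stay interchangeable.
    return [x / norm for x in head] if norm > 0 else head

def float32_bytes(dim: int, n_vectors: int = 1) -> int:
    """Raw storage for n float32 vectors of a given dimension (no index overhead)."""
    return dim * BYTES_PER_FLOAT32 * n_vectors
```

For one million vectors, `float32_bytes(2048, 1_000_000)` gives 8,192,000,000 bytes and `float32_bytes(384, 1_000_000)` gives 1,536,000,000 bytes. The dimension choice alone therefore shrinks raw float32 storage by 2048/384 ≈ 5.3×. Lower-precision storage or index compression, if used, would compound that and is outside this sketch.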
Evaluation Results
The model was evaluated offline on A100 GPUs using PyTorch checkpoints. The table below reports Recall@5 averaged over four QA benchmarks (NQ, HotpotQA, FiQA, TechQA).
| Open & Commercial Retrieval Models | Average Recall@5 (NQ, HotpotQA, FiQA, TechQA) |
|---|---|
| llama-nemotron-embed-1b-v2 (embedding dim 2048) | 68.60% |
| llama-nemotron-embed-1b-v2 (embedding dim 384) | 64.48% |
| llama-3.2-nv-embedqa-1b-v1 (embedding dim 2048) | 68.97% |
| nv-embedqa-mistral-7b-v2 | 72.97% |
| nv-embedqa-mistral-7B-v1 | 64.93% |
| nv-embedqa-e5-v5 | 62.07% |
| nv-embedqa-e5-v4 | 57.65% |
| e5-large-unsupervised | 48.03% |
| BM25 | 44.67% |
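Recall@5 measures how many of a query's relevant documents appear among the top five retrieved results. The page does not spell out its exact definition or how the four benchmarks are combined. The sketch below therefore assumes per-query recall, averaged over each benchmark's queries, followed by an unweighted mean across benchmarks. The function names are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of a query's relevant documents found in its top-k results."""
    if not relevant_ids:
        raise ValueError("query has no relevant documents")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs: list[tuple[list[str], list[str]]], k: int = 5) -> float:
    """Average per-query Recall@k over one benchmark; runs = [(retrieved, relevant), ...]."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean across benchmarks (an assumption about how the table averages)."""
    return sum(scores.values()) / len(scores)
```

Under these assumptions, the "Average Recall@5" column would be `macro_average` over the four per-benchmark `mean_recall_at_k` scores, with each benchmark weighted equally regardless of its query count.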
The model is part of the NVIDIA NeMo Retriever collection, designed for production-ready information retrieval pipelines with enterprise support. It is available as a hosted, OpenAI-compatible API through gigarouter, eliminating the need for local installation or hardware management.
best for
- Multilingual question-answering over large text corpora
- Retrieval-Augmented Generation (RAG) pipelines
- Cross-lingual document retrieval
FAQ
What is the model designed for?
It is designed for multilingual and cross-lingual text retrieval, particularly question-answering over large document collections.
How do I call it?
Use the OpenAI-compatible endpoint with your API key, sending the text input as a list of strings.
What are the input and output formats?
Input is a list of strings. Output is a list of float arrays, one embedding vector per input string, with a configurable dimension of 384, 512, 768, 1024, or 2048.
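On the output side, an OpenAI-compatible response carries each vector in `data[i].embedding`, together with an `index` that maps it back to its input string. The sketch below assumes that response shape. It restores input order, checks each vector's size against the documented options, and ranks passages for a query by cosine similarity, which is how the vectors would typically be used in retrieval. The helper names are hypothetical.

```python
import math

ALLOWED_DIMS = {384, 512, 768, 1024, 2048}

def parse_embeddings(body: dict) -> list[list[float]]:
    """Extract vectors from an OpenAI-style embeddings response, in input order."""
    # Assumption: response follows the OpenAI shape {"data": [{"index", "embedding"}, ...]}.
    items = sorted(body["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    for v in vectors:
        if len(v) not in ALLOWED_DIMS:
            raise ValueError(f"unexpected embedding size {len(v)}")
    return vectors

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; query and passage must use the same dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_passages(query_vec: list[float], passage_vecs: list[list[float]],
                  top_k: int = 5) -> list[tuple[int, float]]:
    """Return (passage_index, score) pairs, highest cosine similarity first."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Sorting by `index` rather than trusting list order keeps the mapping between inputs and vectors correct even if a server returns items out of order.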
Which languages does it support?
It was evaluated on 26 languages, including English, Arabic, Chinese, French, German, Japanese, and Spanish; the full list appears under Key Capabilities.
What license applies?
It is governed by the NVIDIA Open Model License Agreement, with additional terms from the Llama 3.2 Community License.
We're benchmarking and onboarding Llama Nemotron Embed 1B V2 as a hosted, OpenAI-compatible API.