Llama Nemotron Rerank 1B
nvidia/llama-nemotron-rerank-1b-v2
published Oct 2025 · updated May 2026
Llama Nemotron Rerank 1B is a rerank model that provides logit scores for relevance between a query and documents, optimized for multilingual and cross-lingual retrieval with support for long documents up to 8192 tokens.
specs
| Task | Reranking |
| Architecture | Transformer cross-encoder fine-tuned from Llama 3.2-1B |
| Parameters | 1B |
| Max Sequence Length | 8192 tokens |
| License | NVIDIA Open Model License & Llama 3.2 Community License |
about this model
Llama Nemotron Rerank 1B (v2) is a multilingual cross-encoder reranking model that produces relevance logit scores for query-document pairs and supports sequences of up to 8192 tokens.
Architecture and Training
Fine‑tuned from meta-llama/Llama-3.2-1B, the model uses contrastive learning with bi‑directional attention, mean pooling over the decoder’s last hidden state, and a binary classification head. It was trained on 800k samples from public QA datasets that carry commercial‑use licenses (excluding MS MARCO due to licensing restrictions).
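The pooling and scoring steps named above can be illustrated with a minimal pure-Python sketch. It shows mean pooling over the last hidden states followed by a linear head that emits one logit. Skipping padded positions via an attention mask, the single-output shape of the head, and the names `masked_mean_pool` and `relevance_logit` are all assumptions for illustration; the numbers are toy values, not model weights.

```python
from typing import List


def masked_mean_pool(hidden_states: List[List[float]],
                     attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask value is 1 (assumed: padding is 0)."""
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have the same length")
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def relevance_logit(pooled: List[float], weights: List[float], bias: float) -> float:
    """Hypothetical single-output linear head: one logit per query-document pair."""
    return sum(p * w for p, w in zip(pooled, weights)) + bias


# Toy input: 3 token positions, hidden size 2, last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean_pool(hidden, mask)
logit = relevance_logit(pooled, weights=[0.5, -1.0], bias=0.25)
print(pooled, logit)
```

The padded position is excluded from the average, so its large values do not affect the pooled vector.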
Evaluation Results
When paired with the Llama Nemotron embedding model (1B), the reranker delivers high accuracy on BEIR+TechQA benchmarks while supporting 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. The model is 3.5× smaller than nv-rerankqa-mistral-4b-v3, making it a compact alternative for production retrieval pipelines.
Integration and Use
As a component in a retrieval‑augmented generation (RAG) system, this reranker typically follows an embedding‑based or lexical retriever. It applies cross‑attention between the query and each candidate document to produce scores, which can be converted to probabilities via a sigmoid function. The model is commercially ready and is part of the NeMo Retriever NIM microservice collection.
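The second-stage role described above can be sketched as follows: a first-stage retriever returns candidates, the reranker scores each query-document pair, and the list is reordered by score. The function `rerank`, the parameter `top_k`, and the word-overlap `toy_score` are hypothetical stand-ins for illustration; in a real pipeline `score_fn` would call the model and return its logit.

```python
from typing import Callable, List, Tuple


def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and return the top_k by descending score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_score(query: str, doc: str) -> float:
    """Stand-in scorer (NOT the model): number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


candidates = [
    "Cafeteria menu for Monday",
    "VPN setup guide",
    "How to reset your VPN password",
]
top = rerank("reset vpn password", candidates, toy_score, top_k=2)
for doc, score in top:
    print(f"{score:.1f}  {doc}")
```

Because each pair is scored independently, the cost grows linearly with the number of candidates, which is why a reranker is usually applied only to a short list from the retriever.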
best for
- Multilingual document retrieval reranking in RAG pipelines
- Enterprise search (IT, HR help assistants)
- Research and development (R&D) assistants
FAQ
Input is a list of text pairs (query and document) formatted as "question: [query] \n \n passage: [document]".
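A minimal sketch of building those input strings is shown below. It assumes that the `\n` sequences in the template are newline characters and that the bracketed placeholders are replaced verbatim; the names `PAIR_TEMPLATE` and `format_pairs` are hypothetical.

```python
from typing import List

# Assumed reading of the template: "\n" is a newline character, with the
# surrounding spaces kept exactly as written in the format string.
PAIR_TEMPLATE = "question: {query} \n \n passage: {document}"


def format_pairs(query: str, documents: List[str]) -> List[str]:
    """Build one formatted input string per candidate document."""
    return [PAIR_TEMPLATE.format(query=query, document=doc) for doc in documents]


pairs = format_pairs(
    "what is RAG?",
    ["RAG combines retrieval and generation.", "Unrelated text."],
)
for pair in pairs:
    print(repr(pair))
```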
The model outputs raw logit scores (floats) representing relevance; these can be converted to probabilities with a sigmoid function.
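The logit-to-probability conversion is the standard sigmoid, 1 / (1 + e^(-x)). The sketch below uses a numerically stable form; the function name and the example logits are illustrative, not actual model output.

```python
import math
from typing import List


def sigmoid(logit: float) -> float:
    """Map a raw logit to (0, 1); branch on sign to avoid overflow in exp()."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def to_probabilities(logits: List[float]) -> List[float]:
    """Apply the sigmoid to each score independently."""
    return [sigmoid(x) for x in logits]


example_logits = [4.2, 0.0, -3.1]  # made-up values for illustration
probs = to_probabilities(example_logits)
print([round(p, 3) for p in probs])
```

The sigmoid is monotonic, so it does not change the ranking order. It only matters when a score needs to be compared against a fixed threshold.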
The model is 3.5× smaller than nv-rerankqa-mistral-4b-v3, offering faster inference while maintaining high accuracy.
Use the OpenAI-compatible endpoint with your gigarouter API key and pass the model ID "nvidia/llama-nemotron-rerank-1b-v2".
We're benchmarking and onboarding Llama Nemotron Rerank 1B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.