Llama Nemotron Rerank 1B

nvidia/llama-nemotron-rerank-1b-v2

published Oct 2025 · updated May 2026

Llama Nemotron Rerank 1B is a rerank model that provides logit scores for relevance between a query and documents, optimized for multilingual and cross-lingual retrieval with support for long documents up to 8192 tokens.

est. price

~$0.008 / 1k docs · estimated, set at launch

downloads / mo

231K

license

other

specs

Task	Reranking
Architecture	Transformer cross-encoder fine-tuned from Llama 3.2-1B
Parameters	1B
Max Sequence Length	8192 tokens
License	NVIDIA Open Model License & Llama 3.2 Community License

about this model

Llama Nemotron Reranking 1B (v2) is a multilingual cross‑encoder reranking model that produces relevance logit scores for query‑document pairs, supporting sequences up to 8192 tokens.

Architecture and Training

Fine‑tuned from meta-llama/Llama-3.2-1B, the model uses contrastive learning with bi‑directional attention, mean pooling over the decoder’s last hidden state, and a binary classification head. It was trained on 800k samples from public QA datasets that carry commercial‑use licenses (excluding MS MARCO due to licensing restrictions).
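The pooling and scoring steps described above can be illustrated with a minimal sketch. It is not NVIDIA's implementation: the function names, the toy vectors, and the assumption that the classification head is a single-output linear layer are all hypothetical, chosen only to show how masked mean pooling feeds one relevance logit.

```python
from typing import List


def mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def relevance_logit(pooled: List[float], weight: List[float], bias: float) -> float:
    """Assumed head shape: one linear output, i.e. one logit per pair."""
    return sum(p * w for p, w in zip(pooled, weight)) + bias


# Toy last-hidden-state: 3 tokens of width 2; the last token is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(hidden, mask)                     # padding row is excluded
logit = relevance_logit(pooled, [0.5, -1.0], 0.25)   # made-up head weights
print(pooled, logit)
```

In a real model the hidden states come from the transformer and the head weights are learned; only the shape of the computation carries over.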

Evaluation Results

When paired with the 1B Llama Nemotron embedding model, the reranker delivers high accuracy on the BEIR+TechQA benchmarks. It supports 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. The model is 3.5× smaller than nv‑rerankqa‑mistral‑4b‑v3, making it a compact alternative for production retrieval pipelines.

Integration and Use

As a component in a retrieval‑augmented generation (RAG) system, this reranker typically follows an embedding‑based or lexical retriever. As a cross‑encoder, it processes the query and each candidate document together in a single input sequence, so attention spans both texts, and it produces one relevance score per pair. The scores can be converted to probabilities with a sigmoid function. The model is commercially ready and is part of the NeMo Retriever NIM microservice collection.
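The retrieve-then-rerank step can be sketched as follows. This is an illustration under stated assumptions: `rerank` is a hypothetical helper, and the logits are made-up values standing in for the model's output, not real scores.

```python
from typing import List, Tuple


def rerank(candidates: List[str], logits: List[float], top_k: int) -> List[Tuple[str, float]]:
    """Sort retriever candidates by reranker logit, highest first, and keep top_k."""
    if len(candidates) != len(logits):
        raise ValueError("need exactly one logit per candidate")
    ranked = sorted(zip(candidates, logits), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Candidates as returned by a first-stage retriever (order not yet trusted).
candidates = ["doc A", "doc B", "doc C"]
logits = [-1.2, 3.4, 0.7]  # placeholder scores, one per (query, candidate) pair

top = rerank(candidates, logits, top_k=2)
print(top)
```

Sorting on raw logits is enough for ranking, because the sigmoid is monotonic and does not change the order.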

best for

·Multilingual document retrieval reranking in RAG pipelines
·Enterprise search (IT, HR help assistants)
·Research and development assistants

FAQ

What is the input format for this model?

Input is a list of text pairs (query and document) formatted as "question: [query] \n \n passage: [document]".
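A minimal sketch of building inputs in that format is shown here. It assumes that `\n` in the template denotes a newline character and that the spacing is exactly as printed above; `format_pair` and `build_inputs` are hypothetical helper names, not part of any library.

```python
from typing import List


def format_pair(query: str, document: str) -> str:
    """Render one query-document pair using the template quoted in this FAQ."""
    return f"question: {query} \n \n passage: {document}"


def build_inputs(query: str, documents: List[str]) -> List[str]:
    """One formatted string per candidate document, in the original order."""
    return [format_pair(query, doc) for doc in documents]


inputs = build_inputs(
    "how do I reset my password?",
    ["Open Settings and choose Reset password.", "Our office is closed on Sundays."],
)
for text in inputs:
    print(repr(text))
```

Check the exact whitespace against the model's own documentation before relying on it, since a template mismatch could affect scores.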

What does the model output?

It outputs raw logit scores (floats) that represent relevance. These can be converted to probabilities with a sigmoid function.
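The conversion can be sketched in a few lines. This is an illustrative helper rather than part of the model's API: the function name and the example logits are assumptions, and the two-branch form is only there to avoid floating-point overflow on large negative logits.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a value in [0, 1] without overflowing math.exp."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)  # safe: logit < 0, so e is in (0, 1)
    return e / (1.0 + e)


logits = [-2.0, 0.0, 2.0]  # made-up reranker outputs
probs = [sigmoid(x) for x in logits]
print(probs)
```

A logit of 0 maps to 0.5, and larger logits map to values closer to 1. Whether a given probability counts as "relevant" depends on a threshold chosen for the application.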

How does this model compare to larger rerankers?

It is 3.5× smaller than the nv-rerankqa-mistral-4b-v3 model, offering faster inference while maintaining high accuracy.

How can I call this model via the gigarouter API?

The hosted endpoint is not yet live. Once it is available, use the OpenAI-compatible endpoint with your gigarouter API key and pass the model ID "nvidia/llama-nemotron-rerank-1b-v2".

not yet live

Llama Nemotron Rerank 1B is being benchmarked and onboarded as a hosted, OpenAI-compatible API.

related reranker models

ms-marco-MiniLM-L6-v2

81.5M dl/mo · live

ms-marco-MiniLM-L4-v2

4.8M dl/mo

gte-reranker-modernbert-base

2.7M dl/mo

ms-marco-MiniLM-L12-v2

2.3M dl/mo

jina-reranker-v2-base-multilingual

1.8M dl/mo · live

Qwen3-Reranker-4B

1.8M dl/mo