Llama Nemotron Rerank VL 1B V2
nvidia/llama-nemotron-rerank-vl-1b-v2
published Dec 2025 · updated May 2026
Llama Nemotron Rerank VL 1B V2 is a multimodal reranking model that scores relevance between a text query and document pages (as images, text, or both) for vision RAG pipelines.
specs
| Task | Reranking / Multimodal Retrieval |
| Architecture | Cross-encoder with Eagle VLM (SigLIP 2 400M vision encoder + Llama 3.2 1B language model) |
| Parameters | ~1.7B |
| License | NVIDIA Open Model License |
about this model
NVIDIA Llama-Nemotron-Rerank-VL-1B-V2 is a multimodal cross-encoder reranking model that assigns a relevance score, expressed as a raw logit, to a document page (image, text, or both) for a given text query. It is designed to improve the accuracy of vision RAG pipelines by reordering the top candidates retrieved by dense embedding models.
Architecture and Capabilities
The model combines a SigLIP 2 400M vision encoder with a Llama 3.2 1B language model (Eagle 2 architecture, ~1.7B parameters). It processes documents as images of pages, slides, tables, charts, or infographics, and also supports text-only and image+text inputs. A mean-pooling aggregation layer and a binary classification head are fine-tuned with cross-entropy loss to maximize the likelihood of relevant documents. Following Eagle 2, the architecture uses dynamic tiling and a mixture of vision encoders to improve understanding of high-resolution and complex visual content.
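The pooling and classification step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not NVIDIA's implementation: the toy hidden states, the weight vector, the bias, and the function names (`mean_pool`, `relevance_logit`, `binary_cross_entropy`) are all hypothetical, and a real model would operate on tensors with a much larger hidden size.

```python
import math

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]

def relevance_logit(pooled, weight, bias):
    """Binary classification head: one linear layer producing a single logit."""
    return sum(p * w for p, w in zip(pooled, weight)) + bias

def binary_cross_entropy(logit, label):
    """Cross-entropy for a relevant (1) / not relevant (0) label, from a raw logit."""
    # softplus(x) = log(1 + exp(x)), written in a numerically stable form
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit

# Toy (hypothetical) values: three tokens of dimension 2, the last one is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(hidden, mask)
logit = relevance_logit(pooled, weight=[0.5, -0.25], bias=0.1)
loss = binary_cross_entropy(logit, label=1)
```

Minimizing this loss on (query, relevant page) pairs pushes their logits up, which is one way to read "maximize the likelihood of relevant documents".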
Key Strengths
- Deep cross-attention between query and document tokens for more accurate relevance scoring than embedding-based similarity.
- Multimodal input flexibility: accepts images, text, or combined image-text documents.
- Optimized for deployment as a reranker in a retrieval pipeline, complementing multimodal embedding models.
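The reranker's place in a retrieval pipeline can be sketched as follows. This is an assumption-laden outline rather than a documented API: `rerank` and `toy_score` are hypothetical names, and `toy_score` is a simple word-overlap placeholder standing in for a call to the cross-encoder, which would return one logit per (query, document) pair.

```python
from typing import Callable, List, Sequence, Tuple

def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score each first-stage candidate against the query and keep the best top_k."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    # Higher score means more relevant, so sort in descending order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def toy_score(query: str, doc: str) -> float:
    """Placeholder scorer (NOT the model): counts query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return float(len(query_words & doc_words))

# Candidates as they might come back from an embedding-based retriever.
candidates = [
    "quarterly revenue chart",
    "team offsite photos",
    "revenue table by region",
]
top = rerank("revenue by region", candidates, toy_score, top_k=2)
```

Because the cross-encoder scores every (query, candidate) pair jointly, it is usually applied only to the short candidate list from the first stage rather than to the whole corpus.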
Evaluation
The model was evaluated on the ViDoRe V1, V2, and V3 multimodal retrieval benchmarks (ViDoRe Leaderboard) and on two internally curated visual retrieval datasets. Outputs are raw logits; users may apply a sigmoid function to obtain probabilities.
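Converting raw logits to probabilities takes only a few lines. The sketch below uses the Python standard library; the logit values are made up for illustration, and the branch on the sign of the input is there only to avoid overflow for large negative logits.

```python
import math

def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability in [0, 1] without overflowing exp()."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)

# Hypothetical raw scores for three candidate pages.
logits = [2.0, 0.0, -3.0]
probs = [sigmoid(x) for x in logits]
```

Sigmoid is monotonic, so applying it does not change the ranking; it matters only when a calibrated-looking score or a fixed threshold is needed.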
Licensing
Use is governed by the NVIDIA Open Model License and the Llama 3.2 Community License; built with Llama.
best for
- Multimodal document reranking in RAG pipelines
- Reordering candidate documents from embedding-based retrieval
- Question-answering over visual documents (slides, charts, tables)
FAQ
The model supports image, text, or combined image+text document input; the query is always text.
The model processes visual content such as screenshots of slides, tables, charts, and infographics, enabling multimodal relevance scoring.
The model outputs a list of raw logits (floats), one per scored document. Users can apply a sigmoid activation to convert them into probabilities.
To call the model, use the gigarouter OpenAI-compatible endpoint with an API key. Refer to the gigarouter documentation for endpoint details and example requests.
The model is governed by the NVIDIA Open Model License Agreement. Post-processing scripts are licensed under Apache 2.0.
We're benchmarking and onboarding Llama Nemotron Rerank VL 1B V2 as a hosted, OpenAI-compatible API.