Llama Nemotron Rerank VL 1B V2

nvidia/llama-nemotron-rerank-vl-1b-v2

published Dec 2025 · updated May 2026

Llama Nemotron Rerank VL 1B V2 is a multimodal reranking model that scores relevance between a text query and document pages (as images, text, or both) for vision RAG pipelines.

est. price: ~$0.008 / 1k docs (estimated, set at launch)
API providers: 0
downloads / mo: 99.7K
license: other

specs

Task: Reranking / Multimodal Retrieval
Architecture: Cross-encoder with Eagle VLM (SigLIP 2 400M vision encoder + Llama 3.2 1B language model)
Parameters: ~1.7B
License: NVIDIA Open Model License

about this model

NVIDIA Llama-Nemotron-Rerank-VL-1B-V2 is a multimodal cross-encoder reranking model. For a given text query, it assigns a relevance logit to a document page supplied as an image, as text, or as both. It is designed to improve the accuracy of vision RAG pipelines by reordering the top candidates retrieved by dense embedding models.

Architecture and Capabilities

The model combines a SigLIP 2 400M vision encoder with a Llama 3.2 1B language model (Eagle 2 architecture, ~1.7B parameters). It processes documents as images of pages, slides, tables, charts, or infographics, and also supports text-only and image+text inputs. A mean pooling aggregation and a binary classification head are fine-tuned with cross-entropy loss to maximize the likelihood of relevant documents. The Eagle 2 architecture contributes dynamic tiling and a mixture of vision encoders, which improve understanding of high-resolution and complex visual content.
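To make the pooling and classification step concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not NVIDIA's implementation: it assumes the head is a single linear layer producing one logit and that the loss is binary cross-entropy on that logit. The names `mean_pool`, `relevance_logit`, and `binary_cross_entropy`, and all the numbers, are hypothetical.

```python
import math

def mean_pool(token_states, mask):
    """Average the hidden states of non-padding tokens (mask value 1)."""
    dim = len(token_states[0])
    kept = [vec for vec, m in zip(token_states, mask) if m]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def relevance_logit(pooled, weights, bias):
    """Assumed linear head: maps the pooled vector to one relevance logit."""
    return sum(w * x for w, x in zip(weights, pooled)) + bias

def binary_cross_entropy(logit, label):
    """Numerically stable BCE on a raw logit; label is 1 (relevant) or 0."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# Toy input: three tokens with 2-dim hidden states; the last token is padding.
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(states, mask)                     # padding token is ignored
logit = relevance_logit(pooled, [0.5, -0.25], 0.1)   # made-up head weights
loss_if_relevant = binary_cross_entropy(logit, 1)
loss_if_irrelevant = binary_cross_entropy(logit, 0)
```

With a positive logit, the loss is lower when the pair is labeled relevant than when it is labeled irrelevant, which is the signal fine-tuning uses to push relevant pairs toward higher scores.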

Key Strengths

  • Joint attention over query and document tokens in a single forward pass, which gives more accurate relevance scoring than comparing independently computed embeddings.
  • Multimodal input flexibility: accepts images, text, or combined image-text documents.
  • Optimized for deployment as a reranker in a retrieval pipeline, complementing multimodal embedding models.
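The reranking stage described in the list above can be sketched as follows. This is a hedged illustration of the general retrieve-then-rerank pattern, not gigarouter or NVIDIA code: `rerank` is a hypothetical helper, and `toy_score` is a word-overlap placeholder standing in for the model's relevance score so the example runs without the model.

```python
from typing import Callable, List, Sequence, Tuple

def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the top_k best."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    # Higher score means more relevant, so sort in descending order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def toy_score(query: str, doc: str) -> float:
    """Placeholder scorer (shared-word count). NOT the reranker model."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

# Candidates as they might come back from a first-stage embedding retriever.
pages = ["quarterly revenue chart", "team photo", "revenue table by quarter"]
top = rerank("revenue by quarter", pages, toy_score, top_k=2)
```

In a real pipeline, `score_fn` would call the reranker on each query and page pair, and only the small candidate set from the embedding stage would be scored, since a cross-encoder runs one forward pass per pair.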

Evaluation

The model was evaluated on the ViDoRe V1, V2, and V3 multimodal retrieval benchmarks (ViDoRe leaderboard) and on two internally curated visual retrieval datasets. Outputs are raw logits; users may apply a sigmoid to obtain probabilities.

Licensing

Use is governed by the NVIDIA Open Model License and the Llama 3.2 Community License; built with Llama.

FAQ

What input types does the model support?

It supports image, text, or image+text combined as document input, with the query always in text form.

How is this model different from text-only rerankers?

It processes visual content such as screenshots of slides, tables, charts, and infographics, enabling multimodal relevance scoring.

What is the output format?

The model outputs a list of raw logits (floats), one per scored document. Users can apply a sigmoid activation to convert them into probabilities.
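As a small illustration of that conversion, the sketch below applies a numerically stable sigmoid to a list of logits. The logit values are made up for the example, and `sigmoid` is a hypothetical helper written here rather than part of any model API.

```python
import math

def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability in [0, 1] without overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    # For negative inputs, exp(logit) is small, so this form cannot overflow.
    z = math.exp(logit)
    return z / (1.0 + z)

# Made-up logits for three candidate pages.
logits = [2.0, 0.0, -3.5]
probs = [sigmoid(x) for x in logits]
```

Because the sigmoid is monotonic, ranking by probability gives the same order as ranking by raw logit; the conversion matters mainly when a fixed relevance threshold is needed.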

How can I call this model via the API?

The model is not yet live on gigarouter. Once it launches, use the gigarouter OpenAI-compatible endpoint with an API key. Refer to the gigarouter documentation for endpoint details and example requests.

What license applies to this model?

The model is governed by the NVIDIA Open Model License Agreement and the Llama 3.2 Community License. Post-processing scripts are licensed under Apache 2.0.

not yet live

We're benchmarking and onboarding Llama Nemotron Rerank VL 1B V2 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.