GME Qwen2-VL 2B

Alibaba-NLP/gme-Qwen2-VL-2B-Instruct

published Dec 2024 · updated Jun 2025

GME Qwen2-VL 2B is a multimodal embedding model that converts text, images, or image-text pairs into unified vector representations for retrieval.

est. price

~$0.008

/ 1M tokens · estimated, set at launch

downloads / mo

9.1K

license

apache-2.0

specs

Task	Multimodal Embedding
Architecture	Qwen2-VL-based MLLM
Parameters	2.21B
Output Dimension	1536
Max Sequence Length	32768
License	apache-2.0

about this model

GME-Qwen2-VL-2B-Instruct is a multimodal embedding model that produces unified vector representations from text, image, or combined image-text inputs, enabling any-to-any retrieval across modalities. Developed by Tongyi Lab at Alibaba Group, it is based on the Qwen2-VL multimodal large language model and supports dynamic image resolution. Its maximum sequence length is 32,768 tokens and its embedding dimension is 1,536.

Key Capabilities

The model processes three input types—text, image, and image-text pairs—and outputs a single embedding that can be used for text retrieval, image-to-text search, image-to-image search, and fused-modal search. This makes it suitable for tasks such as multimodal retrieval-augmented generation (RAG) on documents, visual document retrieval, and cross-modal search.
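
Because every input type maps into the same vector space, retrieval reduces to a nearest-neighbor search over embeddings. The following is a minimal sketch of that ranking step, assuming cosine similarity as the scoring function. The vectors are random placeholders standing in for real model output, and the helper names (`cosine_similarity`, `top_k`) are illustrative rather than part of any GME API.

```python
import math
import random

EMBED_DIM = 1536  # output dimension listed in the specs


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], corpus: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (corpus_index, score) pairs for the k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder embeddings (assumption): in practice these would come from the
# model, one vector per text, image, or image-text pair.
rng = random.Random(0)
corpus_embeddings = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(5)]

# A query built as a slightly perturbed copy of corpus item 2, so item 2
# should rank first regardless of which modality produced either vector.
query_embedding = [x + 0.01 * rng.gauss(0, 1) for x in corpus_embeddings[2]]

results = top_k(query_embedding, corpus_embeddings, k=3)
```

The same ranking code serves text-to-image, image-to-text, and fused-modal search, since only the source of the vectors changes.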

Benchmark Performance

On the Universal Multimodal Retrieval Benchmark (UMRB), the 2.2B-parameter model achieves an average score of 64.45, outperforming larger models such as DSE (4.2B, 50.04) and E5-V (8.4B, 42.52). It scores 87.84 on text-to-visual-document retrieval and 76.47 on text-to-image-text retrieval. On the MTEB English leaderboard, it scores 65.27, and on MTEB Chinese, 66.92. The paper describing the model has been accepted to CVPR 2025.

Strengths

·Unified representation: A single model handles text-only, image-only, and fused inputs without separate encoders.
·Strong visual document understanding: Excels at retrieving document screenshots and academic papers, critical for multimodal RAG.
·Dynamic resolution: Adapts to varying image sizes without fixed resizing.

Limitations

The model accepts only a single image per input. It was trained on English data only, so multilingual multimodal performance is not guaranteed.

best for

·Multimodal retrieval-augmented generation (RAG) for documents with text and images
·Cross-modal search (text-to-image, image-to-text, image-to-image)
·Understanding and retrieving information from document screenshots

FAQ

What input modalities does the model support?

It supports text, image, and image-text pairs, producing a single unified embedding for any combination.

What is the output embedding size?

The output dimension is 1536.
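
The dimension determines how much storage a vector index needs. As a rough worked example, the sketch below estimates raw vector storage, assuming 4-byte float32 values and ignoring index overhead and metadata. The function name `index_size_bytes` is illustrative.

```python
EMBED_DIM = 1536  # output dimension listed in the specs


def index_size_bytes(num_vectors: int, bytes_per_value: int = 4) -> int:
    """Raw storage for num_vectors embeddings, excluding index overhead."""
    return num_vectors * EMBED_DIM * bytes_per_value


per_vector = index_size_bytes(1)                          # one float32 vector
one_million = index_size_bytes(1_000_000)                 # 1M float32 vectors
one_million_fp16 = index_size_bytes(1_000_000, bytes_per_value=2)  # 1M float16 vectors
```

Under these assumptions one vector takes 6,144 bytes, so one million vectors need about 6.1 GB in float32 or about 3.1 GB in float16.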

What is the maximum input length?

The model supports a maximum sequence length of 32768 tokens, with visual tokens limited to 1024 per image.
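
To get a feel for what a 1024-token visual budget means for image size, the sketch below estimates a resized resolution and token count. It assumes each visual token covers a 28×28 pixel area (14-pixel patches merged 2×2, as in Qwen2-VL) and is a simplified approximation of the resizing logic, not the model's actual preprocessing code. The function name `fit_to_token_budget` is illustrative.

```python
import math

PIXELS_PER_TOKEN_SIDE = 28  # assumption: 14 px patches merged 2x2, as in Qwen2-VL
MAX_VISUAL_TOKENS = 1024    # per-image limit stated above


def fit_to_token_budget(
    width: int,
    height: int,
    max_tokens: int = MAX_VISUAL_TOKENS,
    side: int = PIXELS_PER_TOKEN_SIDE,
) -> tuple[int, int, int]:
    """Return (new_width, new_height, visual_tokens) for an image.

    Dimensions are snapped to multiples of `side`; if the image would exceed
    the token budget it is scaled down with its aspect ratio roughly kept.
    Extremely elongated images can still exceed the budget in this sketch.
    """
    new_w = max(side, round(width / side) * side)
    new_h = max(side, round(height / side) * side)
    max_pixels = max_tokens * side * side
    if new_w * new_h > max_pixels:
        scale = math.sqrt(width * height / max_pixels)
        new_w = max(side, math.floor(width / scale / side) * side)
        new_h = max(side, math.floor(height / scale / side) * side)
    tokens = (new_w // side) * (new_h // side)
    return new_w, new_h, tokens


small = fit_to_token_budget(1024, 768)    # already within the budget
large = fit_to_token_budget(3000, 2000)   # scaled down to fit
```

Under these assumptions the budget corresponds to roughly 0.8 megapixels per image, so large document scans are downscaled before encoding.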

How can I call this model via API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with your API key and pass text, image URLs, or both as input.
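
A minimal sketch of what a text-only request could look like, assuming the endpoint follows the standard OpenAI embeddings shape (`POST /embeddings` with `model` and `input` fields). The base URL and API key are placeholders, the hosted model name may differ from the one shown, and the way image URLs are passed is not specified on this page, so images are left out. The request is only constructed here, not sent, since the model is listed as not yet live.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid/v1"  # hypothetical placeholder, not the real endpoint
API_KEY = "YOUR_API_KEY"                      # placeholder
MODEL_ID = "Alibaba-NLP/gme-Qwen2-VL-2B-Instruct"  # assumption: hosted name may differ


def build_embedding_request(texts: list[str]) -> urllib.request.Request:
    """Build an OpenAI-style embeddings request for a batch of texts."""
    payload = {"model": MODEL_ID, "input": texts}
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embedding_request(["a diagram of a transformer encoder"])

# Sending is left commented out because the endpoint above is a placeholder.
# In the OpenAI response shape, vectors are found under data[i].embedding:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     vector = body["data"][0]["embedding"]
```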

What are the license terms?

The model is listed under the apache-2.0 license; see the model card for the full terms.

not yet live

We're benchmarking and onboarding GME Qwen2-VL 2B as a hosted, OpenAI-compatible API.

related embedding models

nomic-embed-text-v1.5