VLM2Vec LoRA

TIGER-Lab/VLM2Vec-LoRA

published Oct 2024 · updated Jul 2025

VLM2Vec LoRA is a multimodal embedding model that converts a vision-language model (Phi-3.5-V) into a universal embedding model via contrastive training on the MMEB benchmark.

status

coming soon

API providers

downloads / mo

license

apache-2.0

specs

Task	Multimodal Embedding
Architecture	Phi-3.5-V with LoRA adapters
Pooling	Last token
Normalization	Yes
Training Data	MMEB-train (20 datasets)

about this model

VLM2Vec is a multimodal embedding model that converts a vision-language model (Phi-3.5-V) into a unified embedding model capable of processing any combination of images and text to generate a fixed-dimensional vector based on task instructions. It is designed for massive multimodal embedding tasks including classification, visual question answering, multimodal retrieval, and visual grounding.

The model is trained on the MMEB benchmark (20 training datasets) using contrastive learning with in-batch negatives and GradCache to increase effective batch size. It is evaluated on 16 MMEB evaluation datasets covering both in-distribution and out-of-distribution tasks. VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models across all MMEB evaluation datasets.

The model processes any combination of images and text to generate a fixed-dimensional vector based on task instructions, using an [EOS] token as the representation of multimodal inputs. Unlike CLIP or BLIP, which encode text or images independently, VLM2Vec can follow task instructions to produce task-specific embeddings.

Bar chart comparing VLM2Vec performance against baseline multimodal embedding models across MMEB evaluation datasets, showing VLM2Vec achieving higher average scores.

The VLM2Vec project has evolved through multiple versions. The V2.0 checkpoint (Apache-2.0 licensed) uses the Qwen2VL architecture. The MMEB benchmark has been extended to MMEB-V3, which covers 190 tasks including audio tasks (classification, cross-modal retrieval, temporal grounding), text retrieval (instruction-following, reasoning, long-context, multi-condition), and agent tasks (tool retrieval, GUI control, agent memory retrieval). MMEB-V3 also introduces OmniSET (Omni-modality Semantic Equivalence Tuples) for controlled analysis of modality effects and instruction-conditioned cross-modal retrieval behavior.

best for

·Multimodal retrieval: finding images that match a text description or vice versa
·Visual question answering via embedding similarity
·Image-text semantic similarity and clustering

FAQ

What is VLM2Vec LoRA?

It is a lightweight LoRA-tuned version of VLM2Vec based on Phi-3.5-V, trained on MMEB-train to produce universal multimodal embeddings.

What input modalities does it support?

It accepts any combination of text and images; the input is a task instruction followed by image tokens and text.

How do I call it via the gigarouter API?

Use the OpenAI-compatible endpoint with your API key; send text and image data as specified in the API documentation.

How does it compare to CLIP or BLIP?

VLM2Vec achieves 10-20% absolute improvement over existing multimodal embedding models on the MMEB benchmark across 36 datasets.

What tasks can it perform?

It covers four meta-tasks: classification, visual question answering, multimodal retrieval, and visual grounding, evaluated on 16 datasets.

not yet live

We're benchmarking and onboarding VLM2Vec LoRA as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related text generation models

tiny-Qwen2ForCausalLM-2.5

9.2M dl/mo

deepseek-v4-gguf

6.4M dl/mo

Qwen3.6-35B-A3B-NVFP4

6.2M dl/mo

gemma-3-270m

5.1M dl/mo