VLM2Vec Full
TIGER-Lab/VLM2Vec-Full
published Oct 2024 · updated Apr 2025
VLM2Vec Full is a multimodal embedding model, produced by converting a vision-language model into a universal embedder for tasks such as classification, retrieval, and visual question answering.
specs
| Task | text-generation |
| Architecture | Phi-3.5-V based (Phi-3.5 vision-instruct) |
| Parameters | 4.15B |
| License | Apache 2.0 |
about this model
VLM2Vec-Full is a multimodal embedding model that converts any combination of images and text into a fixed-dimensional vector, conditioned on a task instruction. It was built by fine-tuning the Phi-3.5-Vision-Instruct vision-language model (4.15B parameters, BF16 precision) on the MMEB benchmark. Earlier models such as CLIP or BLIP encode text and images independently and take no task instruction. VLM2Vec instead processes arbitrary image-text inputs and produces embeddings guided by a user-provided instruction, which supports downstream tasks including classification, visual question answering, multimodal retrieval, and visual grounding.
Key Strengths
- Trained with contrastive learning on MMEB-train (20 datasets) and evaluated on MMEB-eval (16 datasets covering both in-distribution and out-of-distribution tasks).
- Achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in the MMEB benchmark, as reported in the VLM2Vec paper (arXiv:2410.05160).
- Supports four meta-tasks: classification, visual question answering, multimodal retrieval, and visual grounding, across 36 total datasets.
- Released under the Apache 2.0 license.
Performance Overview
According to the VLM2Vec paper, the model outperforms prior multimodal embedding baselines on average across the MMEB meta-tasks. Per-dataset results for the 36 MMEB datasets are available in the VLM2Vec GitHub repository and the associated paper.
Architecture
VLM2Vec-Full is a full-parameter fine-tuned variant of Phi-3.5-Vision-Instruct, trained with a contrastive objective that uses in-batch negatives. It was trained with a batch size of 2048 and produces normalized embeddings via last-token pooling.
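The three pieces named above (last-token pooling, normalization, and a contrastive loss with in-batch negatives) can be illustrated with a small pure-Python sketch. This is not the VLM2Vec training code: the function names are hypothetical, right-padded inputs are assumed, the loss is a standard InfoNCE formulation, and the temperature of 0.02 is an illustrative value rather than a setting stated on this page.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last real token of each sequence.

    hidden_states: list of sequences, each a list of per-token vectors.
    attention_mask: list of 0/1 lists (1 = real token). Right padding is
    assumed, so the last real token sits at index sum(mask) - 1.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last_index = sum(mask) - 1
        pooled.append(list(states[last_index]))
    return pooled


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce_loss(queries, targets, temperature=0.02):
    """Contrastive loss with in-batch negatives.

    targets[i] is the positive for queries[i]; every other target in the
    batch acts as a negative. The temperature is an assumed value.
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, q_i in enumerate(q):
        # Cosine similarity of query i against every target in the batch.
        logits = [sum(a * b for a, b in zip(q_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching target as the correct class.
        total += log_z - logits[i]
    return total / len(q)
```

With in-batch negatives, a larger batch gives each query more negatives to be contrasted against, which is one reason contrastive embedding models tend to be trained with large batches.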
best for
- Multimodal retrieval (image-to-text and text-to-image)
- Visual question answering
- Image classification with task instructions
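All of these use cases reduce to the same operation once embeddings exist: embed the query, embed each candidate (a caption, an answer, or a class label), and rank candidates by similarity. The sketch below shows that ranking step only. The vectors are toy placeholders rather than model outputs, and `rank_candidates` is a hypothetical helper, not part of any VLM2Vec API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(query_vec, candidates):
    """Rank candidates by similarity to the query, best first.

    candidates: dict mapping a label (class name, caption, answer) to its
    embedding. Returns a list of (label, score) pairs.
    """
    scored = [(label, cosine(query_vec, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings.
query = [0.9, 0.1, 0.0]
labels = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0], "car": [0.0, 0.0, 1.0]}
best_label, best_score = rank_candidates(query, labels)[0]
```

For classification, the top-ranked label is the prediction. For retrieval, the top-k entries form the result list.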
FAQ
VLM2Vec Full is best for generating fixed-dimensional embeddings from any combination of images and text, enabling multimodal retrieval, classification, and visual question answering.
Unlike CLIP or BLIP, VLM2Vec Full processes task instructions and can handle arbitrary image-text combinations, with a reported absolute average improvement of 10% to 20% over existing multimodal embedding models on the MMEB benchmark.
Inputs are text prompts with optional image tokens (e.g., <|image_1|>) and corresponding images. The model outputs a normalized embedding vector.
Once the model is hosted, call the gigarouter OpenAI-compatible endpoint with your API key, sending the model name and input data as described in the documentation.
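As a rough sketch of what such a request could look like, the snippet below builds (but does not send) a POST request following the general OpenAI embeddings convention: a JSON body with `model` and `input` fields and a bearer token. The base URL, the model identifier, and the text-only input shape are all placeholders assumed for illustration. This page does not specify them, nor how images are passed, so check the documentation for the actual values.

```python
import json
import urllib.request

# Placeholder values: the real base URL and hosted model ID are not given here.
BASE_URL = "https://api.example.com/v1"
MODEL_ID = "TIGER-Lab/VLM2Vec-Full"


def build_embedding_request(text, api_key, base_url=BASE_URL, model=MODEL_ID):
    """Build an OpenAI-style embeddings request without sending it."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_embedding_request("Represent this caption for retrieval.", "YOUR_API_KEY")
```

Passing the resulting object to `urllib.request.urlopen` would send it. Building the request separately keeps the payload easy to inspect before any credentials leave your machine.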
VLM2Vec Full is released under the Apache 2.0 license.
We're benchmarking and onboarding VLM2Vec Full as a hosted, OpenAI-compatible API.