VLM2Vec Full

TIGER-Lab/VLM2Vec-Full

published Oct 2024 · updated Apr 2025

VLM2Vec Full is a multimodal embedding model (listed under the text-generation task) built by adapting a vision-language model into a universal embedder for tasks such as classification, retrieval, and visual question answering.

status

coming soon

API providers

downloads / mo

529.9K

license

apache-2.0

specs

Task	text-generation
Architecture	Phi-3.5-V based (Phi-3.5-Vision-Instruct)
Parameters	4.15B
License	Apache 2.0

about this model

VLM2Vec-Full is a multimodal embedding model that converts any combination of images and text into a fixed-dimensional vector, conditioned on a task instruction. It was built by fine-tuning the Phi-3.5-Vision-Instruct vision-language model (4.15B parameters, BF16 precision) on the MMEB training split (MMEB-train). Earlier models such as CLIP or BLIP encode text and images independently and take no task instruction; VLM2Vec instead processes arbitrary image-text inputs under a user-provided instruction, which supports downstream tasks including classification, visual question answering, multimodal retrieval, and visual grounding.

Key Strengths

Trained with contrastive learning on MMEB-train (20 datasets) and evaluated on MMEB-eval (16 datasets covering both in-distribution and out-of-distribution tasks).
Achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in the MMEB benchmark, as reported in the VLM2Vec paper (arXiv:2410.05160).
Supports four meta-tasks: classification, visual question answering, multimodal retrieval, and visual grounding, across 36 total datasets.
Released under the Apache 2.0 license.
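
Contrastive learning with in-batch negatives, as mentioned above, can be illustrated with a small InfoNCE-style loss. This is a minimal sketch and not the authors' training code: plain Python lists stand in for tensors, the function name `info_nce_loss` is hypothetical, and the temperature value is an illustrative placeholder rather than a confirmed training setting.

```python
import math

def info_nce_loss(queries, targets, temperature=0.05):
    """Average InfoNCE loss over a batch.

    targets[i] is the positive for queries[i]; every other target in the
    batch serves as an in-batch negative. The temperature is illustrative.
    """
    n = len(queries)
    total = 0.0
    for i in range(n):
        # Similarity of query i to every target in the batch, scaled by temperature.
        logits = [
            sum(q * t for q, t in zip(queries[i], targets[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the positive and the negatives.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching index i as the correct class.
        total += log_denominator - logits[i]
    return total / n

queries = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(queries, targets)
mismatched_loss = info_nce_loss(queries, targets[::-1])
print(aligned_loss < mismatched_loss)  # True
```

The loss is small when each query is closest to its own target and grows when a negative in the batch scores higher than the positive, which is why larger batches supply more negatives per step.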

Performance Overview

The VLM2Vec paper reports an absolute average improvement of 10% to 20% over prior multimodal embedding baselines on MMEB. Per-dataset results for the 36 MMEB datasets are available in the VLM2Vec GitHub repository and the associated paper.

Architecture

VLM2Vec-Full is a full-parameter fine-tune of Phi-3.5-Vision-Instruct, trained contrastively with in-batch negatives at a batch size of 2048. It produces normalized embeddings via last-token pooling.
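
Last-token pooling followed by normalization can be sketched as below. This is an illustration under assumptions, not the model's actual code: the helper names are hypothetical, the hidden states are a toy two-dimensional list, and L2 normalization is assumed as the normalization step.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state at the last non-padding position (mask value 1)."""
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last_index]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length, guarding against division by zero."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

# Toy sequence: three real tokens followed by one padding position.
hidden_states = [[0.1, 0.2], [0.5, 0.5], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

embedding = l2_normalize(last_token_pool(hidden_states, attention_mask))
print(embedding)  # [0.6, 0.8]
```

Picking the last position where the mask is 1 works for both left-padded and right-padded sequences, and the padded position is never used as the embedding.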

best for

·Multimodal retrieval (image-to-text and text-to-image)
·Visual question answering
·Image classification with task instructions
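
For the retrieval use case, ranking usually comes down to comparing a query embedding against candidate embeddings. The sketch below assumes the embeddings are already unit-length, so a dot product equals cosine similarity; the function name, candidate ids, and vectors are all made up for illustration.

```python
def rank_candidates(query, candidates):
    """Sort candidate ids by dot product with the query, highest first.

    For unit-length vectors the dot product equals cosine similarity.
    """
    scores = {
        candidate_id: sum(q * c for q, c in zip(query, vector))
        for candidate_id, vector in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical unit-length embeddings for one image query and three captions.
query = [0.6, 0.8]
candidates = {
    "caption_a": [1.0, 0.0],
    "caption_b": [0.0, 1.0],
    "caption_c": [0.6, 0.8],
}

ranking = rank_candidates(query, candidates)
print(ranking)  # ['caption_c', 'caption_b', 'caption_a']
```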

FAQ

What is VLM2Vec Full best for?

It is best for generating fixed-dimensional embeddings from any combination of images and text, enabling multimodal retrieval, classification, and visual question answering.

How does VLM2Vec Full compare to CLIP or BLIP?

Unlike CLIP or BLIP, VLM2Vec Full conditions its embeddings on task instructions and can handle arbitrary image-text combinations. The VLM2Vec paper reports a 10% to 20% absolute average improvement over existing multimodal embedding models on the MMEB benchmark.

What is the input format for the API?

Inputs are text prompts with optional image tokens (e.g., <|image_1|>) and corresponding images. The model outputs a normalized embedding vector.
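
As a rough illustration of that input format, the helper below assembles a prompt string with numbered image placeholders. The function `build_prompt` and the instruction wording are hypothetical; only the `<|image_1|>` token style comes from the description above, and the hosted API may expect a different layout.

```python
def build_prompt(instruction, text="", num_images=0):
    """Prefix numbered image placeholders, then the task instruction and optional text."""
    image_tokens = [f"<|image_{i}|>" for i in range(1, num_images + 1)]
    parts = image_tokens + [instruction.strip(), text.strip()]
    # Skip empty pieces so no stray spaces appear in the prompt.
    return " ".join(part for part in parts if part)

image_query = build_prompt("Represent the given image.", num_images=1)
text_query = build_prompt("Find an image matching the caption:", "a dog on a beach")
print(image_query)  # <|image_1|> Represent the given image.
print(text_query)   # Find an image matching the caption: a dog on a beach
```

Each placeholder in the prompt is expected to be paired with one image supplied alongside the text, in the same order.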

How can I call VLM2Vec Full via the API?

The hosted API is not yet live. Once it is, use the gigarouter OpenAI-compatible endpoint with your API key, and send a request with the model name and input data as described in the documentation.

What is the license of VLM2Vec Full?

It is released under the Apache 2.0 license.

not yet live

We're benchmarking and onboarding VLM2Vec Full as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
