BLIP-2 OPT 2.7B

Salesforce/blip2-opt-2.7b

published Feb 2023 · updated Feb 2025

BLIP-2 OPT 2.7B is a vision-language model that uses a frozen CLIP-like image encoder and a frozen OPT-2.7B language model bridged by a lightweight Querying Transformer for tasks like image captioning and visual question answering.

est. price

~$0.626

/ 1k images · estimated, set at launch

API providers

downloads / mo

669.8K

license

mit

specs

Task	Vision-Language Pretraining (image captioning, visual question answering)
Architecture	CLIP-like image encoder + Querying Transformer (Q-Former) + OPT-2.7B language model
Parameters	2.7 billion (OPT-2.7B) plus Q-Former
License	BSD-3-Clause

about this model

Salesforce/blip2-opt-2.7b is a vision-language model (VLM) that generates text conditioned on an image and optional textual input. It uses a frozen CLIP-like image encoder, a frozen OPT-2.7b language model, and a lightweight Querying Transformer (Q-Former) that learns to bridge the modality gap. The Q-Former is a BERT-like transformer that maps learnable query tokens into embeddings that the language model can attend to. The model is pre-trained in two stages: first for vision-language representation learning, then for vision-to-language generative learning, both keeping the image encoder and LLM frozen. BLIP-2 achieves strong zero-shot results on vision-language benchmarks. On VQAv2, it scores 65.0, outperforming Flamingo80B by 8.7 percentage points despite having 54× fewer trainable parameters. On NoCaps zero-shot captioning, it achieves a CIDEr score of 121.6, surpassing the previous best of 113.2. The model supports tasks including image captioning, visual question answering, and chat-like interactions by feeding the image and prior conversation as prompt. BLIP-2 architecture diagram showing frozen image encoder, Q-Former, and frozen LLM.

BLIP-2 architecture diagram showing frozen image encoder, Q-Former, and frozen LLM.

The model inherits limitations common to large language models, such as potential bias, hallucination, and sensitivity to training data quality. It has not been tested in real-world applications and is intended for research purposes. The pre-trained checkpoint is available through gigarouter’s hosted API, enabling access without local infrastructure.

best for

·Generating captions for images
·Answering natural language questions about images

FAQ

What is BLIP-2 OPT 2.7B best for?

Image captioning and visual question answering (VQA).

What input formats does the model accept?

An image (via URL or base64) and an optional text prompt.

How do I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, sending a request with image and optional prompt.

What is the approximate GPU memory requirement in float16?

About 7.21 GB total size.

Does the model support conversational chat?

Yes, it can be used for chat-like conversations by feeding the image and previous conversation as prompt.

not yet live

We're benchmarking and onboarding BLIP-2 OPT 2.7B as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related vision-language models

compare all →

Qwen2.5-VL-7B-Instruct

9.8M dl/mo

Qwen3.6-35B-A3B-FP8

6.2M dl/mo

Qwen2.5-VL-3B-Instruct

5.3M dl/mo

gemma-4-26B-A4B-it-AWQ-4bit