BLIP Image Captioning Large

Salesforce/blip-image-captioning-large

published Dec 2022 · updated Feb 2025

BLIP Image Captioning Large is an image-to-text model that generates descriptive captions for images using a Vision-Language Pre-training framework.

est. price

~$0.094

/ 1k images · estimated, set at launch

API providers

downloads / mo

752.9K

license

bsd-3-clause

specs

Task	Image-to-text (image captioning)
Architecture	Vision Transformer Large (ViT-L) backbone with multimodal encoder-decoder
Dataset	COCO (pretrained on 129M image-text pairs)
License	BSD-3-Clause

about this model

Salesforce/blip-image-captioning-large is an image-to-text model that generates descriptive captions for images, built on the BLIP (Bootstrapping Language-Image Pre-training) framework with a Vision Transformer large (ViT-L) backbone. It can produce both unconditional captions and conditional captions guided by a text prompt.

Model Description

BLIP unifies vision-language understanding and generation in a single framework, bootstrapping noisy web data by generating synthetic captions and filtering them. This approach achieves state-of-the-art results across multiple tasks without requiring task-specific architectures.

Key Strengths

Unified architecture – transfers to both understanding tasks (e.g., retrieval) and generation tasks (e.g., captioning, VQA).
Bootstrapping method – a captioner creates synthetic captions and a filter removes noise, effectively leveraging large-scale web data.
Strong generalization – demonstrates zero-shot performance on video-language tasks.

Performance Benchmarks

On the COCO dataset, BLIP with ViT-L achieves the following improvements over prior state-of-the-art models:

Image-text retrieval: +2.7% in average recall@1
Image captioning: +2.8% in CIDEr score
Visual Question Answering: +1.6% in VQA score

These results are reported in the BLIP paper (arXiv:2201.12086). The model is pre-trained on 129 million images and fine-tuned for captioning on COCO.

Usage via Gigarouter

Gigarouter hosts this model as a managed, OpenAI-compatible API. Users send image data and receive generated captions without managing infrastructure. The model supports both conditional and unconditional captioning through a single API endpoint.

best for

·Automatic alt-text generation for images in web content
·Describing scenes for accessibility (screen readers)
·Captioning product images for e-commerce

FAQ

What tasks can BLIP Image Captioning Large perform?

It performs image captioning (both conditional and unconditional), image-text retrieval, and visual question answering when finetuned, but this model is specialized for captioning.

How does this model compare to the smaller BLIP base version?

BLIP Large uses a ViT-L backbone with more parameters, resulting in higher accuracy on captioning benchmarks like CIDEr (+2.8% over prior work). It is slower but more accurate.

What input format does the model expect?

The model accepts an image (processed to RGB tensor) and optionally a text prompt for conditional captioning. The output is a natural language caption string.

How can I use this model via the gigarouter API?

Call the gigarouter OpenAI-compatible endpoint with an API key, send the image as a base64-encoded string or URL, and receive the generated caption in the response.

Is this model free to use?

Yes, it is released under the BSD-3-Clause license, allowing commercial use with attribution. However, the GitHub repo is deprecated; the recommended library is LAVIS.

not yet live

We're benchmarking and onboarding BLIP Image Captioning Large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →

blip-image-captioning-base

trocr-small-handwritten

448.6K dl/mo

PP-LCNet_x1_0_doc_ori

445.3K dl/mo