skip to content
gigarouter gigarouter
models / image-to-text · coming soon

BLIP Image Captioning Large

Salesforce/blip-image-captioning-large

published Dec 2022 · updated Feb 2025

BLIP Image Captioning Large is an image-to-text model that generates descriptive captions for images using a Vision-Language Pre-training framework.

est. price
~$0.094
/ 1k images · estimated, set at launch
API providers
0
downloads / mo
752.9K
license
bsd-3-clause

specs

TaskImage-to-text (image captioning)
ArchitectureVision Transformer Large (ViT-L) backbone with multimodal encoder-decoder
DatasetCOCO (pretrained on 129M image-text pairs)
LicenseBSD-3-Clause

about this model

Salesforce/blip-image-captioning-large is an image-to-text model that generates descriptive captions for images, built on the BLIP (Bootstrapping Language-Image Pre-training) framework with a Vision Transformer large (ViT-L) backbone. It can produce both unconditional captions and conditional captions guided by a text prompt.

Model Description

BLIP unifies vision-language understanding and generation in a single framework, bootstrapping noisy web data by generating synthetic captions and filtering them. This approach achieves state-of-the-art results across multiple tasks without requiring task-specific architectures.

Key Strengths

  • Unified architecture – transfers to both understanding tasks (e.g., retrieval) and generation tasks (e.g., captioning, VQA).
  • Bootstrapping method – a captioner creates synthetic captions and a filter removes noise, effectively leveraging large-scale web data.
  • Strong generalization – demonstrates zero-shot performance on video-language tasks.

Performance Benchmarks

On the COCO dataset, BLIP with ViT-L achieves the following improvements over prior state-of-the-art models:

  • Image-text retrieval: +2.7% in average recall@1
  • Image captioning: +2.8% in CIDEr score
  • Visual Question Answering: +1.6% in VQA score

These results are reported in the BLIP paper (arXiv:2201.12086). The model is pre-trained on 129 million images and fine-tuned for captioning on COCO.

Usage via Gigarouter

Gigarouter hosts this model as a managed, OpenAI-compatible API. Users send image data and receive generated captions without managing infrastructure. The model supports both conditional and unconditional captioning through a single API endpoint.

best for

FAQ

What tasks can BLIP Image Captioning Large perform?

It performs image captioning (both conditional and unconditional), image-text retrieval, and visual question answering when finetuned, but this model is specialized for captioning.

How does this model compare to the smaller BLIP base version?

BLIP Large uses a ViT-L backbone with more parameters, resulting in higher accuracy on captioning benchmarks like CIDEr (+2.8% over prior work). It is slower but more accurate.

What input format does the model expect?

The model accepts an image (processed to RGB tensor) and optionally a text prompt for conditional captioning. The output is a natural language caption string.

How can I use this model via the gigarouter API?

Call the gigarouter OpenAI-compatible endpoint with an API key, send the image as a base64-encoded string or URL, and receive the generated caption in the response.

Is this model free to use?

Yes, it is released under the BSD-3-Clause license, allowing commercial use with attribution. However, the GitHub repo is deprecated; the recommended library is LAVIS.

not yet live

We're benchmarking and onboarding BLIP Image Captioning Large as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →