skip to content
gigarouter gigarouter
models / image-to-text · coming soon

BLIP Image Captioning Base

Salesforce/blip-image-captioning-base

published Dec 2022 · updated Feb 2025

BLIP Image Captioning Base is an image-to-text model that generates descriptive captions for images using a Vision Transformer (ViT-B) backbone, pretrained on the COCO dataset.

status
coming soon
API providers
0
downloads / mo
1.9M
license
bsd-3-clause

specs

TaskImage-to-Text (Image Captioning)
ArchitectureBLIP with ViT-B (Vision Transformer Base) backbone
LicenseCreative Commons Attribution 4.0 International (paper)

about this model

Salesforce BLIP Image Captioning Base is an image-to-text model that generates descriptive captions for images using a Vision Transformer (ViT-B) backbone, pretrained on the COCO dataset. It supports both conditional captioning (given a text prompt) and unconditional captioning.

Capabilities

BLIP (Bootstrapping Language-Image Pre-training) is a unified vision-language framework that handles both understanding and generation tasks. A captioner produces synthetic captions from noisy web data, and a filter removes low-quality ones, improving supervision without requiring clean datasets. The base model achieves state-of-the-art results on benchmarks:

  • Image-Text Retrieval (COCO): Text retrieval R1=82.0, R5=95.8, R10=98.1; image retrieval R1=64.5, R5=86.0, R10=91.7.
  • Image-Text Retrieval (Flickr30k): Text retrieval R1=96.9, R5=99.9, R10=100.0; image retrieval R1=87.5, R5=97.6, R10=98.9.
  • Visual Question Answering (VQAv2): test-dev score 78.23, test-std 78.29.
  • Image Captioning (COCO): BLEU@4=39.9, CIDEr=133.5, SPICE=23.7.
  • Image Captioning (NoCaps): BLEU@4=31.9, CIDEr=109.1, SPICE=14.7.

Architecture and Training

This variant uses a ViT-B backbone pretrained on 129M images. The model integrates bootstrapping (CapFilt) to leverage noisy web data effectively. The BLIP framework also transfers zero-shot to video-language tasks. The original repo is deprecated; the model is now maintained as part of the LAVIS library.

Benchmark Summary

TaskDatasetMetricScore
Image CaptioningCOCOCIDEr133.5
Image CaptioningNoCapsCIDEr109.1
VQAVQAv2 (test-dev)Accuracy78.23
Image-Text RetrievalCOCOTR R182.0
Image-Text RetrievalFlickr30kTR R196.9

This model is hosted by gigarouter as a managed, OpenAI-compatible API—no local installation required.

best for

FAQ

What is this model best for?

BLIP Image Captioning Base is best for generating both unconditional and conditional captions for images, using a lightweight ViT-B architecture.

How does this model compare to larger BLIP variants?

BLIP Base uses ViT-B (smaller) while larger variants like ViT-L or CapFilt-L achieve higher CIDEr scores on COCO captioning (133.5 vs 136.7) but require more compute.

What license does this model use?

The model is released under the Creative Commons Attribution 4.0 International license as per the paper.

What are the input and output formats?

Input: an image file (e.g., JPEG/PNG) and an optional text prompt. Output: a string containing the generated caption.

How can I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, sending the image as a URL or base64-encoded data in the request.

not yet live

We're benchmarking and onboarding BLIP Image Captioning Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →