BLIP Image Captioning Base

Salesforce/blip-image-captioning-base

published Dec 2022 · updated Feb 2025

BLIP Image Captioning Base is an image-to-text model that generates descriptive captions for images using a Vision Transformer (ViT-B) backbone, pretrained on the COCO dataset.

status

coming soon

API providers

downloads / mo

1.9M

license

bsd-3-clause

specs

Task	Image-to-Text (Image Captioning)
Architecture	BLIP with ViT-B (Vision Transformer Base) backbone
License	Creative Commons Attribution 4.0 International (paper)

about this model

Salesforce BLIP Image Captioning Base is an image-to-text model that generates descriptive captions for images using a Vision Transformer (ViT-B) backbone, pretrained on the COCO dataset. It supports both conditional captioning (given a text prompt) and unconditional captioning.

Capabilities

BLIP (Bootstrapping Language-Image Pre-training) is a unified vision-language framework that handles both understanding and generation tasks. A captioner produces synthetic captions from noisy web data, and a filter removes low-quality ones, improving supervision without requiring clean datasets. The base model achieves state-of-the-art results on benchmarks:

Image-Text Retrieval (COCO): Text retrieval R1=82.0, R5=95.8, R10=98.1; image retrieval R1=64.5, R5=86.0, R10=91.7.
Image-Text Retrieval (Flickr30k): Text retrieval R1=96.9, R5=99.9, R10=100.0; image retrieval R1=87.5, R5=97.6, R10=98.9.
Visual Question Answering (VQAv2): test-dev score 78.23, test-std 78.29.
Image Captioning (COCO): BLEU@4=39.9, CIDEr=133.5, SPICE=23.7.
Image Captioning (NoCaps): BLEU@4=31.9, CIDEr=109.1, SPICE=14.7.

Architecture and Training

This variant uses a ViT-B backbone pretrained on 129M images. The model integrates bootstrapping (CapFilt) to leverage noisy web data effectively. The BLIP framework also transfers zero-shot to video-language tasks. The original repo is deprecated; the model is now maintained as part of the LAVIS library.

Benchmark Summary

Task	Dataset	Metric	Score
Image Captioning	COCO	CIDEr	133.5
Image Captioning	NoCaps	CIDEr	109.1
VQA	VQAv2 (test-dev)	Accuracy	78.23
Image-Text Retrieval	COCO	TR R1	82.0
Image-Text Retrieval	Flickr30k	TR R1	96.9

This model is hosted by gigarouter as a managed, OpenAI-compatible API—no local installation required.

best for

·Unconditional image captioning for accessibility descriptions
·Conditional image captioning with text prompts (e.g., "a photography of")
·Generating captions for social media or e-commerce product images

FAQ

What is this model best for?

BLIP Image Captioning Base is best for generating both unconditional and conditional captions for images, using a lightweight ViT-B architecture.

How does this model compare to larger BLIP variants?

BLIP Base uses ViT-B (smaller) while larger variants like ViT-L or CapFilt-L achieve higher CIDEr scores on COCO captioning (133.5 vs 136.7) but require more compute.

What license does this model use?

The model is released under the Creative Commons Attribution 4.0 International license as per the paper.

What are the input and output formats?

Input: an image file (e.g., JPEG/PNG) and an optional text prompt. Output: a string containing the generated caption.

How can I call this model via the gigarouter API?

Use the gigarouter OpenAI-compatible endpoint with your API key, sending the image as a URL or base64-encoded data in the request.

not yet live

We're benchmarking and onboarding BLIP Image Captioning Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →

blip-image-captioning-large

trocr-small-handwritten

448.6K dl/mo

PP-LCNet_x1_0_doc_ori

445.3K dl/mo