BLIP Image Captioning Base
Salesforce/blip-image-captioning-base
published Dec 2022 · updated Feb 2025
BLIP Image Captioning Base is an image-to-text model that generates descriptive captions for images using a Vision Transformer (ViT-B) backbone, pretrained on the COCO dataset.
specs
| Task | Image-to-Text (Image Captioning) |
| Architecture | BLIP with ViT-B (Vision Transformer Base) backbone |
| License | Creative Commons Attribution 4.0 International (paper) |
about this model
Salesforce BLIP Image Captioning Base is an image-to-text model that generates descriptive captions for images using a Vision Transformer (ViT-B) backbone, pretrained on the COCO dataset. It supports both conditional captioning (given a text prompt) and unconditional captioning.
Capabilities
BLIP (Bootstrapping Language-Image Pre-training) is a unified vision-language framework that handles both understanding and generation tasks. A captioner produces synthetic captions from noisy web data, and a filter removes low-quality ones, improving supervision without requiring clean datasets. The base model achieves state-of-the-art results on benchmarks:
- Image-Text Retrieval (COCO): Text retrieval R1=82.0, R5=95.8, R10=98.1; image retrieval R1=64.5, R5=86.0, R10=91.7.
- Image-Text Retrieval (Flickr30k): Text retrieval R1=96.9, R5=99.9, R10=100.0; image retrieval R1=87.5, R5=97.6, R10=98.9.
- Visual Question Answering (VQAv2): test-dev score 78.23, test-std 78.29.
- Image Captioning (COCO): BLEU@4=39.9, CIDEr=133.5, SPICE=23.7.
- Image Captioning (NoCaps): BLEU@4=31.9, CIDEr=109.1, SPICE=14.7.
Architecture and Training
This variant uses a ViT-B backbone pretrained on 129M images. The model integrates bootstrapping (CapFilt) to leverage noisy web data effectively. The BLIP framework also transfers zero-shot to video-language tasks. The original repo is deprecated; the model is now maintained as part of the LAVIS library.
Benchmark Summary
| Task | Dataset | Metric | Score |
|---|---|---|---|
| Image Captioning | COCO | CIDEr | 133.5 |
| Image Captioning | NoCaps | CIDEr | 109.1 |
| VQA | VQAv2 (test-dev) | Accuracy | 78.23 |
| Image-Text Retrieval | COCO | TR R1 | 82.0 |
| Image-Text Retrieval | Flickr30k | TR R1 | 96.9 |
This model is hosted by gigarouter as a managed, OpenAI-compatible API—no local installation required.
best for
- ·Unconditional image captioning for accessibility descriptions
- ·Conditional image captioning with text prompts (e.g., "a photography of")
- ·Generating captions for social media or e-commerce product images
FAQ
BLIP Image Captioning Base is best for generating both unconditional and conditional captions for images, using a lightweight ViT-B architecture.
BLIP Base uses ViT-B (smaller) while larger variants like ViT-L or CapFilt-L achieve higher CIDEr scores on COCO captioning (133.5 vs 136.7) but require more compute.
The model is released under the Creative Commons Attribution 4.0 International license as per the paper.
Input: an image file (e.g., JPEG/PNG) and an optional text prompt. Output: a string containing the generated caption.
Use the gigarouter OpenAI-compatible endpoint with your API key, sending the image as a URL or base64-encoded data in the request.
We're benchmarking and onboarding BLIP Image Captioning Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.