BLIP ITM Base Flickr

Salesforce/blip-itm-base-flickr

published Dec 2022 · updated Feb 2025

BLIP ITM Base Flickr is a vision-language model for image-text matching, fine-tuned on Flickr30k.

status: coming soon
API providers: 0
downloads / mo: 119
license: bsd-3-clause

specs

Task: Image-Text Matching
Architecture: BLIP with ViT-B (Vision Transformer Base) backbone
Parameters: Not specified
License: bsd-3-clause

about this model

Salesforce/blip-itm-base-flickr is an image-text matching (ITM) model that scores how well a text caption matches an image. It is built on the BLIP framework with a ViT-Base backbone and fine-tuned on the Flickr30k dataset.

Capabilities

The model accepts an image and a text description and can return either of two scores: an image-text matching score from a dedicated ITM head on top of the multimodal encoder, or a cosine similarity between the projected image and text features. It is designed for tasks such as image-text retrieval and verifying caption accuracy. BLIP's pre-training handles noisy web data by bootstrapping captions (CapFilt): a captioner generates synthetic captions and a filter removes noisy ones, which makes pre-training on large-scale web data more effective.
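As a rough illustration of the two scores, the sketch below runs the checkpoint locally with the Hugging Face transformers library rather than through gigarouter. The function name `score_image_text` and its arguments are made up for this example, and treating index 1 of the ITM logits as the "match" class is an assumption based on the convention in the original BLIP demo code.

```python
def score_image_text(image_path: str, caption: str):
    """Return (match_probability, cosine_similarity) for one image-caption pair.

    Hypothetical helper for illustration; downloads the checkpoint on first use.
    """
    # Imports are local so the function can be defined without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import BlipForImageTextRetrieval, BlipProcessor

    model_id = "Salesforce/blip-itm-base-flickr"
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForImageTextRetrieval.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=caption, return_tensors="pt")

    with torch.no_grad():
        # ITM head enabled (default): two logits per pair.
        itm_logits = model(**inputs)[0]
        # ITM head disabled: similarity of the projected image and text features.
        cosine = model(**inputs, use_itm_head=False)[0]

    # Assumption: index 1 is the "match" class, index 0 is "no match".
    match_prob = torch.softmax(itm_logits, dim=1)[0, 1].item()
    return match_prob, cosine[0, 0].item()
```

Calling `score_image_text("photo.jpg", "two dogs playing in the snow")` would return a pair of floats. The ITM score is usually the better choice for verifying a single caption, while the cosine score is cheaper to compute across many candidates.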

Training and Performance

The underlying BLIP model with a ViT-Base backbone was pre-trained on 129 million images and then fine-tuned on Flickr30k for retrieval. The BLIP paper reports state-of-the-art results for the framework at the time of publication: image-text retrieval improved by +2.7% in average recall@1, image captioning by +2.8% in CIDEr, and VQA by +1.6% in VQA score. The paper also reports that BLIP generalizes to video-language tasks when transferred directly in a zero-shot manner.
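For readers unfamiliar with the retrieval metric, here is a minimal sketch of how recall@k can be computed from a matrix of matching scores. The function name and data layout are assumptions for illustration and are not taken from the BLIP evaluation code. The "average recall@1" quoted above is typically the mean of this value over the image-to-text and text-to-image directions.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries whose top-k candidates contain a correct match.

    similarity[i][j]: score of query i against candidate j.
    ground_truth[i]:  set of candidate indices that are correct for query i.
    """
    hits = 0
    for scores, correct in zip(similarity, ground_truth):
        # Indices of the k highest-scoring candidates for this query.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        if any(j in correct for j in top_k):
            hits += 1
    return hits / len(similarity)


# Toy example: 2 queries, 2 candidates each.
toy_similarity = [[0.9, 0.1], [0.8, 0.2]]
toy_ground_truth = [{0}, {1}]
toy_r1 = recall_at_k(toy_similarity, toy_ground_truth, k=1)
```

In the toy example only the first query ranks its correct candidate first, so recall@1 is 0.5.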

Usage via gigarouter

gigarouter is onboarding this model as a managed, OpenAI-compatible API; it is not yet live. Once available, no local installation or GPU infrastructure will be required: developers will send image and text inputs to the API endpoint and receive the matching scores in return.

FAQ

What is the BLIP ITM Base Flickr model best for?

It is best for image-text matching tasks, such as scoring how well a caption describes an image, and for image-text retrieval, particularly on photo-caption data similar to Flickr30k.

How does this model compare in size to other BLIP variants?

This is the base version with a ViT-B backbone, pre-trained on 129M images. It is smaller than the ViT-L variant and does not use the CapFilt large captioning/filtering stage.

What are the input and output formats for this model?

Input: an image and a text string. Output: with the ITM head enabled, a pair of image-text matching logits (no match / match); with the ITM head disabled, a cosine similarity score between the image and text features.
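To make the two output types concrete, the sketch below shows how they are usually interpreted, using only the Python standard library. The function names are illustrative, and the [no match, match] ordering of the logits is an assumption based on the original BLIP demo code.

```python
import math


def match_probability(itm_logits):
    """Softmax over [no_match, match] logits; returns P(match)."""
    no_match, match = itm_logits
    shift = max(no_match, match)  # subtract the max for numerical stability
    exp_no = math.exp(no_match - shift)
    exp_yes = math.exp(match - shift)
    return exp_yes / (exp_no + exp_yes)


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

The match probability lies in (0, 1) and can be thresholded directly, while the cosine score lies in [-1, 1] and is mainly useful for ranking candidates against each other.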

How can I call this model via the API?

Once the model is live, use the gigarouter OpenAI-compatible endpoint with your API key, sending image and text inputs as described in the API documentation.

Is this model still actively maintained?

The original BLIP codebase is deprecated and no longer supported; the model is now integrated into the LAVIS library. The hosted API on gigarouter is not yet live; once launched, it will be available for inference.

not yet live

We're benchmarking and onboarding BLIP ITM Base Flickr as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
