BLIP ITM Large Flickr

Salesforce/blip-itm-large-flickr

published Dec 2022 · updated Feb 2025

BLIP ITM Large Flickr is a vision-language model fine-tuned for image-text matching on the Flickr30k dataset using a ViT large backbone.

status

coming soon

license

bsd-3-clause

specs

Task	Image-Text Matching
Architecture	BLIP with ViT large backbone
License	BSD 3-Clause (bsd-3-clause)

about this model

Salesforce/blip-itm-large-flickr is an image-text matching model that computes the alignment score between an image and a text caption, built on the BLIP (Bootstrapping Language-Image Pre-training) framework with a ViT large backbone and fine-tuned on the Flickr30k dataset.

BLIP is a unified vision-language pre-training framework designed to transfer flexibly to both understanding and generation tasks. Unlike earlier VLP models that specialized in one type of task, BLIP handles both. It addresses the problem of noisy web-sourced image-text pairs through a bootstrapping approach: a captioner generates synthetic captions and a filter removes noisy ones, enabling more effective use of large-scale web data.

Architecture and Capabilities

This variant uses a ViT-Large backbone and is fine-tuned on Flickr30k for the image-text matching (ITM) task. Given an image and a text caption, it produces two alignment signals: an ITM score from a dedicated matching head, and a cosine similarity score between the image and text embeddings. It can also be used for multimodal feature extraction. Unlike the BLIP captioning checkpoints, this checkpoint scores image-text pairs and is not intended for generating captions.
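To make the two outputs concrete, here is a minimal sketch of how they are commonly interpreted. It assumes the matching head emits a two-way logit pair ordered as [no match, match], as in the reference BLIP demo; the numbers are made up for illustration and do not come from the model.

```python
import math


def match_probability(itm_logits):
    """Turn a 2-way [no_match, match] logit pair into P(match) via softmax.

    The [no_match, match] ordering is an assumption of this sketch.
    """
    no_match, match = itm_logits
    # Subtract the max before exponentiating for numerical stability.
    m = max(no_match, match)
    e_no, e_yes = math.exp(no_match - m), math.exp(match - m)
    return e_yes / (e_no + e_yes)


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Illustrative values only, not real model outputs.
print(match_probability([-1.2, 2.3]))
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

The ITM probability comes from a classifier that looks at the image and text jointly, while the cosine score compares two independently computed embeddings, so the two numbers are not on the same scale and need separate thresholds.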

Benchmark Performance

For the BLIP framework as a whole, rather than this checkpoint specifically, the authors report state-of-the-art results on several vision-language tasks, with the following gains over the previous best:

Image-text retrieval: +2.7% in average recall@1
Image captioning: +2.8% in CIDEr
Visual question answering: +1.6% in VQA score

BLIP also shows strong generalization when transferred zero-shot to video-language tasks.
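For readers unfamiliar with the retrieval metric cited above, recall@K is the fraction of queries whose top K ranked candidates include at least one correct item. The sketch below computes it from a score matrix. The function name, the toy scores, and the ground-truth sets are illustrative assumptions, not the authors' evaluation code.

```python
def recall_at_k(similarity, ground_truth, k=1):
    """Fraction of queries with a correct candidate in their top-k.

    similarity[i][j]: score of query i against candidate j.
    ground_truth[i]: set of candidate indices that are correct for query i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if any(j in ground_truth[i] for j in ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy 3x3 score matrix: rows are queries, columns are candidates.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
truth = [{0}, {1}, {2}]

print(recall_at_k(scores, truth, k=1))
print(recall_at_k(scores, truth, k=2))
```

In this toy case only the first query ranks its correct candidate first, while all three have it within the top two.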

Model Variants

The BLIP framework includes multiple pre-trained and fine-tuned checkpoints. This model is the large variant (ViT-L) fine-tuned on Flickr30k for image-text retrieval. The BLIP framework is now integrated into the LAVIS library.

Ethical Considerations

This model is released for research purposes. It has not been specifically designed or evaluated for all downstream applications. Users should assess accuracy, safety, and fairness before deploying, particularly in high-risk scenarios.

best for

·Scoring image-text pairs for relevance
·Retrieving images that match a given caption
·Retrieving captions that match a given image
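The retrieval use cases above reduce to scoring every candidate against the query and sorting. A minimal sketch follows, with a toy word-overlap scorer standing in for the model. `rank_captions` and `toy_score` are hypothetical names introduced here; in practice the scorer would be the model's ITM or cosine score.

```python
from typing import Callable, List, Sequence, Tuple


def rank_captions(
    image: object,
    captions: Sequence[str],
    score_fn: Callable[[object, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score each caption against the image and return the top_k, best first."""
    scored = [(caption, score_fn(image, caption)) for caption in captions]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_score(image_tags, caption):
    """Stand-in scorer: share of distinct caption words that are image tags."""
    words = set(caption.lower().split())
    return len(set(image_tags) & words) / len(words)


# The "image" is represented by a set of tags purely for illustration.
image = {"dog", "beach"}
captions = ["a dog on the beach", "a cat on a sofa", "a dog in a park"]

for caption, score in rank_captions(image, captions, toy_score, top_k=2):
    print(f"{score:.2f}  {caption}")
```

Retrieving images for a caption works the same way with the roles swapped: the caption is the query and the images are the candidates.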

FAQ

What is BLIP ITM Large Flickr best for?

It is best for image-text matching tasks, such as scoring how well a caption describes an image or retrieving relevant images/captions.

What architecture does this model use?

It uses the BLIP framework with a ViT-Large backbone, fine-tuned on Flickr30k.

What is the license for this model?

It is released under the BSD 3-Clause license (bsd-3-clause).

How do I call this model via the API?

The hosted API is not yet live. Once it is available, you will call the gigarouter OpenAI-compatible endpoint with your API key, sending an image and a text caption to get matching scores.

What input format does the model expect?

It expects an image and a text caption; the processor tokenizes the text and preprocesses the image for the model.
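For local use, a minimal sketch with the Hugging Face `transformers` library might look like the following. It assumes `torch`, `transformers`, and `Pillow` are installed and that the checkpoint can be downloaded. The helper name `score_pair` is hypothetical, and reading index 1 of the softmax as the "match" class is an assumption carried over from the reference BLIP demo.

```python
def score_pair(image_path, caption, model_id="Salesforce/blip-itm-large-flickr"):
    """Return (ITM match probability, cosine similarity) for one image-caption pair."""
    # Imported lazily so the heavy dependencies load only when scoring is requested.
    import torch
    from PIL import Image
    from transformers import BlipForImageTextRetrieval, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForImageTextRetrieval.from_pretrained(model_id)
    model.eval()

    # The processor resizes and normalizes the image and tokenizes the caption.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=caption, return_tensors="pt")

    with torch.no_grad():
        # Matching head: logits of shape (1, 2).
        itm_logits = model(**inputs)[0]
        # Embedding comparison: similarity of shape (1, 1).
        cosine = model(**inputs, use_itm_head=False)[0]

    # Assumption: index 1 of the softmax is the "match" class.
    match_prob = torch.softmax(itm_logits, dim=1)[0, 1].item()
    return match_prob, cosine[0, 0].item()
```

A call such as `score_pair("photo.jpg", "a dog on the beach")` (with a placeholder file name) would return the two scores. When scoring many pairs, load the processor and model once outside the function instead of on every call.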

not yet live

We're benchmarking and onboarding BLIP ITM Large Flickr as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
