Question 1

What is the BLIP ITM Base Flickr model best for?

Accepted Answer

It is best for image-text matching tasks, such as scoring how well a caption describes an image, and for image-text retrieval on Flickr30k data.
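As a rough illustration of the retrieval use case, candidate captions for one image can be ranked by their match scores. This is only a sketch: `rank_captions` is a hypothetical helper, and the scores below are made-up placeholder values, not output from the model.

```python
def rank_captions(scores):
    """Return captions ordered from best to worst match.

    `scores` maps each candidate caption to a match score for one image
    (higher means a better match). Hypothetical helper for illustration.
    """
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder scores invented for this example; not real model output.
example_scores = {
    "a dog runs on the beach": 0.91,
    "a plate of pasta": 0.03,
    "two dogs play in the sand": 0.64,
}

ranking = rank_captions(example_scores)
for position, caption in enumerate(ranking, start=1):
    print(position, caption)
```

The same ranking step works in the other direction (one caption, many images) by keying the scores on image identifiers instead.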

Question 2

How does this model compare in size to other BLIP variants?

Accepted Answer

This is the base version, built on a ViT-B backbone and pre-trained on 129M images. It is smaller than the ViT-L variant and does not use the large CapFilt captioner and filter (CapFilt-L).

Question 3

What are the input and output formats for this model?

Accepted Answer

Input: an image and a text string. Output: an image-text matching score (logits) when the ITM head is used, or a cosine similarity score between the image and text embeddings when it is not.
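To make the two output types concrete, here is a minimal sketch in plain Python of how each could be post-processed. It assumes the ITM head returns two logits per image-text pair ordered as [no match, match]; that ordering is an assumption to verify against the implementation you use. The function names are hypothetical.

```python
import math


def match_probability(itm_logits):
    """Softmax over two ITM logits; returns the probability of 'match'.

    Assumes the logits are ordered [no_match, match] (an assumption).
    """
    no_match, match = itm_logits
    peak = max(no_match, match)  # subtract the max for numerical stability
    exp_no = math.exp(no_match - peak)
    exp_yes = math.exp(match - peak)
    return exp_yes / (exp_no + exp_yes)


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

The matching probability lies between 0 and 1, while the cosine similarity lies between -1 and 1, so the two scores are not directly comparable and thresholds should be chosen separately for each.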

Question 4

How can I call this model via the API?

Accepted Answer

Send requests to the gigarouter OpenAI-compatible endpoint, authenticated with your API key, and pass the image and text as inputs. See the API documentation for the exact request format.

Question 5

Is this model still actively maintained?

Accepted Answer

The original BLIP codebase is deprecated and no longer supported; the model is now integrated into the LAVIS library. The hosted API on gigarouter remains available for inference.

Task	Image-Text Matching
Architecture	BLIP with ViT-B (Vision Transformer Base) backbone
Parameters	Not specified
License	Not specified
