
BiomedCLIP

microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224

published Apr 2023 · updated Jan 2025

BiomedCLIP is a biomedical vision-language model for zero-shot image classification, cross-modal retrieval, and visual question answering. It was trained with contrastive learning on 15 million image-text pairs from PubMed Central.

status: coming soon
API providers: 0
downloads / mo: 565.8K
license: MIT

specs

Task: Zero-shot Image Classification, Cross-modal Retrieval, Visual Question Answering
Architecture: PubMedBERT text encoder + Vision Transformer (ViT-B/16) image encoder
License: MIT

about this model

BiomedCLIP-PubMedBERT_256-vit_base_patch16_224 is a biomedical vision-language processing (VLP) model used for zero-shot image classification, cross-modal retrieval, and visual question answering. It pairs PubMedBERT as the text encoder with a Vision Transformer (ViT-B/16) as the image encoder. Both were pretrained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from 4.4 million scientific articles in PubMed Central. PMC-15M is two orders of magnitude larger than prior biomedical multimodal datasets such as MIMIC-CXR.
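To make "contrastive learning on figure-caption pairs" concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python. It is an illustration under stated assumptions, not the authors' training code. This page says only that contrastive learning was used, so the symmetric form, the temperature of 0.07, and the toy similarity matrices below are all assumptions. Entry `sim[i][j]` stands for the cosine similarity between image `i` and caption `j`, and matching pairs sit on the diagonal.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(z - m) for z in row))
    return [z - lse for z in row]


def clip_style_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over an N x N similarity matrix.

    temperature=0.07 is an assumed value chosen for illustration.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image -> text: for image i, the correct caption is column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


# Toy similarities (made up): matched pairs score high on the diagonal.
aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
# Same batch with the first two captions swapped, so pairs 0 and 1 mismatch.
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]

loss_aligned = clip_style_loss(aligned)
loss_shuffled = clip_style_loss(shuffled)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is lower when each figure is most similar to its own caption, which is the signal that pulls the two encoders into a shared embedding space.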

Figure: Comparison of BiomedCLIP performance against prior VLP approaches across multiple biomedical benchmarks.

At release, BiomedCLIP was reported to set new state-of-the-art results on a broad set of standard biomedical imaging benchmarks spanning retrieval, classification, and visual question answering, supported by extensive experiments and ablation studies. On radiology tasks such as RSNA pneumonia detection, it outperforms both general-purpose VLP models and radiology-specific models such as BioViL. It can also serve as a privacy-preserving proxy for analyzing proprietary biomedical data.

The model was trained on English-language corpora and is therefore limited to English text inputs. It is released under the MIT license.

FAQ

What is BiomedCLIP best for?

Biomedical zero-shot image classification, cross-modal retrieval, and visual question answering, where it was reported to reach state-of-the-art results on standard biomedical benchmarks.

What are the input and output formats?

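Input: a preprocessed image (224×224 pixels for this ViT-B/16 variant, per the model name) and one text prompt per candidate label (e.g., "this is a photo of [label]"). Output: image-text similarity logits, which a softmax over the candidate labels turns into classification probabilities.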
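The step from similarity logits to classification probabilities can be sketched in plain Python. This is a minimal illustration of CLIP-style zero-shot scoring, not the model's actual inference code. The three labels, the 3-dimensional toy embeddings, and the `logit_scale` of 100 are assumptions made up for the example. In practice the embeddings would come from the image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Return (logits, probabilities) for one image against N text prompts.

    logit_scale=100.0 is an assumed value chosen for illustration.
    """
    img = l2_normalize(image_emb)
    logits = []
    for t in text_embs:
        txt = l2_normalize(t)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    return logits, softmax(logits)


# Hypothetical candidate labels, wrapped in the prompt template from the FAQ.
labels = ["chest X-ray", "brain MRI", "histopathology"]
prompts = [f"this is a photo of {label}" for label in labels]

# Toy embeddings (made up); real ones come from the two encoders.
image_emb = [0.9, 0.1, 0.2]
text_embs = [[1.0, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.2, 1.0]]

logits, probs = zero_shot_probs(image_emb, text_embs)
best = labels[probs.index(max(probs))]
for prompt, p in zip(prompts, probs):
    print(f"{prompt}: {p:.4f}")
```

Because the probabilities are a softmax over the supplied prompts only, they are relative to the candidate label set: adding or removing a label changes every score.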

What license does BiomedCLIP use?

It is released under the MIT license.

How can I call this model via the gigarouter API?

BiomedCLIP is not yet live on gigarouter. Once it is hosted, it will be callable through the OpenAI-compatible endpoint with your API key; see the gigarouter documentation for details.

What was the training data?

It was trained on PMC-15M, a dataset of 15 million figure-caption pairs from 4.4 million biomedical articles.

not yet live

We're benchmarking and onboarding BiomedCLIP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
