ViT GPT-2 Image Captioning

nlpconnect/vit-gpt2-image-captioning

published Mar 2022 · updated Feb 2023

ViT GPT-2 Image Captioning is an image-to-text model that generates captions for images using a Vision Transformer encoder and a GPT-2 decoder.

status

coming soon

API providers

downloads / mo

64.4K

license

apache-2.0

specs

Task	Image-to-Text (Image Captioning)
Architecture	Vision Encoder-Decoder (ViT encoder + GPT-2 decoder)
Dataset	COCO 2017

about this model

nlpconnect/vit-gpt2-image-captioning is an image-to-text model that generates captions for input images using a Vision Transformer (ViT) encoder and a GPT-2 decoder. The encoder is google/vit-base-patch16-224-in21k and the decoder is GPT-2. The model was fine-tuned on the COCO 2017 dataset as a proof-of-concept for the FlaxVisionEncoderDecoder framework; it is not a state-of-the-art model and no benchmark scores (e.g., BLEU, CIDEr) are publicly available.

Diagram illustrating the image captioning process using transformers

Generation parameters used in the reference implementation: max_length=16 and num_beams=4. Sample outputs from the model card show a tendency toward repetitive captions (e.g., "a woman in a hospital bed with a woman in a hospital bed"), which may indicate a limitation in generation quality. The model weights have approximately 4,900 all-time downloads on Hugging Face.

Gigarouter hosts this model as a managed, OpenAI-compatible API, allowing developers to integrate image captioning without managing infrastructure or dependencies.

best for

·Generating descriptive captions for user-uploaded images
·Assisting visually impaired users with image descriptions
·Automating image metadata for content management systems

FAQ

What is the architecture of ViT GPT-2 Image Captioning?

It uses a Vision Transformer (ViT) as the encoder and GPT-2 as the decoder, forming a VisionEncoderDecoderModel.

What dataset was this model trained on?

It was fine-tuned on the COCO 2017 dataset for image captioning.

Is this model state-of-the-art?

No, it is a proof-of-concept fine-tuned with the FlaxVisionEncoderDecoder framework and is not intended to be state-of-the-art.

How can I use this model via the gigarouter API?

Send a POST request to the gigarouter OpenAI-compatible endpoint with your API key and an image URL or base64 data; the response will contain the generated caption.

What are the input and output formats?

Input is an image (URL or base64), output is a plain text caption string.

not yet live

We're benchmarking and onboarding ViT GPT-2 Image Captioning as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related image-to-text models

compare all →

blip-image-captioning-base

1.9M dl/mo

blip-image-captioning-large

trocr-small-handwritten

448.6K dl/mo