
AltCLIP

BAAI/AltCLIP

published Nov 2022 · updated Apr 2025

AltCLIP is a bilingual Chinese-English zero-shot image-text representation model that aligns images with text using a modified CLIP architecture with an XLM-R text encoder.

status: coming soon
API providers: 0
downloads / mo: 112.2K
license: creativeml-openrail-m

specs

Task: Zero-Shot Image-Text Representation & Retrieval
Architecture: ViT-L image encoder + XLM-R text encoder
Languages: Chinese, English
Model Size: 3.22 GB
Training Data: WuDao dataset & LAION-5B

about this model

AltCLIP is a bilingual (Chinese-English) multimodal representation model that extends OpenAI's CLIP by replacing its text encoder with XLM-R, enabling strong zero-shot image classification and cross-modal retrieval in both languages. The model is trained in two stages: first a teacher-learning stage (knowledge distillation on parallel text corpora), then a contrastive-learning stage on approximately 2 million Chinese and English image-text pairs. Both AltCLIP and its 9-language variant AltCLIP-m9 (supporting English, Chinese, Spanish, French, Russian, Japanese, Korean, Arabic, and Italian) use a ViT-L image encoder and have a model size of 3.22 GB. On the Flickr30k benchmark, AltCLIP achieves the following retrieval performance (R@1/R@5/R@10):

English

Task            R@1    R@5    R@10
Text-to-Image   66.3   87.8   92.7
Image-to-Text   85.9   97.7   99.1

Chinese

Task            R@1    R@5    R@10
Text-to-Image   63.7   86.3   92.1
Image-to-Text   84.7   97.4   98.7

The enhanced variant AltCLIP* (trained with additional data) sets new state-of-the-art results on ImageNet-CN, Flickr30k-CN, COCO-CN, and XTD, achieving a mean recall of 90.4 on English and 89.2 on Chinese tasks.

[Figure: performance comparison chart showing AltCLIP results across retrieval tasks]

AltCLIP also serves as the text-image backbone for the AltDiffusion text-to-image generation model.

[Figure: images generated by the AltDiffusion model, which is based on AltCLIP]
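
For readers unfamiliar with the R@1/R@5/R@10 columns, the sketch below shows one common way recall@K is computed from a query-by-candidate similarity matrix. It is a simplified illustration, not the evaluation code behind the numbers on this page: the function names and the toy matrix are hypothetical, and it assumes exactly one correct candidate per query (query i matches candidate i), whereas Flickr30k pairs each image with several captions. Mean recall is taken here as the plain average of the recall values passed in, which is also an assumption.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


def mean_recall(recalls):
    """Plain average of a list of recall values (assumed definition)."""
    return sum(recalls) / len(recalls)


# Toy 3x3 similarity matrix (made-up scores, not model output).
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct candidate is ranked first
    [0.8, 0.5, 0.1],  # query 1: correct candidate is ranked second
    [0.7, 0.6, 0.3],  # query 2: correct candidate is ranked third
]

r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
r3 = recall_at_k(sim, 3)
print(f"R@1={r1:.1f} R@2={r2:.1f} R@3={r3:.1f} mean={mean_recall([r1, r2, r3]):.1f}")
```

In a real evaluation the matrix would hold image-text similarity scores from the model, computed once per retrieval direction (text-to-image and image-to-text).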


FAQ

What is AltCLIP?

AltCLIP is a bilingual Chinese-English zero-shot image-text model based on CLIP, using XLM-R as its text encoder for multilingual capabilities.

What languages does AltCLIP support?

It supports Chinese and English. A variant, AltCLIP-m9, also supports Spanish, French, Russian, Japanese, Korean, Arabic, and Italian.

How does AltCLIP compare to OpenAI's CLIP?

AltCLIP achieves similar performance to CLIP on English tasks while adding strong Chinese capabilities. The enhanced variant AltCLIP* sets state-of-the-art results on Chinese benchmarks including ImageNet-CN, Flickr30k-CN, and COCO-CN.

What are the input and output formats?

Input: text strings and images (as PIL images or URLs). Output: similarity scores or embeddings for text-image pairs.
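
To make the output side concrete, here is a minimal sketch of how CLIP-style embeddings are usually turned into similarity scores and zero-shot label probabilities. It does not call AltCLIP or any hosted API: the three-dimensional vectors stand in for real embeddings, the helper names are hypothetical, and the `logit_scale` of 100 is an illustrative temperature in the range CLIP-style models commonly use, not a value taken from this page.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Probability of each text label for one image embedding.

    logit_scale is an assumed temperature, not a documented value.
    """
    sims = [cosine_similarity(image_emb, t) for t in text_embs]
    return softmax([logit_scale * s for s in sims])


# Toy embeddings (made up); real ones would come from the model.
image_emb = [0.9, 0.1, 0.0]
text_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "一张狗的照片": [0.0, 1.0, 0.0],  # "a photo of a dog"
}

labels = list(text_embs)
probs = zero_shot_probs(image_emb, [text_embs[label] for label in labels])
best_label = labels[probs.index(max(probs))]
print(best_label, [round(p, 4) for p in probs])
```

Because the text encoder is bilingual, candidate labels in Chinese and English can be mixed in the same list, as the toy example does.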

How can I call AltCLIP via the API?

AltCLIP is not yet live on gigarouter. Once it is hosted, use the gigarouter OpenAI-compatible endpoint with an API key, sending text prompts and images as specified in the documentation.

not yet live

We're benchmarking and onboarding AltCLIP as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
