AltCLIP

BAAI/AltCLIP

published Nov 2022 · updated Apr 2025

AltCLIP is a bilingual Chinese-English zero-shot image-text representation model that aligns images with text using a modified CLIP architecture with an XLM-R text encoder.

status

coming soon

API providers

downloads / mo

112.2K

license

creativeml-openrail-m

specs

Task	Zero-Shot Image-Text Representation & Retrieval
Architecture	ViT-L image encoder + XLM-R text encoder
Languages	Chinese, English
Model Size	3.22 GB
Training Data	WuDao dataset & LAION-5B

about this model

AltCLIP is a bilingual (Chinese-English) multimodal representation model that extends OpenAI's CLIP by replacing its text encoder with XLM-R, enabling strong zero-shot image classification and cross-modal retrieval in both languages. The model uses a two-stage training process: first, parallel knowledge distillation (teacher learning) on large-scale text corpora, then bilingual contrastive learning on approximately 2 million Chinese-English image-text pairs. Both AltCLIP and its 9-language variant AltCLIP-m9 (supporting English, Chinese, Spanish, French, Russian, Japanese, Korean, Arabic, and Italian) use a ViT-L image encoder and have a model size of 3.22 GB. On the Flickr30k benchmark, AltCLIP achieves the following retrieval performance (R@1/R@5/R@10):

English

Task	R@1	R@5	R@10
Text-to-Image	66.3	87.8	92.7
Image-to-Text	85.9	97.7	99.1

Chinese

Task	R@1	R@5	R@10
Text-to-Image	63.7	86.3	92.1
Image-to-Text	84.7	97.4	98.7
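
For readers unfamiliar with the metric, R@K is the share of queries whose correct match appears among the K highest-scoring candidates. The sketch below illustrates the idea on made-up scores. The function name `recall_at_k` and the toy matrix are assumptions for illustration, and it assumes a single correct candidate per query, whereas Flickr30k pairs each image with several captions.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct item is among the k highest-scoring candidates.

    similarity[i][j] is the score of candidate j for query i;
    ground_truth[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy similarity matrix: 3 text queries scored against 4 images (made-up numbers).
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct image 0 is ranked first
    [0.4, 0.2, 0.8, 0.5],  # correct image 1 is ranked last
    [0.1, 0.3, 0.6, 0.7],  # correct image 2 is ranked second
]
ground_truth = [0, 1, 2]

r_at_1 = recall_at_k(similarity, ground_truth, 1)  # one of three queries hits
r_at_2 = recall_at_k(similarity, ground_truth, 2)  # two of three queries hit
```

Benchmark tables such as the ones above report this fraction as a percentage.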

The enhanced variant AltCLIP* (trained with additional data) sets new state-of-the-art results on ImageNet-CN, Flickr30k-CN, COCO-CN, and XTD, achieving a mean recall of 90.4 on English and 89.2 on Chinese tasks.

Performance comparison chart showing AltCLIP results across retrieval tasks

AltCLIP also serves as the text-image backbone for the AltDiffusion text-to-image generation model.

Visualization of images generated by AltDiffusion model based on AltCLIP
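
The two-stage training described earlier in this section (teacher learning, then contrastive learning) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not BAAI's training code: the use of mean squared error for distillation, the temperature of 0.07, and every function name here are choices made for the sketch.

```python
import math


def mse_distillation_loss(student, teacher):
    """Stage 1 sketch: mean squared error between student and teacher text embeddings."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diagonal(logits):
    """Average cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Stage 2 sketch: symmetric CLIP-style loss over a batch of matched pairs."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


# Toy batch: matched pairs should score a lower loss than mismatched ones.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In stage 1 the student would be the XLM-R encoder and the teacher the original CLIP text encoder. In stage 2 the batch would hold Chinese and English captions paired with their images.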

best for

·Bilingual (Chinese-English) image and text retrieval
·Zero-shot image classification and caption ranking
·Powering multilingual text-to-image generation (AltDiffusion)

FAQ

What is AltCLIP?

AltCLIP is a bilingual Chinese-English zero-shot image-text model based on CLIP, using XLM-R as its text encoder for multilingual capabilities.

What languages does AltCLIP support?

It supports Chinese and English. The AltCLIP-m9 variant adds Spanish, French, Russian, Japanese, Korean, Arabic, and Italian.

How does AltCLIP compare to OpenAI's CLIP?

AltCLIP performs close to CLIP on English tasks while adding strong Chinese support (and further languages in AltCLIP-m9). The enhanced AltCLIP* variant reports state-of-the-art results on Chinese benchmarks including ImageNet-CN, Flickr30k-CN, and COCO-CN.

What are the input and output formats?

Input: text strings and images (as PIL images or image URLs). Output: similarity scores or embeddings for text-image pairs.
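
As a rough local illustration of these formats, the sketch below scores one PIL image against a list of candidate labels using the `AltCLIPModel` and `AltCLIPProcessor` classes from the Hugging Face `transformers` library. It assumes `transformers`, `torch`, and `pillow` are installed and that the `BAAI/AltCLIP` checkpoint can be downloaded. The helper names `softmax` and `rank_labels` are hypothetical, and this is not the hosted API.

```python
import math


def softmax(scores):
    """Convert raw similarity scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(image, labels, model_id="BAAI/AltCLIP"):
    """Score one PIL image against candidate labels; returns (label, probability) pairs.

    The heavy imports live inside the function because the checkpoint
    is several gigabytes and is downloaded on first use.
    """
    from transformers import AltCLIPModel, AltCLIPProcessor

    model = AltCLIPModel.from_pretrained(model_id)
    processor = AltCLIPProcessor.from_pretrained(model_id)
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image has shape (1, len(labels)): one similarity score per label.
    scores = outputs.logits_per_image[0].detach().tolist()
    return sorted(zip(labels, softmax(scores)), key=lambda pair: pair[1], reverse=True)
```

A call might look like `rank_labels(Image.open("cat.jpg"), ["a photo of a cat", "一张狗的照片"])`, where `cat.jpg` is a placeholder path. For raw embeddings instead of scores, the model also exposes `get_text_features` and `get_image_features`.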

How can I call AltCLIP via the API?

AltCLIP is not yet live on the API. Once it launches, call it through the gigarouter OpenAI-compatible endpoint with an API key, sending text and images as specified in the documentation.

not yet live

We're benchmarking and onboarding AltCLIP as a hosted, OpenAI-compatible API.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336