Chinese CLIP ViT-Base-Patch16
OFA-Sys/chinese-clip-vit-base-patch16
published Nov 2022 · updated Dec 2022
Chinese CLIP ViT-Base-Patch16 is a zero-shot image-text model that computes image and text embeddings and similarity scores for Chinese-language content, using a ViT-B/16 image encoder and a RoBERTa-wwm-base text encoder.
specs
| Task | Zero-shot image-text retrieval, zero-shot image classification, cross-modal similarity computation |
| Architecture | ViT-B/16 image encoder + RoBERTa-wwm-base text encoder |
| Training Data | ~200 million Chinese image-text pairs |
about this model
Key Capabilities
The model supports zero-shot image classification, text-to-image retrieval, and image-to-text retrieval without task-specific fine-tuning. It can compute normalized image and text features and produce similarity scores between visual and textual inputs.
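As a rough illustration of how normalized features turn into similarity scores, here is a minimal pure-Python sketch. It assumes the embeddings have already been produced by the model (for example via `get_image_features` and `get_text_features` on Hugging Face's `ChineseCLIPModel`). The toy vectors and the `logit_scale` value are made up for illustration and are not taken from the model.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_scores(image_emb, text_embs, logit_scale=100.0):
    """Return softmax probabilities of one image against several texts.

    logit_scale is an assumed temperature; CLIP-style models learn this value.
    """
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Numerically stable softmax over the text candidates
    max_logit = max(logits)
    exps = [math.exp(value - max_logit) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3-dimensional embeddings (hypothetical; real embeddings are much larger)
image_emb = [0.9, 0.1, 0.2]
text_embs = {
    "皮卡丘": [0.8, 0.2, 0.1],
    "杰尼龟": [0.1, 0.9, 0.3],
    "小火龙": [0.2, 0.1, 0.9],
}
probs = similarity_scores(image_emb, list(text_embs.values()))
best_label = max(zip(text_embs.keys(), probs), key=lambda pair: pair[1])[0]
print(best_label)
```

Taking the highest-scoring text as the label is what makes this a zero-shot classifier. Ranking many images against one text, or many texts against one image, gives the two retrieval directions.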
Benchmark Performance
Zero-shot retrieval results (recall, %):
| Benchmark | Direction | R@1 | R@5 | R@10 |
| MUGE | Text-to-image | 63.0 | 84.1 | 89.2 |
| Flickr30K-CN | Text-to-image | 71.2 | 91.4 | 95.5 |
| Flickr30K-CN | Image-to-text | 81.6 | 97.5 | 98.8 |
| COCO-CN | Text-to-image | 69.2 | 89.9 | 96.1 |
| COCO-CN | Image-to-text | 63.0 | 86.6 | 92.9 |
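For readers unfamiliar with the R@K metric used above, the sketch below shows the usual definition of Recall@K: the fraction of queries for which at least one correct item appears among the top K ranked candidates. The function name, the similarity matrix, and the ground-truth sets are hypothetical. The evaluation protocol behind the reported numbers is not described here, so treat this as a generic illustration only.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct item appears in the top-k results.

    similarity[i][j] is the score of query i against candidate j;
    ground_truth[i] is the set of candidate indices relevant to query i.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest score
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if ground_truth[query_idx] & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)

# Toy scores for 3 text queries against 4 candidate images (made-up values)
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # query 0: correct candidate 0 is ranked 1st
    [0.4, 0.2, 0.8, 0.5],  # query 1: correct candidate 1 is ranked 4th
    [0.6, 0.1, 0.5, 0.2],  # query 2: correct candidate 2 is ranked 2nd
]
ground_truth = [{0}, {1}, {2}]

r1 = recall_at_k(similarity, ground_truth, 1)
r2 = recall_at_k(similarity, ground_truth, 2)
r4 = recall_at_k(similarity, ground_truth, 4)
```

Benchmarks typically report this value multiplied by 100, so a recall of 0.63 at K=1 would be written as R@1 of 63.0.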
In zero-shot image classification across ten datasets (CIFAR10, CIFAR100, DTD, EuroSAT, FER, FGVC, KITTI, MNIST, PC, VOC), the model achieves 96.0% on CIFAR10 and 79.7% on CIFAR100, with competitive results on the remaining tasks.
Architecture and Training
The model is one of five Chinese CLIP variants ranging from 77 million to 958 million parameters. It was trained on publicly available Chinese image-text datasets and evaluated on the ELEVATER benchmark for zero-shot classification. Further details are available in the technical report (arXiv:2211.01335).
best for
- Chinese image-text retrieval (e.g., search products by Chinese description)
- Zero-shot classification of Chinese images (e.g., categorize photos with Chinese labels)
- Cross-modal similarity scoring for Chinese content moderation or recommendation
FAQ
The model excels at zero-shot image-text retrieval and classification with Chinese-language queries, such as searching images with Chinese text or classifying images without fine-tuning.
The model accepts images (as PIL images or image files) and Chinese text strings. The processor handles image resizing and text tokenization.
To call the model, use the OpenAI-compatible endpoint with your API key. Send a request containing an image URL or a base64-encoded image, plus a list of Chinese text prompts, to get similarity scores.
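The request schema is not documented on this page, so the following is only a sketch of how such a request body might be assembled. The field names `image` and `texts` and the helper `build_similarity_request` are hypothetical. Sending the request is omitted because the endpoint URL is not given here.

```python
import base64
import json

def build_similarity_request(image_bytes, texts,
                             model="OFA-Sys/chinese-clip-vit-base-patch16"):
    """Build a JSON request body for image-text similarity.

    The "image" and "texts" field names are assumptions, not a documented schema.
    """
    if not texts:
        raise ValueError("at least one text prompt is required")
    # Base64-encode the raw image bytes so they can travel inside JSON
    encoded_image = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": model,
        "image": encoded_image,  # hypothetical field name
        "texts": list(texts),    # hypothetical field name
    }
    # ensure_ascii=False keeps Chinese prompts readable in the serialized body
    return json.dumps(payload, ensure_ascii=False)

# Placeholder bytes stand in for a real image file's contents
body = build_similarity_request(b"placeholder image bytes", ["一只猫", "一只狗"])
```

Check the provider's API reference for the actual field names and authentication header before relying on this shape.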
The model card also reports fine-tuning results on MUGE, Flickr30K-CN, and COCO-CN, and the paper describes a two-stage pretraining method.
The paper introduces five model sizes from 77M to 958M parameters. The base version (ViT-B/16) is one of the smaller sizes, offering a balance of speed and accuracy.
We're benchmarking and onboarding Chinese CLIP ViT-Base-Patch16 as a hosted, OpenAI-compatible API.