Chinese CLIP ViT-Base-Patch16

OFA-Sys/chinese-clip-vit-base-patch16

published Nov 2022 · updated Dec 2022

Chinese CLIP ViT-Base-Patch16 is a zero-shot image-text model that computes image and text embeddings, and similarity scores between them, for Chinese-language content. It uses a ViT-B/16 image encoder and a RoBERTa-wwm-base text encoder.

status: coming soon
API providers: 0
downloads / mo: 199.7K

specs

Task: Zero-shot image-text retrieval, zero-shot image classification, cross-modal similarity computation
Architecture: ViT-B/16 image encoder + RoBERTa-wwm-base text encoder
Training Data: ~200 million Chinese image-text pairs

about this model

Chinese-CLIP-ViT-Base-Patch16 is a zero-shot image-text model that computes cross-modal embeddings and similarity scores for Chinese-language content, using ViT-B/16 as the image encoder and RoBERTa-wwm-base as the text encoder. It was trained on a large-scale dataset of approximately 200 million Chinese image-text pairs using a two-stage pretraining method: first training with the image encoder frozen, then optimizing all parameters jointly.

Key Capabilities

The model supports zero-shot image classification, text-to-image retrieval, and image-to-text retrieval without task-specific fine-tuning. It can compute normalized image and text features and produce similarity scores between visual and textual inputs.
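To make "normalized features" and "similarity scores" concrete, here is a minimal pure-Python sketch of the usual CLIP-style computation: L2-normalize both embeddings, take the dot product (cosine similarity), scale it, and apply a softmax over the candidate texts. The three-dimensional vectors are made up for illustration, and `logit_scale=100.0` is an assumption rather than a value taken from this model's configuration. Real embeddings come from the image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_scores(image_vec, text_vecs, logit_scale=100.0):
    """Return softmax probabilities over the text candidates for one image.

    logit_scale is an assumed temperature; the real model learns its own value.
    """
    img = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        txt = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings, invented for this example only.
image_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "猫": [1.0, 0.0, 0.0],
    "狗": [0.0, 1.0, 0.0],
    "汽车": [0.0, 0.0, 1.0],
}

probs = similarity_scores(image_embedding, list(label_embeddings.values()))
best_label = max(zip(label_embeddings, probs), key=lambda pair: pair[1])[0]
print(best_label)  # the label whose embedding is closest to the image embedding
```

Zero-shot classification amounts to this ranking with one text prompt per class. Text-to-image retrieval uses the same similarity with the roles reversed: one text query scored against many image embeddings.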

Benchmark Performance

On the MUGE text-to-image retrieval benchmark, the model achieves zero-shot R@1 of 63.0, R@5 of 84.1, and R@10 of 89.2. On Flickr30K-CN, zero-shot text-to-image retrieval yields R@1 of 71.2, R@5 of 91.4, and R@10 of 95.5; image-to-text retrieval achieves R@1 of 81.6, R@5 of 97.5, and R@10 of 98.8. On COCO-CN, zero-shot text-to-image retrieval reaches R@1 of 69.2, R@5 of 89.9, and R@10 of 96.1; image-to-text retrieval achieves R@1 of 63.0, R@5 of 86.6, and R@10 of 92.9.
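For readers unfamiliar with the R@K notation, the sketch below shows the common definition of Recall@K for retrieval: a query counts as a hit if at least one ground-truth match appears among its top K ranked results, and the score is the percentage of queries that hit. The query IDs, image IDs, and rankings are hypothetical, and the exact evaluation protocol behind the numbers above is not described on this page, so treat this as an illustration of the metric only.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant item is in the top-k results, else 0.0."""
    top_k = ranked_ids[:k]
    return 1.0 if any(item in relevant_ids for item in top_k) else 0.0


def mean_recall_at_k(rankings, ground_truth, k):
    """Average recall_at_k over all queries, expressed as a percentage."""
    hits = [recall_at_k(rankings[q], ground_truth[q], k) for q in rankings]
    return 100.0 * sum(hits) / len(hits)


# Hypothetical ranked results for three text queries (best match first).
rankings = {
    "q1": ["img_3", "img_1", "img_7"],
    "q2": ["img_2", "img_9", "img_4"],
    "q3": ["img_5", "img_8", "img_6"],
}
# Hypothetical ground truth: the set of correct images for each query.
ground_truth = {"q1": {"img_1"}, "q2": {"img_2"}, "q3": {"img_0"}}

r_at_1 = mean_recall_at_k(rankings, ground_truth, k=1)  # only q2 hits at rank 1
r_at_3 = mean_recall_at_k(rankings, ground_truth, k=3)  # q1 and q2 hit within 3
print(f"R@1 = {r_at_1:.1f}, R@3 = {r_at_3:.1f}")
```

Because a larger K can only add hits, R@1 ≤ R@5 ≤ R@10 for any fixed set of rankings, which is the pattern visible in the reported figures.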

In zero-shot image classification across ten datasets (CIFAR10, CIFAR100, DTD, EuroSAT, FER, FGVC, KITTI, MNIST, PC, VOC), the model achieves 96.0% on CIFAR10 and 79.7% on CIFAR100, with competitive results on the remaining tasks.

Architecture and Training

The model is one of five Chinese CLIP variants ranging from 77 million to 958 million parameters. It was trained on publicly available Chinese image-text datasets and evaluated on the ELEVATER benchmark for zero-shot classification. Further details are available in the technical report (arXiv:2211.01335).

FAQ

What is Chinese CLIP ViT-Base-Patch16 best used for?

It excels at zero-shot image-text retrieval and classification with Chinese language queries, such as searching images using Chinese text or classifying images without fine-tuning.

What input formats does the model expect?

It accepts images (as PIL image objects or image files) and Chinese text strings. The processor handles image resizing and text tokenization, so both modalities can be passed in raw form.
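As a rough illustration of those inputs, the sketch below scores one image file against a list of Chinese prompts locally. It assumes the Hugging Face `transformers` library (its `ChineseCLIPModel` and `ChineseCLIPProcessor` classes) plus `torch` and `Pillow` are installed. The helper name `score_image_against_texts` and the path `cat.jpg` are hypothetical, and this is not the hosted gigarouter API.

```python
MODEL_ID = "OFA-Sys/chinese-clip-vit-base-patch16"


def score_image_against_texts(image_path, texts, model_id=MODEL_ID):
    """Return one probability per text prompt for a single image file."""
    # Imported lazily so the heavy dependencies load only when the function runs.
    import torch
    from PIL import Image
    from transformers import ChineseCLIPModel, ChineseCLIPProcessor

    model = ChineseCLIPModel.from_pretrained(model_id)
    processor = ChineseCLIPProcessor.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    # The processor resizes and normalizes the image and tokenizes the text.
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image has shape (num_images, num_texts); softmax over the texts.
    probs = outputs.logits_per_image.softmax(dim=1)
    return probs[0].tolist()


# Example call (downloads the weights on first use; "cat.jpg" is a placeholder):
# labels = ["一只猫", "一只狗", "一辆汽车"]
# for label, p in zip(labels, score_image_against_texts("cat.jpg", labels)):
#     print(f"{label}: {p:.3f}")
```

Loading the model inside the function keeps the sketch short. For repeated calls, load the model and processor once and reuse them.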

How can I call this model via the gigarouter API?

The model is not yet live. Once it is hosted, use the OpenAI-compatible endpoint with your API key: send a request containing an image (URL or base64-encoded) and a list of Chinese text prompts to get similarity scores.

Does the model support fine-tuning?

Yes. The model card reports fine-tuned results on MUGE, Flickr30K-CN, and COCO-CN. The two-stage method described in the paper applies to pretraining, not to fine-tuning.

What is the size of this model compared to other Chinese CLIP variants?

The paper introduces five model sizes from 77M to 958M parameters. The base version (ViT-B/16) is one of the smaller sizes, offering a balance of speed and accuracy.

not yet live

We're benchmarking and onboarding Chinese CLIP ViT-Base-Patch16 as a hosted, OpenAI-compatible API.
