CLIP ConvNeXt-Large D 320
laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup
published Feb 2023 · updated Jan 2025
CLIP ConvNeXt-Large D 320 is a zero-shot image classification model that pairs a ConvNeXt-Large image tower (with an MLP head) with a deeper text tower. It was fine-tuned at 320x320 resolution, and the released weights are a model-soup average of three fine-tunes, which improves zero-shot classification accuracy.
specs
| Spec | Value |
|---|---|
| Task | Zero-shot image classification and image-text retrieval |
| Architecture | ConvNeXt-Large image tower (with MLP head) + deeper text tower (depth 16, embed dim 768) |
| Parameters | ~200M (ConvNeXt-Large base); 1.22x fewer than ViT-L/14 |
| License | MIT |
about this model
laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup is a CLIP (contrastive language-image pretraining) model for zero-shot image classification. It uses a ConvNeXt-Large vision backbone with an MLP head and a deeper text tower (depth 16, embed dim 768). The released weights are a weight-averaged soup of three fine-tunes of the base 256x256 ConvNeXt-Large-D model. Each fine-tune ran at 320x320 resolution on LAION-2B for an additional 2–3 billion samples, with a different learning rate per run (1e-4, 6e-5, 5e-5).
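To make the "soup" idea concrete, here is a minimal sketch of weight averaging across checkpoints. It assumes a uniform average (the description above does not state the mixing weights), and the parameter names and values are toy stand-ins, not the real checkpoints.

```python
from typing import Dict, List

# A toy "state dict": parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def uniform_soup(checkpoints: List[StateDict]) -> StateDict:
    """Average every parameter element-wise across checkpoints.

    Assumption: all checkpoints are weighted equally.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    keys = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != keys:
            raise ValueError("checkpoints must share the same parameter names")
    n = len(checkpoints)
    soup: StateDict = {}
    for name in keys:
        # zip(*...) groups the i-th element of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        soup[name] = [sum(values) / n for values in columns]
    return soup


# Hypothetical stand-ins for the three fine-tunes (one per learning rate).
ft_a = {"head.weight": [1.0, 2.0], "head.bias": [0.0]}
ft_b = {"head.weight": [2.0, 4.0], "head.bias": [3.0]}
ft_c = {"head.weight": [3.0, 6.0], "head.bias": [6.0]}
soup = uniform_soup([ft_a, ft_b, ft_c])
print(soup)
```

Because the fine-tunes start from the same base model, their weights live in a shared parameter space, which is what makes element-wise averaging meaningful.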
The model achieves 76.9% top-1 zero-shot accuracy on ImageNet-1k, improving over the single fine-tune (76.6%) and the base 256x256 model (75.9%). At 320x320 resolution, it is significantly more efficient than OpenAI's ViT-L/14 at 336x336: 2.5x fewer GMACs, 2.8x fewer activations, and 1.22x fewer parameters.
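As an illustration of how zero-shot classification works with a CLIP-style model, the sketch below scores one image embedding against one text embedding per class prompt, using cosine similarity followed by a softmax. The embeddings are hand-written toy vectors rather than model outputs, and the `logit_scale` of 100.0 is an assumed value for illustration.

```python
import math
from typing import Dict, List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(
    image_emb: List[float],
    text_embs: Dict[str, List[float]],
    logit_scale: float = 100.0,  # assumed temperature, not taken from the model
) -> Dict[str, float]:
    """Return a probability per class prompt for a single image embedding."""
    img = normalize(image_emb)
    logits = {
        label: logit_scale * sum(a * b for a, b in zip(img, normalize(txt)))
        for label, txt in text_embs.items()
    }
    # Numerically stable softmax over the class logits.
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy embeddings: in practice these come from the image and text towers.
image_emb = [0.9, 0.1, 0.0]
text_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
probs = zero_shot_probs(image_emb, text_embs)
prediction = max(probs, key=probs.get)
print(prediction)
```

The class list is just a dictionary of prompts, so swapping in a custom taxonomy requires no retraining.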
Architecture and Training
- Vision tower: ConvNeXt-Large (timm convnext_large) with an MLP head (fc–gelu–drop–fc) instead of a single projection.
- Text tower: 4 layers deeper than ViT-L / RN50x16 models (depth 16, embed dim 768).
- Training data: LAION-2B (English subset of LAION-5B, 2.32B image-text pairs).
- Augmentation: Random Resize Crop (0.5, 1.0), Random Erasing (prob 0.4), Stochastic Depth (prob 0.1) on image tower, no dropout on head.
- Batch size: 131,072; trained on 64 nodes of 8×A100 40GB GPUs (Stability AI cluster).
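The fc–gelu–drop–fc head named in the list above can be sketched in plain Python as follows. The layer sizes and weights are made up for illustration. The dropout probability defaults to 0.0 to match the "no dropout on head" note; this is a sketch of the layer ordering, not the actual timm implementation.

```python
import math
import random
from typing import List, Optional

Matrix = List[List[float]]


def linear(x: List[float], weight: Matrix, bias: List[float]) -> List[float]:
    """Fully connected layer; weight has shape (out_features, in_features)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(x: float) -> float:
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dropout(x: List[float], p: float, training: bool, rng: random.Random) -> List[float]:
    """Inverted dropout: a no-op at inference time or when p is 0."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


def mlp_head(
    x: List[float],
    w1: Matrix, b1: List[float],
    w2: Matrix, b2: List[float],
    drop_p: float = 0.0,
    training: bool = False,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """fc -> gelu -> drop -> fc, in that order."""
    rng = rng or random.Random(0)
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    hidden = dropout(hidden, drop_p, training, rng)
    return linear(hidden, w2, b2)


# Toy sizes: 2 input features -> 3 hidden units -> 1 output.
x = [1.0, -1.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0]]
b2 = [0.0]
out = mlp_head(x, w1, b1, w2, b2)
print(out)
```

Compared with a single linear projection, the extra nonlinearity gives the head more capacity when mapping backbone features into the shared image-text embedding space.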
Benchmark Results
| Model | Resolution | ImageNet Zero-Shot Top-1 (%) |
|---|---|---|
| convnext_large_d (base) | 256x256 | 75.9 |
| convnext_large_d_320 (single ft) | 320x320 | 76.6 |
| convnext_large_d_320 (soup) | 320x320 | 76.9 |
Additional benchmarks across the VTAB+ suite (VTAB plus robustness datasets) and COCO/Flickr retrieval are available in the LAION CLIP Benchmark repository.
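For readers unfamiliar with the metric, the sketch below shows how top-1 accuracy is computed from per-class scores, then derives the gains implied by the table above. The score matrix and labels are toy values; only the three reported percentages come from the table.

```python
from typing import List


def top1_accuracy(scores: List[List[float]], targets: List[int]) -> float:
    """Percentage of samples whose highest-scoring class equals the target."""
    correct = 0
    for row, target in zip(scores, targets):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == target)
    return 100.0 * correct / len(targets)


# Toy example: 4 samples, 3 classes (not real benchmark data).
scores = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
targets = [1, 0, 2, 1]
toy_accuracy = top1_accuracy(scores, targets)

# Gains implied by the reported ImageNet zero-shot numbers, in percentage points.
reported = {"base 256x256": 75.9, "single ft 320x320": 76.6, "soup 320x320": 76.9}
gain_over_base = round(reported["soup 320x320"] - reported["base 256x256"], 1)
gain_over_single = round(reported["soup 320x320"] - reported["single ft 320x320"], 1)
print(toy_accuracy, gain_over_base, gain_over_single)
```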
Licensed under MIT. This model is intended for research use; deployed applications require thorough domain-specific testing due to variability in zero-shot performance across class taxonomies.
best for
- Zero-shot image classification with custom class taxonomies
- Image-text retrieval (search images by text or vice versa)
- Fine-tuning for downstream image classification tasks
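The retrieval use case can be sketched as a nearest-neighbour search in the shared embedding space: embed the text query, then rank gallery images by cosine similarity. The file names and vectors below are hypothetical placeholders for embeddings that would normally come from the two towers.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_emb: List[float], gallery: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the names of the k gallery items most similar to the query."""
    ranked = sorted(
        gallery.items(),
        key=lambda item: cosine(query_emb, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical precomputed image embeddings.
gallery = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "city.jpg": [0.1, 0.9, 0.1],
    "forest.jpg": [0.0, 0.2, 0.9],
}
# Hypothetical text embedding for a query such as "sunny shoreline".
query_emb = [1.0, 0.2, 0.0]
top_matches = retrieve(query_emb, gallery, k=2)
print(top_matches)
```

Searching in the other direction (image to text) uses the same function with an image embedding as the query and text embeddings as the gallery.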
FAQ
The model accepts images (URL or base64) and text prompts via the OpenAI-compatible API endpoint. Use an API key for authentication.
ConvNeXt-Large D 320 uses 2.5x fewer GMACs, 2.8x fewer activations, and 1.22x fewer parameters than ViT-L/14 at 336x336, while achieving competitive zero-shot accuracy.
The model is released under the MIT license, as indicated on its Hugging Face page.
CLIP ConvNeXt-Large D 320 achieves 76.9% top-1 zero-shot accuracy on ImageNet-1k.
The model card recommends against any deployed use without thorough in-domain testing, but the MIT license permits use subject to its terms. Always evaluate safety and bias for your specific use case.
We're benchmarking and onboarding CLIP ConvNeXt-Large D 320 as a hosted, OpenAI-compatible API.