CLIP ConvNeXt-Large D 320

laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup

published Feb 2023 · updated Jan 2025

CLIP ConvNeXt-Large D 320 is a zero-shot image classification model that pairs a ConvNeXt-Large image tower (with an MLP head) with a deeper text tower. It was fine-tuned at 320x320 resolution, and the released weights are a model soup (weight average) of those fine-tunes, which improves zero-shot classification accuracy.

status

coming soon

API providers

downloads / mo

118.7K

license

MIT

specs

Task	Zero-shot image classification and image-text retrieval
Architecture	ConvNeXt-Large image tower (with MLP head) + deeper text tower (depth 16, embed dim 768)
Parameters	~200M (ConvNeXt-Large image backbone); the model card reports 1.22x fewer parameters than ViT-L/14 at 336x336
License	MIT

about this model

laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup is a CLIP (contrastive language-image pretraining) model for zero-shot image classification. It pairs a ConvNeXt-Large vision backbone with an MLP head and a deeper text tower (depth 16, embed dim 768). The released weights are a weight-averaged soup of three fine-tunes of the base 256x256 ConvNeXt-Large-D model; each fine-tune ran at 320x320 resolution on LAION-2B for an additional 2–3 billion samples, with a different learning rate per run (1e-4, 6e-5, 5e-5).
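The weight-averaging step can be illustrated with a minimal sketch. It assumes a uniform (equal-weight) average, which this page does not confirm, and it uses plain Python lists in place of real tensors. The function name `uniform_soup` and the toy checkpoints are hypothetical and only stand in for the three fine-tunes.

```python
def uniform_soup(state_dicts):
    """Average parameters element-wise across checkpoints.

    Assumption: every checkpoint contributes equally. Each state dict maps
    a parameter name to a flat list of floats (a stand-in for a tensor).
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    names = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")

    count = len(state_dicts)
    soup = {}
    for name in names:
        params = [sd[name] for sd in state_dicts]
        if len({len(p) for p in params}) != 1:
            raise ValueError(f"shape mismatch for {name}")
        # zip(*params) walks the same position in every checkpoint.
        soup[name] = [sum(values) / count for values in zip(*params)]
    return soup


# Toy checkpoints standing in for the three fine-tunes (hypothetical values).
ft_a = {"head.weight": [1.0, 2.0], "head.bias": [0.0]}
ft_b = {"head.weight": [2.0, 4.0], "head.bias": [3.0]}
ft_c = {"head.weight": [3.0, 6.0], "head.bias": [6.0]}

soup = uniform_soup([ft_a, ft_b, ft_c])
```

Averaging is only meaningful here because all three fine-tunes start from the same base checkpoint and share one architecture, so their parameters line up name by name.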

The model achieves 76.9% top-1 zero-shot accuracy on ImageNet-1k, improving over the single fine-tune (76.6%) and the base 256x256 model (75.9%). At 320x320 resolution, it is significantly more efficient than OpenAI's ViT-L/14 at 336x336: 2.5x fewer GMACs, 2.8x fewer activations, and 1.22x fewer parameters.

Architecture and Training

Vision tower: ConvNeXt-Large (timm convnext_large) with an MLP head (fc–gelu–drop–fc) instead of a single projection.
Text tower: 4 layers deeper than ViT-L / RN50x16 models (depth 16, embed dim 768).
Training data: LAION-2B (English subset of LAION-5B, 2.32B image-text pairs).
Augmentation: Random Resize Crop (0.5, 1.0), Random Erasing (prob 0.4), Stochastic Depth (prob 0.1) on image tower, no dropout on head.
Batch size: 131,072; trained on 64 nodes of 8×A100 40GB GPUs (Stability AI cluster).
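The fc–gelu–drop–fc head described above can be sketched as a forward pass in plain Python. This is an illustration of the layer order only, not the timm or OpenCLIP implementation; the function names, the tiny dimensions, and the weight values are all hypothetical. Dropout defaults to 0.0 to match the "no dropout on head" setting listed above.

```python
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Fully connected layer; weight holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_head(x, w1, b1, w2, b2, drop_prob=0.0, training=False, rng=None):
    """fc -> gelu -> drop -> fc, replacing a single linear projection."""
    hidden = [gelu(v) for v in linear(x, w1, b1)]
    if training and drop_prob > 0.0:
        rng = rng or random.Random()
        keep = 1.0 - drop_prob
        # Inverted dropout: rescale kept units so the expectation is unchanged.
        hidden = [v / keep if rng.random() < keep else 0.0 for v in hidden]
    return linear(hidden, w2, b2)


# Hypothetical 2 -> 2 -> 1 head with identity first layer, for illustration.
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 1.0]], [0.0]
embedding = mlp_head([1.0, 2.0], w1, b1, w2, b2)
```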

Benchmark Results

Model	Resolution	ImageNet Zero-Shot Top-1 (%)
convnext_large_d (base)	256x256	75.9
convnext_large_d_320 (single ft)	320x320	76.6
convnext_large_d_320 (soup)	320x320	76.9
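The gains in the table can be checked with a few lines of arithmetic. The sketch below copies the three rows above into a dictionary (the keys `base`, `single_ft`, and `soup` are shorthand introduced here) and also computes how many more input pixels a 320x320 image has than a 256x256 one.

```python
# resolution (px per side) and ImageNet-1k zero-shot top-1 (%), from the table
RESULTS = {
    "base": (256, 75.9),
    "single_ft": (320, 76.6),
    "soup": (320, 76.9),
}


def gain(model, reference):
    """Top-1 difference in percentage points, rounded to one decimal."""
    return round(RESULTS[model][1] - RESULTS[reference][1], 1)


def pixel_ratio(model, reference):
    """Ratio of input pixel counts between two models."""
    side_a, side_b = RESULTS[model][0], RESULTS[reference][0]
    return (side_a * side_a) / (side_b * side_b)


soup_vs_base = gain("soup", "base")          # 1.0 point over the 256x256 base
soup_vs_single = gain("soup", "single_ft")   # 0.3 point over a single fine-tune
single_vs_base = gain("single_ft", "base")   # 0.7 point from fine-tuning at 320x320
pixels = pixel_ratio("soup", "base")         # 1.5625x more input pixels
```

Most of the improvement comes from fine-tuning at the higher resolution; the soup adds a further 0.3 points on top of a single fine-tune at no extra inference cost.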

Additional benchmarks across the VTAB+ suite (VTAB plus robustness datasets) and COCO/Flickr retrieval are available in the LAION CLIP Benchmark repository.

Licensed under MIT. This model is intended for research use; deployed applications require thorough domain-specific testing due to variability in zero-shot performance across class taxonomies.

best for

·Zero-shot image classification with custom class taxonomies
·Image-text retrieval (search images by text or vice versa)
·Fine-tuning for downstream image classification tasks
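
As a rough sketch of the first use case, the scoring step of zero-shot classification is a softmax over scaled cosine similarities between one image embedding and one text embedding per class. The helper names below are hypothetical, the `logit_scale` of 100 and the "a photo of a ..." prompt template are assumptions, and `classify_image` assumes the `open_clip_torch`, `torch`, and `Pillow` packages are installed and that the weights can be downloaded from the Hugging Face Hub. It runs locally and does not use the hosted API.

```python
import math

MODEL_ID = "hf-hub:laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft-soup"


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Class probabilities from cosine similarity (logit_scale is assumed)."""
    image = l2_normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(image, l2_normalize(text)))
        for text in text_embs
    ]
    return softmax(logits)


def classify_image(path, labels):
    """Embed one image and one prompt per label, then score them.

    Heavy imports are kept inside the function so the helpers above
    can be used without them. Downloads weights on first call.
    """
    import open_clip
    import torch
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
    tokenizer = open_clip.get_tokenizer(MODEL_ID)
    model.eval()

    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    text = tokenizer([f"a photo of a {label}" for label in labels])
    with torch.no_grad():
        image_emb = model.encode_image(image)[0].tolist()
        text_embs = model.encode_text(text).tolist()
    return dict(zip(labels, zero_shot_probs(image_emb, text_embs)))
```

Calling `classify_image("photo.jpg", ["cat", "dog", "bird"])` would return a dictionary of label probabilities; the file name and labels are placeholders. For image-text retrieval, the same cosine similarities can be used to rank candidates instead of being passed through a softmax.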

FAQ

What is the input format for this model on gigarouter?

Once the model is live, it will accept images (URL or base64) and text prompts through the OpenAI-compatible API endpoint. Requests are authenticated with an API key.

How does this model compare to OpenAI's ViT-L/14 in efficiency?

ConvNeXt-Large D 320 uses 2.5x fewer GMACs, 2.8x fewer activations, and 1.22x fewer parameters than ViT-L/14 at 336x336, while achieving competitive zero-shot accuracy.

What is the license for this model?

The model is released under the MIT license, as indicated on its Hugging Face page.

What is the model's zero-shot accuracy on ImageNet?

It achieves 76.9% top-1 zero-shot accuracy on ImageNet-1k.

Can I use this model for commercial applications?

The MIT license permits commercial use subject to its terms. The model card, however, recommends against any deployed use without thorough in-domain testing, so evaluate accuracy, safety, and bias for your specific use case first.

not yet live

We're benchmarking and onboarding CLIP ConvNeXt-Large D 320 as a hosted, OpenAI-compatible API.

related zero-shot image models

clip-vit-base-patch32

22.3M dl/mo

clip-vit-large-patch14

12.4M dl/mo

CLIP-ViT-B-32-laion2B-s34B-b79K

4M dl/mo

clip-vit-large-patch14-336