ModernBERT Base
answerdotai/ModernBERT-base
published Dec 2024 · updated Jan 2025
ModernBERT Base is a fill-mask model that modernizes the BERT architecture with Rotary Positional Embeddings, local-global alternating attention, and a native 8,192 token context length, pre-trained on 2 trillion tokens of English and code data.
specs
| Task | Fill-Mask |
| Architecture | Encoder-only Transformer with RoPE, Local-Global Alternating Attention, GeGLU activations |
| Parameters | 149 million |
| Context Length | 8,192 tokens |
| License | Apache 2.0 |
about this model
Key Strengths
ModernBERT-base (149M parameters) achieves strong results across natural language understanding, general retrieval, long-context retrieval, and code retrieval tasks. On GLUE, it scores 88.4, surpassing similarly-sized encoder models. For general retrieval (BEIR, DPR setting), it achieves 41.6, and for code retrieval, it sets new state-of-the-art results on CodeSearchNet (56.4) and StackQA (73.6). In multi-vector retrieval (ColBERT setting) on long-context out-of-domain data (MLDR_OOD), it reaches 80.2, significantly outperforming prior models.
Evaluation Results
| Task | Metric | ModernBERT-base |
|---|---|---|
| GLUE | NLU | 88.4 |
| BEIR (DPR) | Retrieval | 41.6 |
| MLDR_ID (DPR) | Long-context retrieval | 44.0 |
| CodeSearchNet | Code retrieval | 56.4 |
| StackQA | Code retrieval | 73.6 |
| BEIR (ColBERT) | Multi-vector retrieval | 51.3 |
| MLDR_OOD (ColBERT) | Long-context multi-vector retrieval | 80.2 |
Architecture and Training
The model uses a Pre-Norm Transformer with GeGLU activations, was pre-trained up to 1,024 tokens then extended to 8,192 tokens, and trained on 8x H100 GPUs. Training data is primarily English and code, so performance may be lower for other languages. The model is released under Apache 2.0 license.
best for
- ·Semantic search and retrieval over long documents
- ·Text classification and natural language understanding
- ·Code retrieval and hybrid text-code search
- ·Fill-mask for masked language modeling tasks
FAQ
It excels at retrieval, classification, and semantic search, especially on long documents and code, due to its 8,192 token context and training on text and code.
ModernBERT Base has 149M parameters (similar to BERT-base) but is more memory and speed efficient thanks to Flash Attention, unpadding, and modern architectural choices.
It is released under the Apache 2.0 license.
Input is a text string containing a [MASK] token. Token type IDs are not used. Use the fill-mask pipeline or AutoModelForMaskedLM from Hugging Face transformers v4.48.0+.
Use the gigarouter OpenAI-compatible endpoint with your API key, specifying the model as "answerdotai/ModernBERT-base" and sending a prompt containing [MASK].
We're benchmarking and onboarding ModernBERT Base as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.