Arctic Embed M Long
Snowflake/snowflake-arctic-embed-m-long
published Apr 2024 · updated Dec 2024
Arctic Embed M Long is a text embedding model optimized for long-context retrieval, supporting up to 2048 tokens natively, or up to 8192 tokens with rotary position embeddings (RPE).
specs
| Spec | Value |
|---|---|
| Task | Text Embedding / Retrieval |
| Architecture | Based on nomic-embed-text-v1-unsupervised (BERT-style) |
| Parameters | 137 million |
| License | Apache 2.0 |
| Embedding Dimension | 768 |
| Context Length | Up to 2048 tokens (8192 with RPE) |
about this model
Model Architecture and Training
Based on the nomic-embed-text-v1-unsupervised architecture, this 137-million-parameter model produces 768-dimensional embeddings. It was trained in a multi-stage pipeline: first, pretraining on approximately 400 million query-document pairs with in-batch negatives, then fine-tuning on roughly 1 million triplets (query, positive document, hard negative document) obtained through hard negative mining. The training methodology is detailed in the Arctic-Embed technical report.
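To make the in-batch negative idea concrete, here is a minimal sketch of a contrastive (InfoNCE-style) loss in plain Python. It is a generic illustration, not the training code from the technical report: the function names and the `temperature` value are assumptions chosen for the example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def in_batch_contrastive_loss(query_vecs, doc_vecs, temperature=0.02):
    """Mean InfoNCE loss over a batch.

    doc_vecs[i] is the positive document for query_vecs[i]; every other
    document in the batch is treated as a negative for that query.
    The temperature is an assumed value, not one taken from the report.
    """
    queries = [normalize(q) for q in query_vecs]
    docs = [normalize(d) for d in doc_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching document as the target class.
        total += log_denominator - logits[i]
    return total / len(queries)
```

The loss is small when each query is most similar to its own document and grows when another document in the batch scores higher. A mined hard negative could be handled the same way, as one more candidate document per query.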
Retrieval Performance
On the MTEB Retrieval benchmark (NDCG@10), snowflake-arctic-embed-m-long achieves a score of 54.83, outperforming comparable models:
| Model | MTEB Retrieval Score (NDCG@10) |
|---|---|
| snowflake-arctic-embed-m-long | 54.83 |
| nomic-embed-text-v1.5 | 53.01 |
| nomic-embed-text-v1 | 52.81 |
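For readers unfamiliar with the metric, the sketch below computes NDCG@10 for a single query. It assumes linear gain (the relevance label divided by a log2 rank discount); some formulations use an exponential gain instead. The function names are illustrative. A benchmark score like the ones in the table is an average of per-query values like this one, reported on a 0-100 scale.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the returned documents, in ranked order.
    all_relevances: labels of every judged document for the query; if omitted,
    the ideal ranking is built from the returned documents only.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents, so the score is defined as zero
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

For example, a single relevant document returned at rank 2 instead of rank 1 scores `1 / log2(3)`, roughly 0.63, rather than 1.0.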
Key Strengths
- Extended context window: supports 2048 tokens natively and up to 8192 tokens with RPE, making it suitable for long-document retrieval workloads.
- Competitive accuracy: delivers retrieval quality close to the larger snowflake-arctic-embed-l model (54.83 vs. 55.98 NDCG@10 on MTEB Retrieval) while using fewer parameters.
- Open-source: released under the Apache-2.0 license with weights available for inspection and use.
Supported Formats
The model is available in ONNX, Safetensors, and Transformers.js formats, and is compatible with the Text Embeddings Inference framework.
best for
- Long-document semantic search
- Retrieval-Augmented Generation (RAG) with extended context
- Embedding large passages for similarity matching
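As a rough illustration of the similarity-matching use case, the sketch below ranks passage embeddings against a query embedding by cosine similarity. The toy 3-dimensional vectors stand in for the model's 768-dimensional output, and `rank_passages` is a hypothetical helper name, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (passage_index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real 768-dimensional embeddings.
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_passages(query, passages)
```

In a real pipeline the vectors would come from the model, and a vector index would typically replace the linear scan once the corpus grows.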
FAQ
Arctic Embed M Long is best for retrieval tasks that require embedding long documents or passages; it supports up to 2048 tokens natively and up to 8192 tokens with rotary position embeddings.
Compared with a 512-token, 110M-parameter counterpart, it has the same embedding dimension (768) but supports a longer context (up to 2048 tokens natively, or 8192 with RPE) and has 137M parameters.
The model is released under the Apache 2.0 license.
To call the model, send your texts to the OpenAI-compatible embeddings endpoint, authenticating with your API key.
It achieves a score of 54.83 NDCG@10 on the MTEB Retrieval benchmark.
We're benchmarking and onboarding Arctic Embed M Long as a hosted, OpenAI-compatible API.
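Once a hosted endpoint exists, a request in the OpenAI embeddings format might look like the sketch below. The base URL is a placeholder and the model identifier is assumed to match the repository name; neither is confirmed by this page. The request shape (a `model` and `input` JSON body with a Bearer token) follows the OpenAI embeddings convention.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"  # hypothetical placeholder, not a real endpoint
MODEL_ID = "Snowflake/snowflake-arctic-embed-m-long"  # assumed model identifier


def build_embeddings_request(texts, api_key, base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request in the OpenAI embeddings format without sending it."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_embeddings(texts, api_key):
    """Send the request and return one embedding per input text, in input order."""
    request = build_embeddings_request(texts, api_key)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry one object per input under "data".
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Separating request construction from sending makes it possible to inspect the payload before any network call. Swap in the real base URL and model name once they are published.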