Nomic Embed Text V1.5

nomic-ai/nomic-embed-text-v1.5

published Feb 2024 · updated Apr 2026

Nomic Embed Text V1.5 is a text embedding model that uses Matryoshka Representation Learning to allow flexible dimensionality reduction while maintaining high performance.

est. price

~$0.008

/ 1M tokens · estimated, set at launch

downloads / mo

16.9M

license

apache-2.0

specs

Task	Text embedding
Architecture	BERT-based with Matryoshka Representation Learning
Context Length	8192 tokens
Dimensions	768 (native), adjustable from 64 to 768
MTEB Score	62.28
License	Apache 2.0

about this model

nomic-embed-text-v1.5 is a text embedding model that produces fixed-size vector representations of text. It supports a maximum sequence length of 8192 tokens and uses Matryoshka Representation Learning, which lets developers shrink the embedding size in exchange for a small reduction in performance.

Each input must begin with a task instruction prefix that indicates the embedding use case: search_document for indexing documents, search_query for user queries, clustering for grouping texts, and classification for feature extraction. The model natively supports scaling the sequence length past 2048 tokens, up to the 8192-token maximum.
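As a minimal sketch of the prefix requirement, the helper below prepends a task prefix to each input before it is sent for embedding. The function names are hypothetical, and the "prefix, colon, space" layout (for example "search_query: ...") is assumed from the model's published usage examples rather than stated on this page.

```python
from typing import List, Sequence

# The four task names listed in the model description.
TASK_PREFIXES = ("search_document", "search_query", "clustering", "classification")


def with_task_prefix(text: str, task: str) -> str:
    """Return `text` with the task instruction prefix prepended.

    Assumes the "<task>: <text>" layout (prefix, colon, single space).
    """
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task {task!r}; expected one of {TASK_PREFIXES}")
    return f"{task}: {text.strip()}"


def prepare_corpus(documents: Sequence[str]) -> List[str]:
    """Prefix every document for indexing."""
    return [with_task_prefix(doc, "search_document") for doc in documents]


def prepare_query(query: str) -> str:
    """Prefix a user query for retrieval against an indexed corpus."""
    return with_task_prefix(query, "search_query")
```

Documents and queries get different prefixes, so a retrieval pipeline would call prepare_corpus once at indexing time and prepare_query on every incoming search.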

Key Strengths

Matryoshka Representation Learning enables flexible dimensionality reduction from 768 down to 64 dimensions with a gradual loss in MTEB score (62.28 at 768 dimensions, 61.04 at 256, 56.10 at 64).
Maximum sequence length of 8192 tokens.
Fully reproducible, open-source, open-weights, and open-data model (Apache 2.0 license).

MTEB Performance by Dimension

Model	SeqLen	Dimension	MTEB
nomic-embed-text-v1	8192	768	62.39
nomic-embed-text-v1.5	8192	768	62.28
nomic-embed-text-v1.5	8192	512	61.96
nomic-embed-text-v1.5	8192	256	61.04
nomic-embed-text-v1.5	8192	128	59.34
nomic-embed-text-v1.5	8192	64	56.10

MTEB performance comparison across embedding dimensions for nomic-embed-text-v1.5
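To make the trade-off in the table concrete, the sketch below pairs each dimension with its raw storage cost and its MTEB loss relative to 768 dimensions. The scores are copied from the table; the float32 storage format and the function name are assumptions for illustration, and index overhead is ignored.

```python
# (dimension, MTEB score) pairs for nomic-embed-text-v1.5, copied from the table.
MTEB_BY_DIM = [(768, 62.28), (512, 61.96), (256, 61.04), (128, 59.34), (64, 56.10)]

BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as uncompressed float32


def tradeoff_rows(num_vectors):
    """Return (dim, storage in MB, MTEB points lost vs. 768 dims) per table row."""
    _, full_score = MTEB_BY_DIM[0]
    rows = []
    for dim, score in MTEB_BY_DIM:
        megabytes = num_vectors * dim * BYTES_PER_FLOAT32 / 1_000_000
        rows.append((dim, megabytes, round(full_score - score, 2)))
    return rows


if __name__ == "__main__":
    for dim, megabytes, loss in tradeoff_rows(1_000_000):
        print(f"{dim:>4} dims  {megabytes:>8.1f} MB  -{loss:.2f} MTEB")
```

For one million vectors this gives 3072 MB at 768 dimensions and 1024 MB at 256 dimensions, at a cost of 1.24 MTEB points.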

The model is trained using a multi-stage pipeline starting from a long-context BERT model. The first stage is unsupervised contrastive training on weakly related text pairs from sources such as StackExchange, Quora, Amazon reviews, and news article summaries. The second stage finetunes on higher-quality labeled datasets, including search queries and answers from web searches, with emphasis on data curation and hard-example mining.
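The page does not state the exact training objective, so the following is only an illustrative sketch of a common contrastive loss (InfoNCE with in-batch negatives), assumed here for explanation. Each query is pulled toward its paired text and pushed away from the other texts in the batch. The function names and the temperature value are hypothetical, not taken from the technical report.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss where documents[i] is the positive for queries[i].

    Every other document in the batch acts as a negative for that query.
    The temperature of 0.05 is an illustrative assumption.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, doc) / temperature for doc in documents]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the true pair.
        total += log_denominator - logits[i]
    return total / len(queries)
```

The loss is near zero when each query is most similar to its own pair, and grows when a different text in the batch scores higher. Hard-example mining adds deliberately confusable negatives so that the second case occurs more often during training.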

Training data and code are released in full under an Apache 2.0 license. For further details, see the technical report and the contrastors repository.

best for

· Building a retrieval-augmented generation (RAG) system with adjustable embedding size
· Clustering large text datasets with dimensionality reduction
· Multi-task classification and search using task-specific prefixes

FAQ

What task prefixes are required when using Nomic Embed Text V1.5?

The model requires a task instruction prefix such as search_document, search_query, clustering, or classification to indicate the intended use.

Can I change the embedding dimension to reduce storage or compute?

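Yes. You can reduce the dimension from 768 down to 64: apply layer normalization to the full embedding, keep the first N dimensions, and L2-normalize the result. MTEB drops from 62.28 at 768 dimensions to 61.04 at 256 and 56.10 at 64.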
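A dependency-free sketch of that truncation, applied to a single embedding held as a list of floats, might look like the following. The order of steps (layer normalization, slice, L2 normalization) follows the model's published usage examples as we understand them; the function names are hypothetical, and the layer normalization here has no learned scale or bias.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def truncate_embedding(vector, dim):
    """Shrink a full embedding to its first `dim` components.

    Steps: layer-normalize the full vector, keep the leading `dim` values,
    then rescale the result to unit L2 length.
    """
    if not 1 <= dim <= len(vector):
        raise ValueError(f"dim must be between 1 and {len(vector)}, got {dim}")
    head = layer_norm(vector)[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # degenerate case: nothing to rescale
    return [x / norm for x in head]
```

In practice the input would be the model's 768-dimensional output and dim one of the sizes in the MTEB table, such as 256. Documents and queries must be truncated to the same dimension before they are compared.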

What is the maximum sequence length the model supports?

The maximum sequence length is 8192 tokens. Lengths past 2048 tokens are handled through the model's dynamic RoPE scaling.

What license does Nomic Embed Text V1.5 use?

It is released under the Apache 2.0 license, allowing free use, modification, and distribution.

How do I call this model via the gigarouter API?

Once the hosted endpoint is live, call the OpenAI-compatible embeddings endpoint with your gigarouter API key, set the model name to nomic-ai/nomic-embed-text-v1.5, and prepend the appropriate task prefix to each input string.
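A rough sketch of such a request, using only the Python standard library, is shown below. It builds the request without sending it. The base URL is a hypothetical placeholder because this page does not give the real one, the function name is illustrative, and the body follows the standard OpenAI embeddings format ("model" plus "input").

```python
import json
import urllib.request

# Hypothetical placeholder: the real gigarouter base URL is not given on this page.
BASE_URL = "https://api.example.com/v1"
MODEL = "nomic-ai/nomic-embed-text-v1.5"


def build_embeddings_request(texts, task, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST request in the OpenAI embeddings format."""
    payload = {
        "model": MODEL,
        # Assumed prefix layout: "<task>: <text>".
        "input": [f"{task}: {text}" for text in texts],
    }
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. An OpenAI-compatible service is expected to return the vectors under data[i].embedding in the JSON response, though that should be confirmed against the provider's documentation once the model is available.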

not yet live

We're benchmarking and onboarding Nomic Embed Text V1.5 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
