LENS-d8000

yibinlei/LENS-d8000

published Dec 2024 · updated Jan 2025

LENS-d8000 is a lexicon-based text embedding model that produces 8000-dimensional representations where each dimension corresponds to a token cluster.

est. price

~$0.008

/ 1M tokens · estimated, set at launch

API providers

downloads / mo

108

license

apache-2.0

specs

Task	Text Embedding
Architecture	Mistral-based bidirectional
License	Apache-2.0

about this model

LENS-d8000 is a text embedding model built on a large language model. It produces 8000-dimensional lexicon-based embeddings (LENS) in which each dimension corresponds to a cluster of semantically similar tokens. The model consolidates the LLM's vocabulary space by clustering token embeddings, which reduces token redundancy and yields compact representations whose dimensionality is comparable to that of dense embeddings.
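The clustering idea can be illustrated with a toy sketch. It assumes plain k-means over a random stand-in for a token embedding table, and it assumes the cluster centroids are then used to score a hidden state, one score per cluster. The sizes, the `kmeans` helper, and the scoring step are illustrative assumptions, not the model's actual training code.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means. Returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Move each center to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

rng = random.Random(0)
# Toy stand-in for an LLM token embedding table: 40 "tokens", 4 dimensions each.
token_embeddings = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(40)]

# LENS-d8000 has 8000 clusters; 5 keeps the toy example fast.
centers, labels = kmeans(token_embeddings, k=5)

# Assumption: score a hidden state against each centroid, one score per cluster.
hidden_state = [rng.gauss(0, 1) for _ in range(4)]
cluster_scores = [sum(h * c for h, c in zip(hidden_state, center)) for center in centers]
print(len(cluster_scores))  # one dimension per token cluster
```

Each token ends up with exactly one cluster label, so the output has as many dimensions as there are clusters rather than as many as there are vocabulary entries.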

Key Strengths

Outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB) at a comparable embedding size.
Supports efficient embedding dimension pruning without specialized objectives such as Matryoshka Representation Learning.
When combined with dense embeddings, achieves state-of-the-art performance on the retrieval subset of MTEB (BEIR).
Uses bidirectional attention and max-pooling over query tokens to produce high-quality embeddings for retrieval and classification tasks.
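Two of the points above, max-pooling over tokens and dimension pruning, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the per-token scores are made up, and `prune` simply keeps a caller-supplied set of dimension indices because the page does not say how dimensions are selected. The helper names are hypothetical.

```python
import math

def max_pool(token_scores):
    """Element-wise max over per-token score vectors: one value per dimension."""
    return [max(column) for column in zip(*token_scores)]

def prune(embedding, keep_indices):
    """Keep only the listed dimensions (the selection policy is an assumption)."""
    return [embedding[i] for i in keep_indices]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up scores for 3 tokens over 4 cluster dimensions.
token_scores = [
    [0.1, 2.0, -1.0, 0.5],
    [1.5, 0.3, -0.2, 0.0],
    [0.2, 0.1, 0.4, 0.9],
]

embedding = max_pool(token_scores)   # [1.5, 2.0, 0.4, 0.9]
pruned = prune(embedding, [0, 1])    # [1.5, 2.0]

other = prune([1.0, 2.5, 0.0, 0.3], [0, 1])
print(cosine(pruned, other))
```

Pruning both vectors with the same index set keeps them comparable, so similarity can still be computed on the smaller representation.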

Benchmark Results

Task	Metric	Score
AmazonCounterfactualClassification	Accuracy	93.69
AmazonPolarityClassification	Accuracy	97.07
AmazonReviewsClassification	Accuracy	63.61
ArguAna Retrieval	MAP@10	69.89
ArguAna Retrieval	MRR@10	70.23

Additional MTEB results are available in the model's paper and Hugging Face model-index.

Model Details

Accepted at ACL 2025 (paper: Enhancing Lexicon-Based Text Embeddings with Large Language Models).
Licensed under Apache-2.0.
Suitable for feature extraction, sentence similarity, and retrieval tasks. It is being onboarded as a hosted, OpenAI-compatible API on gigarouter.

best for

·Retrieval-augmented generation
·Semantic search and passage retrieval
·Lexicon-based similarity and matching

FAQ

What is LENS-d8000 used for?

It produces 8000-dimensional text embeddings for tasks like retrieval, clustering, and classification, with each dimension aligned to a token cluster.

How does LENS-d8000 compare to dense embeddings?

It is lexicon-based rather than dense, is reported to outperform dense embeddings on MTEB at a comparable embedding size, and can be combined with dense embeddings for state-of-the-art results on the retrieval subset of MTEB (BEIR).

What is the license for LENS-d8000?

Apache-2.0.

How can I use LENS-d8000 via the gigarouter API?

Send requests to the OpenAI-compatible endpoint with your API key, using the model name "yibinlei/LENS-d8000".
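A request in the OpenAI embeddings format might look like the sketch below. The base URL is a hypothetical placeholder (this page does not give the gigarouter endpoint address), and `build_embedding_request` is an illustrative helper. The sketch only builds the request and does not send it, since the page lists the hosted endpoint as not yet live.

```python
import json
import urllib.request

# Hypothetical placeholder: substitute the real base URL from the provider's docs.
BASE_URL = "https://api.example.com/v1"

def build_embedding_request(texts, api_key, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style POST to the /embeddings route."""
    payload = {"model": "yibinlei/LENS-d8000", "input": texts}
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_embedding_request(["what is a lexicon-based embedding?"], api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)

# Once the endpoint is live, the request could be sent like this:
# with urllib.request.urlopen(req) as resp:
#     vectors = [item["embedding"] for item in json.load(resp)["data"]]
```

In the OpenAI embeddings response format, each entry of `data` carries one `embedding` list, so `vectors` would hold one 8000-dimensional vector per input string.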

What is the input format for this model?

Text marked up with the special <instruct> and <query> tags, which is then tokenized. The <instruct> part carries an optional task-specific instruction and the <query> part carries the text to embed.
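A small formatting helper makes the tag layout concrete. The exact template below (tag order, a newline between the two parts) is an assumption for illustration; check the Hugging Face model card for the precise string before relying on it. `format_query` is a hypothetical name.

```python
def format_query(instruction, query):
    """Wrap a task instruction and a query in <instruct>/<query> tags.

    Assumed layout: instruction first, then a newline, then the query.
    """
    return f"<instruct>{instruction}\n<query>{query}"

prompt = format_query(
    "Given a web search query, retrieve relevant passages that answer the query.",
    "how do lexicon-based embeddings work?",
)
print(prompt)
```

The instruction text is free-form, so the same helper can be reused with different task descriptions for retrieval, classification, or clustering prompts.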

not yet live

We're benchmarking and onboarding LENS-d8000 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related embedding models

nomic-embed-text-v1.5