LENS-d8000
yibinlei/LENS-d8000
published Dec 2024 · updated Jan 2025
LENS-d8000 is a lexicon-based text embedding model that produces 8000-dimensional representations where each dimension corresponds to a token cluster.
specs
| Field | Value |
|---|---|
| Task | Text Embedding |
| Architecture | Mistral-based bidirectional |
| License | Apache-2.0 |
about this model
LENS-d8000 is a text embedding model built on a large language model. It produces 8000-dimensional lexicon-based embeddings (LENS) in which each dimension corresponds to a cluster of semantically similar tokens. The model consolidates the LLM vocabulary space by clustering token embeddings, which addresses token redundancy and yields compact representations whose dimensionality is comparable to that of dense embeddings.
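The idea of clustering token embeddings so that each output dimension stands for a cluster can be sketched on toy data. This is a minimal illustration, not the model's actual code: the use of plain k-means, the log-saturated ReLU weighting, and the names `kmeans` and `lexicon_embedding` are all assumptions made for the example, and the sizes are far smaller than a real vocabulary or the 8000 clusters used by LENS-d8000.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means. Returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct input points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid: (N, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = points[assign == c]
            if len(members):  # leave empty clusters where they are
                centroids[c] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)

def lexicon_embedding(hidden_states, centroids):
    """Map per-token hidden states (T, d) to one weight per cluster (k,).

    Assumption: cluster centroids play the role of the output vocabulary,
    weights are log(1 + relu(logit)), and tokens are max-pooled.
    """
    logits = hidden_states @ centroids.T         # (T, k)
    weights = np.log1p(np.maximum(logits, 0.0))  # non-negative, saturating
    return weights.max(axis=0)                   # max-pool over tokens

# Toy usage: 200 "token embeddings" of width 16, grouped into 8 clusters.
rng = np.random.default_rng(1)
token_embeddings = rng.normal(size=(200, 16))
centroids, assignments = kmeans(token_embeddings, k=8)
embedding = lexicon_embedding(rng.normal(size=(5, 16)), centroids)
print(embedding.shape)
```

Each entry of `embedding` is the strongest activation any input token produced for that cluster, which is what makes the dimensions interpretable as groups of related tokens.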
Key Strengths
- Outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB) at a comparable embedding size.
- Supports efficient embedding dimension pruning without specialized objectives such as Matryoshka Representation Learning.
- When combined with dense embeddings, achieves state-of-the-art performance on the retrieval subset of MTEB (BEIR).
- Uses bidirectional attention and max-pooling over query tokens to produce high-quality embeddings for retrieval and classification tasks.
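Two of the points above, dimension pruning and combination with dense embeddings, can be illustrated with a small sketch. This page does not say how either is done, so the sketch makes explicit assumptions: pruning is shown as keeping the first `keep` dimensions and renormalizing, and combination is shown as concatenating the two L2-normalized vectors. The helper names `l2_normalize`, `prune`, and `combine` are hypothetical.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def prune(embedding, keep):
    """Assumed pruning scheme: keep the first `keep` dimensions, then renormalize."""
    return l2_normalize(embedding[:keep])

def combine(lexicon, dense):
    """Assumed combination scheme: concatenate the two normalized vectors.

    The dot product of two combined vectors then equals the sum of the
    lexicon cosine similarity and the dense cosine similarity.
    """
    return np.concatenate([l2_normalize(lexicon), l2_normalize(dense)])

# Toy usage with small vectors standing in for real embeddings.
lexicon_vec = np.array([3.0, 4.0, 12.0, 0.0])
dense_vec = np.array([1.0, 0.0, 0.0])
print(prune(lexicon_vec, keep=2))
print(combine(lexicon_vec, dense_vec).shape)
```

Normalizing each part before concatenation keeps either representation from dominating the similarity score just because its raw values are larger.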
Benchmark Results
| Task | Metric | Score |
|---|---|---|
| AmazonCounterfactualClassification | Accuracy | 93.69 |
| AmazonPolarityClassification | Accuracy | 97.07 |
| AmazonReviewsClassification | Accuracy | 63.61 |
| ArguAna Retrieval | MAP@10 | 69.89 |
| ArguAna Retrieval | MRR@10 | 70.23 |
Additional MTEB results are available in the model's paper and Hugging Face model-index.
Model Details
- Accepted at ACL 2025 (paper: Enhancing Lexicon-Based Text Embeddings with Large Language Models).
- Licensed under Apache-2.0.
- Suitable for feature extraction, sentence similarity, and retrieval tasks; hosted as an OpenAI-compatible API on gigarouter.
best for
- Retrieval-augmented generation
- Semantic search and passage retrieval
- Lexicon-based similarity and matching
FAQ
LENS-d8000 produces 8000-dimensional text embeddings for tasks like retrieval, clustering, and classification, with each dimension aligned to a token cluster.
Although lexicon-based, it achieves competitive performance on MTEB, and it can be combined with dense embeddings for state-of-the-art results on the retrieval subset of MTEB (BEIR).
The license is Apache-2.0.
To use it, send requests to the OpenAI-compatible endpoint with your API key, using the model name "yibinlei/LENS-d8000".
Input is tokenized text in a special <instruct> and <query> format; instructions can be supplied as task-specific prompts.
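The request and input format described in the answers above might look like the following sketch. Several details are assumptions rather than facts from this page: `BASE_URL` is a placeholder (the real endpoint address is not given here), the `/embeddings` path and the response shape follow the usual OpenAI-compatible convention, and the exact layout of the `<instruct>`/`<query>` string, including the newline separator, is a guess. The function names are hypothetical.

```python
import json
import urllib.request

# Placeholder, NOT the real endpoint: substitute the provider's base URL.
BASE_URL = "https://api.example.com/v1"
MODEL_NAME = "yibinlei/LENS-d8000"

def format_query(instruction, query):
    """Assumed prompt layout: instruction and query tagged, newline-separated."""
    return f"<instruct>{instruction}\n<query>{query}"

def build_embedding_request(texts, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST request for an embeddings endpoint."""
    payload = {"model": MODEL_NAME, "input": texts}
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def embed(texts, api_key, base_url=BASE_URL):
    """Send the request and return one embedding (list of floats) per input."""
    request = build_embedding_request(texts, api_key, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style responses carry the vectors under data[i]["embedding"].
    return [item["embedding"] for item in body["data"]]

# Building a request needs no network access.
text = format_query("Retrieve passages that answer the question", "What is LENS?")
request = build_embedding_request([text], api_key="YOUR_API_KEY")
print(request.full_url)
```

Calling `embed` performs the actual network request, so it only works once a real base URL and API key are supplied.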
We're benchmarking and onboarding LENS-d8000 as a hosted, OpenAI-compatible API.