LENS-d4000
yibinlei/LENS-d4000
published Dec 2024 · updated Jan 2025
LENS-d4000 is a lexicon-based text embedding model built on a large language model. It produces 4000-dimensional representations in which each dimension corresponds to a cluster of semantically similar tokens.
specs
| Task | Text embedding (feature extraction) |
| Architecture | Lexicon-based embedding (LENS) with bidirectional Mistral and max-pooling |
| Embedding dimension | 4000 |
| License | Not specified |
about this model
LENS-d4000 embeddings are interpretable: each of the 4000 dimensions corresponds to a cluster of semantically similar tokens rather than an opaque latent feature. The model remains competitive with dense embeddings on retrieval and classification tasks.
Key Strengths
LENS consolidates vocabulary from large language models via token-embedding clustering, addressing token redundancy. It employs bidirectional attention and max-pooling to generate compact embeddings that rival dense alternatives. The model inherently supports efficient dimension pruning without specialized objectives such as Matryoshka Representation Learning.
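The pooling step described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the dot-product scoring against cluster centroids, and the top-k pruning rule are all assumptions chosen for illustration, and the real model works on Mistral hidden states with 4000 clusters rather than the tiny lists used here.

```python
from typing import List, Sequence


def lexicon_embedding(hidden_states: Sequence[Sequence[float]],
                      centroids: Sequence[Sequence[float]]) -> List[float]:
    """Score every token position against every cluster centroid,
    then max-pool over positions to get one value per cluster."""
    scores = [
        [sum(h * c for h, c in zip(state, centroid)) for centroid in centroids]
        for state in hidden_states
    ]
    # zip(*scores) groups the scores by cluster; max() keeps the strongest position.
    return [max(column) for column in zip(*scores)]


def prune_top_k(embedding: Sequence[float], k: int) -> List[float]:
    """Keep the k largest entries and zero the rest.
    This is an assumed pruning rule, not necessarily the one used in the paper."""
    order = sorted(range(len(embedding)), key=lambda i: embedding[i], reverse=True)
    keep = set(order[:k])
    return [value if i in keep else 0.0 for i, value in enumerate(embedding)]


# Toy input: 2 token positions, 2-dimensional hidden states, 3 clusters.
hidden_states = [[1.0, 0.0], [0.0, 2.0]]
centroids = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embedding = lexicon_embedding(hidden_states, centroids)

# Pruning a 4-dimensional toy vector down to its 2 strongest dimensions.
pruned = prune_top_k([0.1, 0.9, 0.5, 0.3], k=2)
```

Because every output dimension is tied to a token cluster, zeroing a dimension removes one identifiable group of tokens from the representation, which is what makes this kind of pruning easy to reason about.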
Benchmark Performance
LENS-d4000 achieves strong results across the Massive Text Embedding Benchmark (MTEB):
- AmazonCounterfactualClassification (en): accuracy 93.61%, F1 90.32%, AP 73.89%
- AmazonPolarityClassification: accuracy 97.05%, F1 97.05%, AP 95.53%
- AmazonReviewsClassification (en): accuracy 62.83%, F1 61.46%
- ArguAna retrieval: NDCG@10 77.32%, MAP@1 56.76%, MAP@10 71.14%, MRR@10 71.49%
According to the paper (accepted at ACL 2025), LENS outperforms dense embeddings on MTEB. When combined with dense embeddings, it achieves state-of-the-art performance on the retrieval subset of MTEB (BEIR).
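For readers unfamiliar with the retrieval metric reported for ArguAna, the following is a minimal sketch of NDCG@k using the common linear-gain formulation. The function names are illustrative, and MTEB's own evaluation code may differ in details such as tie handling.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(ranked_relevances: Sequence[float],
              all_relevances: Sequence[float],
              k: int = 10) -> float:
    """DCG of the system ranking, normalized by the DCG of the ideal ranking."""
    ideal = sorted(all_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_relevances[:k]) / ideal_dcg


# One relevant document exists; the system placed it at rank 2.
score = ndcg_at_k(ranked_relevances=[0, 1, 0], all_relevances=[1])
```

A score of 1.0 means the relevant documents appear at the very top of the ranking; placing the only relevant document at rank 2 instead of rank 1 already lowers the score noticeably.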
Usage in Production
Gigarouter hosts LENS-d4000 as a managed, OpenAI-compatible API. Developers can generate embeddings by sending queries and documents with a task instruction prefix, bypassing local model loading and inference infrastructure.
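As a rough sketch of what such a call could look like, the snippet below builds (but does not send) a request in the shape of the OpenAI embeddings API. The base URL, the model identifier, and the instruction prefix format are all placeholders assumed for illustration; check the provider documentation and the model card for the real values.

```python
import json
import urllib.request
from typing import List

BASE_URL = "https://api.example.com/v1"   # placeholder, not a real endpoint
MODEL_ID = "yibinlei/LENS-d4000"          # assumed model identifier


def with_instruction(instruction: str, text: str) -> str:
    """Prepend a task instruction. The prefix format here is hypothetical."""
    return f"Instruct: {instruction}\nQuery: {text}"


def build_embedding_request(texts: List[str], api_key: str) -> urllib.request.Request:
    """Build a POST request with an OpenAI-style embeddings payload."""
    payload = json.dumps({"model": MODEL_ID, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


query = with_instruction(
    "Retrieve passages that answer the question",
    "what is a lexicon-based embedding?",
)
request = build_embedding_request([query], api_key="YOUR_API_KEY")
# Sending is left out on purpose; urllib.request.urlopen(request) would perform the call.
```

In an OpenAI-style response, each embedding is typically returned under `data[i].embedding`, in the same order as the inputs.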
best for
- Web search retrieval: retrieving relevant passages for a given query
- Classification tasks: e.g., Amazon reviews classification and polarity classification
- Lexicon-based matching: tasks requiring interpretable dimension-to-token correspondences
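For the retrieval use case, a minimal ranking sketch follows. It assumes the query and passage embeddings are already computed and unit-normalized, so the dot product equals cosine similarity. The function names and the 3-dimensional toy vectors are illustrative only.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank_passages(query_vec: Sequence[float],
                  passage_vecs: Sequence[Sequence[float]]) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs sorted from most to least similar."""
    scored = [(i, dot(query_vec, vec)) for i, vec in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy unit-length vectors standing in for 4000-dimensional embeddings.
query_vec = [1.0, 0.0, 0.0]
passage_vecs = [
    [0.0, 1.0, 0.0],   # shares no active dimension with the query
    [0.6, 0.8, 0.0],   # partial overlap
    [1.0, 0.0, 0.0],   # identical direction
]
ranking = rank_passages(query_vec, passage_vecs)
```

With lexicon-based embeddings, the dimensions that contribute most to a high score can be traced back to token clusters, which is useful when explaining why a passage was retrieved.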
FAQ
LENS-d4000 provides lexicon-based embeddings where each dimension corresponds to a token cluster, enabling interpretable lexical matching and efficient dimension pruning without specialized training objectives.
Use the Gigarouter OpenAI-compatible endpoint with your API key: send text inputs and receive 4000-dimensional embeddings in response.
Input is text (queries or documents) with optional task instruction; output is a normalized 4000-dimensional embedding vector. Max sequence length is 512 tokens.
The model card on Hugging Face does not specify a license.
It achieves strong results on MTEB, e.g., AmazonCounterfactualClassification accuracy 93.61%, AmazonPolarityClassification accuracy 97.05%, and ArguAna retrieval NDCG@10 77.32%.
We're benchmarking and onboarding LENS-d4000 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.