LENS-d4000

yibinlei/LENS-d4000

published Dec 2024 · updated Jan 2025

LENS-d4000 is a lexicon-based text embedding model built on a large language model. It produces 4000-dimensional representations in which each dimension corresponds to a cluster of semantically similar tokens.

est. price

~$0.008

/ 1M tokens · estimated, set at launch

API providers

downloads / mo

101

license

apache-2.0

specs

Task	Text embedding (feature extraction)
Architecture	Lexicon-based embedding (LENS) with bidirectional Mistral and max-pooling
Embedding dimension	4000
License	apache-2.0

about this model

LENS-d4000 is a lexicon-based text embedding model that produces 4000-dimensional representations, where each dimension corresponds to a cluster of semantically similar tokens, enabling interpretable and competitive embeddings for retrieval and classification tasks.

Key Strengths

LENS consolidates vocabulary from large language models via token-embedding clustering, addressing token redundancy. It employs bidirectional attention and max-pooling to generate compact embeddings that rival dense alternatives. The model inherently supports efficient dimension pruning without specialized objectives such as Matryoshka Representation Learning.
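As a rough illustration of the pooling and pruning steps described above, the sketch below max-pools per-token cluster scores into a single vector and then prunes it. The toy sizes, the function names, and the pruning rule (keep the k largest activations) are assumptions made for this example; the page does not state how the authors prune dimensions.

```python
from typing import List


def max_pool(token_logits: List[List[float]]) -> List[float]:
    """Column-wise max over token positions: one score per cluster dimension."""
    if not token_logits:
        raise ValueError("need at least one token position")
    return [max(column) for column in zip(*token_logits)]


def prune_top_k(embedding: List[float], k: int) -> List[float]:
    """Keep the k largest entries and zero the rest (assumed pruning rule)."""
    if k <= 0:
        return [0.0] * len(embedding)
    order = sorted(range(len(embedding)), key=lambda i: embedding[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(embedding)]


# Toy input: 3 token positions x 5 cluster dimensions (the real model has 4000).
token_logits = [
    [0.1, 2.0, 0.0, 0.3, 0.0],
    [0.4, 0.5, 0.0, 1.2, 0.0],
    [0.0, 0.1, 0.9, 0.2, 0.0],
]
pooled = max_pool(token_logits)
pruned = prune_top_k(pooled, k=2)
print(pooled)
print(pruned)
```

Because every dimension stands for a token cluster, the surviving non-zero positions in the pruned vector can be read as the clusters that the text activates most strongly.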

Benchmark Performance

LENS-d4000 reports the following scores on tasks from the Massive Text Embedding Benchmark (MTEB):

AmazonCounterfactualClassification (en): accuracy 93.61%, F1 90.32%, AP 73.89%
AmazonPolarityClassification: accuracy 97.05%, F1 97.05%, AP 95.53%
AmazonReviewsClassification (en): accuracy 62.83%, F1 61.46%
ArguAna retrieval: NDCG@10 77.32%, MAP@1 56.76%, MAP@10 71.14%, MRR@10 71.49%

According to the paper (accepted at ACL 2025), LENS outperforms dense embeddings on MTEB and, when combined with dense embeddings, achieves state-of-the-art performance on the BEIR retrieval subset.

Usage in Production

Gigarouter is onboarding LENS-d4000 as a managed, OpenAI-compatible API. Once it is live, developers will be able to generate embeddings by sending queries and documents with a task instruction prefix, without loading the model or running inference infrastructure locally.
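A minimal sketch of what such a call could look like, assuming the endpoint follows the standard OpenAI embeddings request shape (`model` plus `input`). The base URL, the model identifier string, and the instruction template are all placeholders invented for this example; none of them is specified on this page. The code only builds the request and does not send it.

```python
import json
import urllib.request
from typing import List

# Hypothetical values: the real base URL and model identifier are not given here.
BASE_URL = "https://api.example.invalid/v1"
MODEL_ID = "yibinlei/LENS-d4000"


def with_instruction(task: str, text: str) -> str:
    """Prefix a query with a task instruction (the template is an assumption)."""
    return f"Instruct: {task}\nQuery: {text}"


def build_embedding_request(texts: List[str], api_key: str) -> urllib.request.Request:
    """Build, but do not send, a POST request in the OpenAI embeddings format."""
    body = json.dumps({"model": MODEL_ID, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


task = "Given a web search query, retrieve relevant passages that answer the query"
request = build_embedding_request(
    [with_instruction(task, "what is a lexicon-based embedding?")],
    api_key="YOUR_API_KEY",
)
payload = json.loads(request.data)
print(payload["input"][0])
```

With a real base URL and key, passing `request` to `urllib.request.urlopen` would be expected to return a JSON body whose `data[i].embedding` fields hold the vectors, as in the OpenAI embeddings response format.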

best for

·Web search retrieval: retrieving relevant passages for a given query
·Classification tasks: e.g., Amazon reviews classification and polarity classification
·Lexicon-based matching: tasks requiring interpretable dimension-to-token correspondences

FAQ

What is the main advantage of LENS-d4000 over dense embeddings?

LENS-d4000 provides lexicon-based embeddings where each dimension corresponds to a token cluster, enabling interpretable lexical matching and efficient dimension pruning without specialized training objectives.

How do I call LENS-d4000 via the gigarouter API?

Once the model is live, call the gigarouter OpenAI-compatible embeddings endpoint with your API key. You send text inputs and receive 4000-dimensional embeddings in response.

What is the input and output format for LENS-d4000?

Input is text (queries or documents) with optional task instruction; output is a normalized 4000-dimensional embedding vector. Max sequence length is 512 tokens.
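Since the output vectors are described as normalized, a dot product between two of them equals their cosine similarity, which is enough to rank documents against a query. The sketch below shows this with tiny hand-made vectors standing in for real 4000-dimensional embeddings; the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def rank_by_dot(query: Sequence[float], docs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices sorted by descending dot product with the query."""
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy 4-dimensional stand-ins for embeddings returned by the model.
query = l2_normalize([1.0, 0.0, 1.0, 0.0])
docs = [
    l2_normalize([1.0, 0.0, 0.0, 0.0]),  # shares one active dimension with the query
    l2_normalize([0.0, 1.0, 0.0, 0.0]),  # shares none
    l2_normalize([1.0, 0.0, 1.0, 0.0]),  # same direction as the query
]
ranking = rank_by_dot(query, docs)
print(ranking)
```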

What license is LENS-d4000 released under?

LENS-d4000 is listed under the apache-2.0 license.

How does LENS-d4000 perform on the MTEB benchmark?

Reported MTEB scores include AmazonCounterfactualClassification accuracy 93.61%, AmazonPolarityClassification accuracy 97.05%, and ArguAna retrieval NDCG@10 77.32%.

not yet live

We're benchmarking and onboarding LENS-d4000 as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.
