LLM2Vec Llama 2 7B Chat Unsupervised SimCSE

McGill-NLP/LLM2Vec-Llama-2-7b-chat-hf-mntp-unsup-simcse

published Apr 2024 · updated Apr 2024

LLM2Vec Llama 2 7B Chat Unsupervised SimCSE is a text embedding model built with LLM2Vec, a recipe that converts decoder-only LLMs into text encoders using bidirectional attention, masked next token prediction, and unsupervised contrastive learning.

status

coming soon

license

mit

specs

Task	Text Embedding
Architecture	LLM2Vec (Bidirectional Llama-2-7b with MNTP + Unsupervised SimCSE)
Parameters	7B
License	MIT

about this model

LLM2Vec-Llama-2-7b-chat-hf-mntp-unsup-simcse is an embedding model that converts a decoder-only large language model into a text encoder using a three-step unsupervised recipe: enabling bidirectional attention, masked next token prediction, and unsupervised contrastive learning (SimCSE). Built on Llama-2-7b-chat, it produces dense vector representations for tasks such as semantic search, retrieval, and clustering.
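The contrastive step can be illustrated with a small sketch. In unsupervised SimCSE, each sentence is encoded twice with different dropout noise, the two resulting vectors form a positive pair, and the other sentences in the batch act as negatives. The code below is a minimal pure-Python illustration under those assumptions, not the training code used for this model; the function names and the temperature of 0.05 are illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean in-batch contrastive (InfoNCE) loss.

    view_a[i] and view_b[i] are two embeddings of the same sentence i
    (e.g. two forward passes with different dropout masks). Every
    view_b[j] with j != i serves as a negative for view_a[i].
    The temperature value is an assumption for illustration.
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(view_a)
```

The loss is small when each vector is closest to its own second view and grows when the pairing is wrong, which is what pushes embeddings of the same sentence together and embeddings of different sentences apart.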

Key Strengths

The LLM2Vec paper reports state-of-the-art unsupervised performance on the Massive Text Embedding Benchmark (MTEB) at the time of publication, and a large margin over encoder-only models on word-level tasks. When further fine-tuned with supervised contrastive learning, LLM2Vec reaches state-of-the-art MTEB results among models trained exclusively on publicly available data (as of May 24, 2024). The transformation is parameter-efficient: it relies on LoRA adapters and needs neither expensive adaptation nor synthetic data.

Benchmark Results

The paper reports a new unsupervised state-of-the-art on MTEB for the LLM2Vec recipe. This page does not list per-task scores; individual MTEB task results are available in the original paper (COLM 2024).

Architecture Overview

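LLM2Vec converts a decoder-only LLM into a bidirectional text encoder in three steps: (1) enable bidirectional attention, (2) train with masked next token prediction (MNTP), and (3) apply unsupervised contrastive learning (SimCSE).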
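The first step of the conversion, enabling bidirectional attention, amounts to changing which positions each token may attend to. The sketch below contrasts the two mask shapes in plain Python; the function names are illustrative and do not come from the LLM2Vec code base.

```python
def causal_mask(n):
    """Decoder-style mask: query position i may attend to key positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


if __name__ == "__main__":
    # Rows are query positions, columns are key positions.
    for row in causal_mask(4):
        print(row)
    for row in bidirectional_mask(4):
        print(row)
```

With the causal mask, the first token sees only itself, so its representation cannot reflect later context. The bidirectional mask removes that restriction, and the two training steps that follow adapt the model to use it.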

Licensing and Availability

Released under the MIT license. The work was published at COLM 2024. gigarouter is onboarding the model as a managed, OpenAI-compatible API, which would remove the need for local setup or GPU infrastructure; the hosted endpoint is not yet live.

best for

·Retrieving relevant passages for web search queries
·Document similarity and semantic search
·Unsupervised sentence embedding for clustering or classification
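For the retrieval and similarity use cases above, a common approach is to rank candidate passages by cosine similarity between their embeddings and the query embedding. The following is a minimal sketch with hand-written vectors standing in for model output; the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), index)
        for index, vec in enumerate(passage_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored]
```

In practice the vectors would come from the embedding model, and for large collections an approximate nearest-neighbor index would usually replace the exhaustive loop.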

FAQ

What input format does the model expect?

A query is passed as a two-part input: an instruction followed by the query text. A document is passed as plain text. The sentence embedding is obtained by mean pooling over the token embeddings.
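Mean pooling can be sketched as averaging the token vectors that an attention mask marks as real tokens, so that padding does not dilute the result. This is an illustrative pure-Python version, not the model's own implementation; whether instruction tokens are included in the average is not stated on this page, so the mask is left to the caller.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1.

    token_embeddings: list of per-token vectors (lists of floats).
    attention_mask:   list of 0/1 flags, one per token; 0 marks tokens
                      (such as padding) that should be ignored.
    """
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                sums[k] += vector[k]
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [s / count for s in sums]
```

For example, with three token vectors where the last one is padding, only the first two contribute to the pooled embedding.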

What is the maximum sequence length?

512 tokens.

What license is this model released under?

MIT License.

How do I use this model via the gigarouter API?

Once the model is live, call the OpenAI-compatible embeddings endpoint with your gigarouter API key and pass the input text as specified in the documentation.
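As a rough sketch of what such a call could look like, the code below builds (but does not send) a request in the shape of the OpenAI embeddings API, with a JSON body containing `model` and `input` and a bearer token header. The base URL is a hypothetical placeholder, and using the model ID shown at the top of this page as the API model name is an assumption; check the gigarouter documentation for the real values once the model is available.

```python
import json
import urllib.request

# Hypothetical placeholder; the real base URL is not given on this page.
BASE_URL = "https://api.gigarouter.example/v1"
# Assumption: the API model name matches the model ID shown on this page.
MODEL_ID = "McGill-NLP/LLM2Vec-Llama-2-7b-chat-hf-mntp-unsup-simcse"


def build_embedding_request(texts, api_key, base_url=BASE_URL, model=MODEL_ID):
    """Build a POST request in the OpenAI embeddings format without sending it."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` and reading the JSON response would complete the call; that step is omitted here because the endpoint is not yet live.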

How does this model compare to other embedding models?

The LLM2Vec paper reports unsupervised state-of-the-art performance on MTEB for this recipe, along with a large margin over encoder-only models on word-level tasks.

not yet live

We're benchmarking and onboarding LLM2Vec Llama 2 7B Chat Unsupervised SimCSE as a hosted, OpenAI-compatible API. Sign in for free credit and be ready when it lands, or tell us you want it and we'll prioritize it.

related embeddings models

nomic-embed-text-v1.5