Neural-Cherche is a library designed to fine-tune neural search models such as Splade, ColBERT, and SparseEmbed on a specific dataset. Neural-Cherche also provides classes to run efficient inference on a fine-tuned retriever or ranker. Neural-Cherche aims to offer a straightforward and effective way to fine-tune and use neural search models in both offline and online settings. It also lets users save every computed embedding to avoid redundant computations.
Neural-Cherche is compatible with CPU, GPU and MPS devices. We can fine-tune ColBERT from any Sentence Transformer pre-trained checkpoint. Splade and SparseEmbed are trickier to fine-tune and require an MLM pre-trained checkpoint.
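As an illustration, here is a minimal sketch of how Splade and SparseEmbed could be instantiated from an MLM pre-trained checkpoint; it assumes that models.Splade and models.SparseEmbed follow the same constructor pattern as models.ColBERT shown below, and bert-base-uncased is only an example of an MLM checkpoint:

import torch

from neural_cherche import models

# "bert-base-uncased" is only an illustrative MLM checkpoint; any MLM
# pre-trained model should work for Splade and SparseEmbed.
device = "cuda" if torch.cuda.is_available() else "cpu"  # or "mps"

splade_model = models.Splade(
    model_name_or_path="bert-base-uncased",
    device=device,
)

sparse_embed_model = models.SparseEmbed(
    model_name_or_path="bert-base-uncased",
    device=device,
)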
We can install neural-cherche using:
pip install neural-cherche
If we plan to evaluate our model during training, install:
pip install "neural-cherche[eval]"
The complete documentation is available here.
Your training dataset must be made of triples (anchor, positive, negative), where the anchor is a query, the positive is a document relevant to the anchor, and the negative is a document that is not relevant to the anchor.
X = [
    ("anchor 1", "positive 1", "negative 1"),
    ("anchor 2", "positive 2", "negative 2"),
    ("anchor 3", "positive 3", "negative 3"),
]
Here is how to fine-tune ColBERT from a Sentence Transformer pre-trained checkpoint using Neural-Cherche:
import torch

from neural_cherche import models, utils, train

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

X = [
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
    ("query", "positive document", "negative document"),
]

for step, (anchor, positive, negative) in enumerate(utils.iter(
    X,
    epochs=1,  # number of epochs
    batch_size=8,  # number of triples per batch
    shuffle=True,
)):
    loss = train.train_colbert(
        model=model,
        optimizer=optimizer,
        anchor=anchor,
        positive=positive,
        negative=negative,
        step=step,
        gradient_accumulation_steps=50,
    )

    if (step + 1) % 1000 == 0:
        # Save the model every 1000 steps.
        model.save_pretrained("checkpoint")
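Once training is complete, the saved checkpoint can be reloaded like any other pre-trained model. A minimal sketch, assuming the "checkpoint" directory written by model.save_pretrained above:

import torch

from neural_cherche import models

# Reload the fine-tuned weights from the local "checkpoint" directory
# written by model.save_pretrained during training.
model = models.ColBERT(
    model_name_or_path="checkpoint",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)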
Here is how to use the fine-tuned ColBERT model to re-rank documents:
import torch
from lenlp import sparse

from neural_cherche import models, rank, retrieve

documents = [
    {"id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
    {"id": "doc2", "title": "Montreal", "text": "Montreal is the largest city in Quebec."},
    {"id": "doc3", "title": "Bordeaux", "text": "Bordeaux is in Southwestern France."},
]

retriever = retrieve.BM25(
    key="id",
    on=["title", "text"],
    count_vectorizer=sparse.CountVectorizer(
        normalize=True, ngram_range=(3, 5), analyzer="char_wb", stop_words=[]
    ),
    k1=1.5,
    b=0.75,
    epsilon=0.0,
)

model = models.ColBERT(
    model_name_or_path="raphaelsty/neural-cherche-colbert",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

ranker = rank.ColBERT(
    key="id",
    on=["title", "text"],
    model=model,
)

documents_embeddings = retriever.encode_documents(
    documents=documents,
)

retriever.add(
    documents_embeddings=documents_embeddings,
)
Now we can retrieve and re-rank documents using the fine-tuned model:
queries = ["Paris", "Montreal", "Bordeaux"]

queries_embeddings = retriever.encode_queries(
    queries=queries,
)

ranker_queries_embeddings = ranker.encode_queries(
    queries=queries,
)

candidates = retriever(
    queries_embeddings=queries_embeddings,
    batch_size=32,
    k=100,  # number of documents to retrieve
)

# Compute embeddings of the candidates with the ranker model.
# Note, we could also pre-compute all the embeddings.
ranker_documents_embeddings = ranker.encode_candidates_documents(
    candidates=candidates,
    documents=documents,
    batch_size=32,
)

scores = ranker(
    queries_embeddings=ranker_queries_embeddings,
    documents_embeddings=ranker_documents_embeddings,
    documents=candidates,
    batch_size=32,
)
scores
[[{'id': 0, 'similarity': 22.825355529785156},
{'id': 1, 'similarity': 11.201947212219238},
{'id': 2, 'similarity': 10.748161315917969}],
[{'id': 1, 'similarity': 23.21628189086914},
{'id': 0, 'similarity': 9.9658203125},
{'id': 2, 'similarity': 7.308732509613037}],
[{'id': 1, 'similarity': 6.4031805992126465},
{'id': 0, 'similarity': 5.601611137390137},
{'id': 2, 'similarity': 5.599479675292969}]]
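As mentioned in the comment in the snippet above, the documents' embeddings can also be pre-computed once for the whole corpus rather than encoded at query time. A minimal sketch, assuming ranker.encode_documents accepts the same arguments as retriever.encode_documents:

# Pre-compute ColBERT embeddings for the whole corpus once
# (assumes ranker.encode_documents mirrors retriever.encode_documents).
ranker_documents_embeddings = ranker.encode_documents(
    documents=documents,
    batch_size=32,
)

scores = ranker(
    queries_embeddings=ranker_queries_embeddings,
    documents_embeddings=ranker_documents_embeddings,
    documents=candidates,
    batch_size=32,
)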
Neural-Cherche provides SparseEmbed, SPLADE, TFIDF and BM25 retrievers, as well as a ColBERT ranker which can be used to re-order the output of a retriever. For more information, please refer to the documentation.
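As an illustration, here is a minimal sketch of a SparseEmbed retriever reusing the documents and queries defined earlier; it assumes that retrieve.SparseEmbed follows the same key / on / model pattern as the BM25 retriever and ColBERT ranker shown above:

import torch

from neural_cherche import models, retrieve

# Assumed to follow the same pattern as the BM25 retriever above.
model = models.SparseEmbed(
    model_name_or_path="raphaelsty/neural-cherche-sparse-embed",
    device="cuda" if torch.cuda.is_available() else "cpu",  # or "mps"
)

retriever = retrieve.SparseEmbed(
    key="id",
    on=["title", "text"],
    model=model,
)

documents_embeddings = retriever.encode_documents(documents=documents)
retriever.add(documents_embeddings=documents_embeddings)

queries_embeddings = retriever.encode_queries(queries=queries)
candidates = retriever(
    queries_embeddings=queries_embeddings,
    k=10,  # number of documents to retrieve
)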
We provide pre-trained checkpoints specifically designed for Neural-Cherche: raphaelsty/neural-cherche-sparse-embed and raphaelsty/neural-cherche-colbert. These checkpoints are fine-tuned on a subset of the MS MARCO dataset and will benefit from further fine-tuning on your specific dataset. You can fine-tune ColBERT from any Sentence Transformer pre-trained checkpoint in order to fit your specific language. You should use an MLM-based checkpoint to fine-tune SparseEmbed.
SciFact dataset

| Model | HuggingFace Checkpoint | ndcg@10 | hits@10 | hits@1 |
|---|---|---|---|---|
| TfIdf | - | 0.62 | 0.86 | 0.50 |
| BM25 | - | 0.69 | 0.92 | 0.56 |
| SparseEmbed | raphaelsty/neural-cherche-sparse-embed | 0.62 | 0.87 | 0.48 |
| Sentence Transformer | sentence-transformers/all-mpnet-base-v2 | 0.66 | 0.89 | 0.53 |
| ColBERT | raphaelsty/neural-cherche-colbert | 0.70 | 0.92 | 0.58 |
| TfIdf retriever + ColBERT ranker | raphaelsty/neural-cherche-colbert | 0.71 | 0.94 | 0.59 |
| BM25 retriever + ColBERT ranker | raphaelsty/neural-cherche-colbert | 0.72 | 0.95 | 0.59 |
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, authored by Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2021.
- SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval, authored by Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stéphane Clinchant, SIGIR 2022.
- SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval, authored by Weize Kong, Jeffrey M. Dudek, Cheng Li, Mingyang Zhang, Mike Bendersky, SIGIR 2023.
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, authored by Omar Khattab, Matei Zaharia, SIGIR 2020.
This Python library is licensed under the MIT open-source license. The SPLADE model is licensed by its authors for non-commercial use only. SparseEmbed and ColBERT are fully open-source, including for commercial use.