Add run_retriever entrypoint and some new TREC metrics. (#16)
* [DOC] update singlehop_qa benchmark

* [FIX] set `DISABLE_CACHE` to False to disable batched_cache

* [FEAT](data) Add MappingDataset and MTEBDataset

* [FEAT](metrics) Implement the following retrieval metrics:
1. RetrievalRecall;
2. RetrievalPrecision;

* [REFACTOR](metrics) Change the return type of the metrics

* [REFACTOR](retriever) Move RetrievedContext into common_dataclasses.py

* [REFACTOR] Add id_field as a part of the Context.

* [REFACTOR] Refactor the data module into three standalone modules

* [REFACTOR] Use pytrec_eval to calculate the retrieval metrics

* [DOC] Optimize the documents for datasets module

* [DOC] Optimize the metric docstrings

* [DOC] Simplify the readme

* [DOC] Update the quickstart document

* [DOC] Update readme

* [FIX] Fix bug in SimpleWebDownloader

* [FEAT] Support JinaReaderLM2

* [FEAT] Add two tested encoders
ZhuochengZhang98 authored Jan 23, 2025
1 parent 4650325 commit d5eb9af
Showing 74 changed files with 1,913 additions and 1,239 deletions.
270 changes: 7 additions & 263 deletions README-zh.md
@@ -3,6 +3,8 @@
</p>

![Language](https://img.shields.io/badge/language-python-brightgreen)
[![Code Style](https://img.shields.io/badge/code%20style-black-black)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/imports-isort-blue)](https://pycqa.github.io/isort/)
[![github license](https://img.shields.io/github/license/ictnlp/flexrag)](LICENSE)
[![Read the Docs](https://img.shields.io/readthedocs/flexrag)](https://flexrag.readthedocs.io/en/latest/)
[![PyPI - Version](https://img.shields.io/pypi/v/flexrag)](https://pypi.org/project/flexrag/)
@@ -17,17 +19,6 @@ FlexRAG is a flexible, high-performance framework designed for Retrieval-Augmented Generation (RAG) tasks
- [✨ Framework Features](#-框架特色)
- [📢 News](#-最新消息)
- [🚀 Getting Started](#-框架入门)
  - [Step 0. Installation](#步骤0-安装)
    - [Install via `pip`](#pip安装)
    - [Install from source](#源码安装)
  - [Step 1. Prepare the Retriever](#步骤1-准备检索器)
    - [Download the Corpus](#下载知识库)
    - [Build the Index](#构建索引)
  - [Step 2. Run the FlexRAG Assistant](#步骤2-运行-flexrag-assistant)
    - [Run the Modular Assistant with a GUI](#使用-gui-运行-modular-assistant)
    - [Run and Evaluate the Modular Assistant on Knowledge-Intensive Datasets](#在知识密集型数据集上运行并测试-modular-assistant)
    - [Develop Your Own RAG Assistant](#开发您自己的-rag-assistant)
    - [Develop Your Own RAG Application](#开发您自己的-rag-应用)
- [🏗️ FlexRAG Architecture](#️-flexrag-架构)
- [📊 Benchmarks](#-基准测试)
- [🏷️ License](#️-许可证)
@@ -44,268 +35,21 @@
- **Lightweight**: FlexRAG is designed with minimal overhead, making it efficient and easy to integrate into your project.

# 📢 News
- **2025-01-22**: The new command-line entrypoint `run_retriever` and a large set of new information-retrieval metrics (such as `RetrievalMAP`) are now available. See the [documentation](https://flexrag.readthedocs.io/en/latest/) for details; a sketch of how these TREC-style metrics are computed follows this list.
- **2025-01-08**: FlexRAG now supports Windows; you can install it directly with `pip install flexrag`.
- **2025-01-08**: FlexRAG benchmarks on single-hop QA datasets are now available; see the [benchmarks](benchmarks/README.md) page for details.
- **2025-01-05**: The FlexRAG [documentation](https://flexrag.readthedocs.io/en/latest/) is now online.
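
As the commit log notes, the retrieval metrics are now computed with `pytrec_eval`, so scores such as MAP, precision, and recall follow the usual qrels/run convention from TREC evaluation. Below is a minimal sketch of that computation using `pytrec_eval` directly on toy data; the FlexRAG metric classes (e.g. `RetrievalMAP`) wrap this kind of calculation, and their exact import paths and configuration are assumptions best checked against the documentation.
```python
# A minimal sketch of TREC-style retrieval metrics with pytrec_eval.
# The qrels/run dictionaries are toy data, not FlexRAG output.
import pytrec_eval

# Relevance judgements: query id -> {doc id: graded relevance}.
qrels = {
    "q1": {"d1": 1, "d3": 1},
    "q2": {"d2": 1},
}
# Retrieval run: query id -> {doc id: retrieval score}.
run = {
    "q1": {"d1": 12.3, "d2": 9.1, "d3": 7.5},
    "q2": {"d1": 3.2, "d2": 2.8},
}

# Parameterized measures use dot notation ("P.5"); result keys use
# underscores ("P_5").
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "P.5", "recall.5"})
per_query = evaluator.evaluate(run)

# Macro-average each metric over queries, which is what aggregate
# scores such as MAP report.
for metric in ("map", "P_5", "recall_5"):
    avg = sum(scores[metric] for scores in per_query.values()) / len(per_query)
    print(f"{metric}: {avg:.4f}")
```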

# 🚀 Getting Started

## Step 0. Installation

### Install via `pip`
Install FlexRAG with `pip`:
```bash
pip install flexrag
```

### Install from source
Alternatively, you can install FlexRAG from source:
```bash
pip install pybind11

git clone https://github.com/ictnlp/flexrag.git
cd flexrag
pip install ./
```
You can also install FlexRAG in editable mode by adding the `-e` flag (i.e. `pip install -e ./`).


## Step 1. Prepare the Retriever

### Download the Corpus
Before building your RAG application, you need to prepare a corpus. In this example, we use the Wikipedia corpus provided by [DPR](https://github.com/facebookresearch/DPR), which you can download with the following commands:
```bash
# Download the corpus
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
# Unzip the corpus
gzip -d psgs_w100.tsv.gz
```
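
Before indexing, it can be useful to confirm the corpus layout. The DPR `psgs_w100.tsv` file is tab-separated and is expected to contain `id`, `text`, and `title` columns, which is why the indexing commands below save the `title` and `text` fields. Here is a minimal, standard-library-only sketch for inspecting it (the relative file path is an assumption):
```python
# A minimal sketch: peek at the first rows of the DPR corpus to
# confirm the expected column layout (id, text, title).
import csv
import itertools

with open("psgs_w100.tsv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    print("columns:", reader.fieldnames)
    for row in itertools.islice(reader, 3):
        print(row["id"], row["title"], row["text"][:80], "...")
```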

### Build the Index
After downloading the corpus, you need to build an index for the retriever. To use a dense retriever, run the following command to build the index:
```bash
CORPUS_PATH='[psgs_w100.tsv]'
CORPUS_FIELDS='[title,text]'
DB_PATH=<path_to_database>

python -m flexrag.entrypoints.prepare_index \
    corpus_path=$CORPUS_PATH \
    saving_fields=$CORPUS_FIELDS \
    retriever_type=dense \
    dense_config.database_path=$DB_PATH \
    dense_config.encode_fields=[text] \
    dense_config.passage_encoder_config.encoder_type=hf \
    dense_config.passage_encoder_config.hf_config.model_path='facebook/contriever' \
    dense_config.passage_encoder_config.hf_config.device_id=[0,1,2,3] \
    dense_config.index_type=faiss \
    dense_config.faiss_config.batch_size=4096 \
    dense_config.faiss_config.log_interval=100000 \
    dense_config.batch_size=4096 \
    dense_config.log_interval=100000 \
    reinit=True
```
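
Once the dense index is built, you can sanity-check it from Python before wiring it into an assistant. The following is a minimal sketch that mirrors the application example later in this README; the `<path_to_database>` placeholder and the sample query are illustrative.
```python
# A minimal sketch: load the freshly built dense index and run one query.
from flexrag.models import HFEncoderConfig
from flexrag.retriever import DenseRetriever, DenseRetrieverConfig

cfg = DenseRetrieverConfig(database_path="<path_to_database>", top_k=5)
cfg.query_encoder_config.encoder_type = "hf"
cfg.query_encoder_config.hf_config = HFEncoderConfig(model_path="facebook/contriever")
retriever = DenseRetriever(cfg)

# `search` accepts a query and returns one list of contexts per query.
contexts = retriever.search("Who wrote On the Origin of Species?")[0]
for ctx in contexts:
    print(ctx.data["text"][:100])
```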

To use a sparse retriever instead, run the following command to build the index:
```bash
CORPUS_PATH='[psgs_w100.tsv]'
CORPUS_FIELDS='[title,text]'
DB_PATH=<path_to_database>

python -m flexrag.entrypoints.prepare_index \
    corpus_path=$CORPUS_PATH \
    saving_fields=$CORPUS_FIELDS \
    retriever_type=bm25s \
    bm25s_config.database_path=$DB_PATH \
    bm25s_config.indexed_fields=[title,text] \
    bm25s_config.method=lucene \
    bm25s_config.batch_size=512 \
    bm25s_config.log_interval=100000 \
    reinit=True
```

## Step 2. Run the FlexRAG Assistant
Once the index is ready, you can run any `Assistant` provided by FlexRAG. The following examples show how to run the `Modular Assistant`.

### Run the Modular Assistant with a GUI
```bash
python -m flexrag.entrypoints.run_interactive \
    assistant_type=modular \
    modular_config.used_fields=[title,text] \
    modular_config.retriever_type=dense \
    modular_config.dense_config.top_k=5 \
    modular_config.dense_config.database_path=${DB_PATH} \
    modular_config.dense_config.query_encoder_config.encoder_type=hf \
    modular_config.dense_config.query_encoder_config.hf_config.model_path='facebook/contriever' \
    modular_config.dense_config.query_encoder_config.hf_config.device_id=[0] \
    modular_config.response_type=short \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name='gpt-4o-mini' \
    modular_config.openai_config.api_key=$OPENAI_KEY \
    modular_config.do_sample=False
```

### Run and Evaluate the Modular Assistant on Knowledge-Intensive Datasets
You can easily evaluate your RAG Assistant on a variety of knowledge-intensive datasets. The following command evaluates the `modular assistant` with a dense retriever on the Natural Questions (NQ) dataset:
```bash
OUTPUT_PATH=<path_to_output>
DB_PATH=<path_to_database>
OPENAI_KEY=<your_openai_key>

python -m flexrag.entrypoints.run_assistant \
    data_path=flash_rag/nq/test.jsonl \
    output_path=${OUTPUT_PATH} \
    assistant_type=modular \
    modular_config.used_fields=[title,text] \
    modular_config.retriever_type=dense \
    modular_config.dense_config.top_k=10 \
    modular_config.dense_config.database_path=${DB_PATH} \
    modular_config.dense_config.query_encoder_config.encoder_type=hf \
    modular_config.dense_config.query_encoder_config.hf_config.model_path='facebook/contriever' \
    modular_config.dense_config.query_encoder_config.hf_config.device_id=[0] \
    modular_config.response_type=short \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name='gpt-4o-mini' \
    modular_config.openai_config.api_key=$OPENAI_KEY \
    modular_config.do_sample=False \
    eval_config.metrics_type=[retrieval_success_rate,generation_f1,generation_em] \
    eval_config.retrieval_success_rate_config.context_preprocess.processor_type=[simplify_answer] \
    eval_config.retrieval_success_rate_config.eval_field=text \
    eval_config.response_preprocess.processor_type=[simplify_answer] \
    log_interval=10
```

Similarly, you can evaluate the `modular assistant` with a sparse retriever on the Natural Questions dataset:
```bash
OUTPUT_PATH=<path_to_output>
DB_PATH=<path_to_database>
OPENAI_KEY=<your_openai_key>

python -m flexrag.entrypoints.run_assistant \
    data_path=flash_rag/nq/test.jsonl \
    output_path=${OUTPUT_PATH} \
    assistant_type=modular \
    modular_config.used_fields=[title,text] \
    modular_config.retriever_type=bm25s \
    modular_config.bm25s_config.top_k=10 \
    modular_config.bm25s_config.database_path=${DB_PATH} \
    modular_config.response_type=short \
    modular_config.generator_type=openai \
    modular_config.openai_config.model_name='gpt-4o-mini' \
    modular_config.openai_config.api_key=$OPENAI_KEY \
    modular_config.do_sample=False \
    eval_config.metrics_type=[retrieval_success_rate,generation_f1,generation_em] \
    eval_config.retrieval_success_rate_config.context_preprocess.processor_type=[simplify_answer] \
    eval_config.retrieval_success_rate_config.eval_field=text \
    eval_config.response_preprocess.processor_type=[simplify_answer] \
    log_interval=10
```

You can also evaluate your own assistant by adding the `user_module=<your_module_path>` argument to the command line.

### Develop Your Own RAG Assistant
You can also create your own RAG Assistant by importing the required FlexRAG modules. Here is an example of how to build a RAG Assistant:
```python
from dataclasses import dataclass

from flexrag.assistant import ASSISTANTS, AssistantBase
from flexrag.models import OpenAIGenerator, OpenAIGeneratorConfig
from flexrag.prompt import ChatPrompt, ChatTurn
from flexrag.retriever import DenseRetriever, DenseRetrieverConfig


@dataclass
class SimpleAssistantConfig(DenseRetrieverConfig, OpenAIGeneratorConfig): ...


@ASSISTANTS("simple", config_class=SimpleAssistantConfig)
class SimpleAssistant(AssistantBase):
    def __init__(self, config: SimpleAssistantConfig):
        # The config inherits from both the retriever and the generator
        # configs, so a single object configures both components.
        self.retriever = DenseRetriever(config)
        self.generator = OpenAIGenerator(config)
        return

    def answer(self, question: str) -> str:
        prompt = ChatPrompt()
        # Retrieve the top-k contexts for the question.
        context = self.retriever.search(question)[0]
        # Pack the question and the retrieved contexts into one user turn.
        prompt_str = ""
        for ctx in context:
            prompt_str += f"Question: {question}\nContext: {ctx.data['text']}"
        prompt.update(ChatTurn(role="user", content=prompt_str))
        # Generate the answer and record it in the chat history.
        response = self.generator.chat([prompt])[0][0]
        prompt.update(ChatTurn(role="assistant", content=response))
        return response
```
After defining `SimpleAssistant` and registering it with the `ASSISTANTS` decorator, you can run your Assistant as follows:
```bash
DB_PATH=<path_to_database>
OPENAI_KEY=<your_openai_key>
DATA_PATH=<path_to_data>
MODULE_PATH=<path_to_simple_assistant_module>

python -m flexrag.entrypoints.run_assistant \
    user_module=${MODULE_PATH} \
    data_path=${DATA_PATH} \
    assistant_type=simple \
    simple_config.model_name='gpt-4o-mini' \
    simple_config.api_key=${OPENAI_KEY} \
    simple_config.database_path=${DB_PATH} \
    simple_config.index_type=faiss \
    simple_config.query_encoder_config.encoder_type=hf \
    simple_config.query_encoder_config.hf_config.model_path='facebook/contriever' \
    simple_config.query_encoder_config.hf_config.device_id=[0] \
    eval_config.metrics_type=[retrieval_success_rate,generation_f1,generation_em] \
    eval_config.retrieval_success_rate_config.eval_field=text \
    eval_config.response_preprocess.processor_type=[simplify_answer] \
    log_interval=10
```
We also provide several examples in the [flexrag_examples](https://github.com/ictnlp/flexrag_examples) repository that show in detail how to build RAG assistants with the FlexRAG framework.

### Develop Your Own RAG Application
Besides running your RAG Assistant through FlexRAG's built-in entrypoints, you can also use FlexRAG directly to build your own RAG application. Here is an example of how to build a RAG application:
```python
from flexrag.models import HFEncoderConfig, OpenAIGenerator, OpenAIGeneratorConfig
from flexrag.prompt import ChatPrompt, ChatTurn
from flexrag.retriever import DenseRetriever, DenseRetrieverConfig


def main():
    # Initialize the retriever
    retriever_cfg = DenseRetrieverConfig(database_path="path_to_database", top_k=1)
    retriever_cfg.query_encoder_config.encoder_type = "hf"
    retriever_cfg.query_encoder_config.hf_config = HFEncoderConfig(
        model_path="facebook/contriever"
    )
    retriever = DenseRetriever(retriever_cfg)

    # Initialize the generator
    generator = OpenAIGenerator(
        OpenAIGeneratorConfig(
            model_name="gpt-4o-mini", api_key="your_openai_key", do_sample=False
        )
    )

    # Run your RAG application
    prompt = ChatPrompt()
    while True:
        query = input("Please input your query (type `exit` to exit): ")
        if query == "exit":
            break
        # Retrieve contexts and pack them into a single user turn.
        context = retriever.search(query)[0]
        prompt_str = ""
        for ctx in context:
            prompt_str += f"Question: {query}\nContext: {ctx.data['text']}"
        prompt.update(ChatTurn(role="user", content=prompt_str))
        # `chat` takes a batch of prompts and returns a list of candidate
        # responses per prompt, as in the SimpleAssistant example above.
        response = generator.chat([prompt])[0][0]
        prompt.update(ChatTurn(role="assistant", content=response))
        print(response)
    return


if __name__ == "__main__":
    main()
```
For more examples of building RAG applications with FlexRAG, see the [flexrag_examples](https://github.com/ictnlp/flexrag_examples) repository.

Visit our [documentation](https://flexrag.readthedocs.io/en/latest/) to learn more:
- [Installation](https://flexrag.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://flexrag.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Command-Line Entrypoints](https://flexrag.readthedocs.io/en/latest/tutorial/entrypoints.html)

# 🏗️ FlexRAG Architecture
FlexRAG features a **modular** architecture that lets you easily customize and extend the framework to meet your specific needs. The figure below illustrates the FlexRAG architecture:
*(FlexRAG architecture diagram)*
