feat(service/feature_store.py): optimize usage
tpoisonooo committed Jan 11, 2024
1 parent 5b8e50f commit 57c8ff8
Showing 3 changed files with 31 additions and 24 deletions.
README.md — 24 changes: 13 additions & 11 deletions
@@ -3,8 +3,8 @@

<small> [简体中文](README_zh.md) | English </small>

[![GitHub license](https://img.shields.io/badge/license-BSD--3--Clause-brightgreen.svg)](./LICENSE)
![CI](https://img.shields.io/github/actions/workflow/status/internml/huixiangdou/lint.yml?branch=master)
[![GitHub license](https://img.shields.io/badge/license-BSD--3--Clause-brightgreen.svg?style=plastic)](./LICENSE)
![CI](https://img.shields.io/github/actions/workflow/status/internml/huixiangdou/lint.yml?branch=master&style=plastic)

</div>

@@ -17,13 +17,13 @@ View [HuixiangDou inside](./huixiangdou-inside.md).

# 📦 Hardware Requirements

The following are the hardware requirements for running Fennel. It is suggested to follow the deployment process, starting with the basic version and gradually experiencing advanced features.
The following are the hardware requirements for running HuixiangDou. It is suggested to follow this document, starting with the basic version and gradually trying the advanced features.

| Version | Hardware Requirements | Remarks |
| :-: | :-: | :-: |
| Basic Version | 20GB GPU memory, such as 3090 or above | Able to answer basic domain knowledge questions, zero cost operation |
| Advanced Version | 40GB GPU memory, such as A100 | Able to answer source code level questions, zero cost operation |
| Modified Version | 4GB graphics memory, such as 3050/2080ti | Using openai API to replace local LLM, basic development capability required, operation involves cost |
| Version | GPU Memory Requirements | Remarks | Tested on |
| :-: | :-: | :-: | :-: |
| Basic Version | 20GB | Able to answer basic domain knowledge questions, zero cost operation | ![](https://img.shields.io/badge/linux%203090%2024G-passed-blue?style=for-the-badge) |
| Advanced Version | 40GB | Able to answer source code level questions, zero cost operation | ![](https://img.shields.io/badge/linux%20A100%2080G-passed-blue?style=for-the-badge) |
| Modified Version | 4GB | Using openai API to replace local LLM, basic development capability required, operation involves cost | ![](https://img.shields.io/badge/linux%201660ti%206GB-passed-blue?style=for-the-badge) |

# 🔥 Run

@@ -34,15 +34,17 @@ We will take lmdeploy & mmpose as examples to explain how to deploy the knowledge…
```shell
# Download chat topics
mkdir repodir
git clone https://github.com/openmmlab/mmpose --depth=1 repodir/mmpose
git clone https://github.com/open-mmlab/mmpose --depth=1 repodir/mmpose
git clone https://github.com/internlm/lmdeploy --depth=1 repodir/lmdeploy

# Establish feature repository
cd HuixiangDou && mkdir workdir # Create working directory
python3 -m pip install -r requirements.txt # Install dependencies, python3.11 requires `conda install conda-forge::faiss-gpu`
python3 service/feature_store.py repodir workdir # Save features from repodir to workdir
python3 service/feature_store.py # Save features from repodir to workdir
```

The first run will automatically download the configured [text2vec-large-chinese](https://huggingface.co/GanymedeNil/text2vec-large-chinese) model; you can also download it manually and update the model path in `config.ini`.
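If the automatic download is blocked, one manual route is to fetch the model yourself and point the config at it. A minimal sketch — the destination path is a placeholder, and the exact key to edit should be checked against your own `config.ini`:

```shell
# Manual download sketch; /data/text2vec-large-chinese is a placeholder path
git lfs install
git clone https://huggingface.co/GanymedeNil/text2vec-large-chinese /data/text2vec-large-chinese
# then edit config.ini so the embedding model path points at /data/text2vec-large-chinese
```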

After running, HuixiangDou can distinguish which user topics should be dealt with and which chitchats should be rejected. Please edit [good_questions](./resource/good_questions.json) and [bad_questions](./resource/bad_questions.json), and try your own domain knowledge (medical, finance, electricity, etc.).
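For reference, a hypothetical sketch of editing these files — the schema is an assumption (both files appear to be plain JSON arrays of question strings), and the sample questions are illustrative only:

```shell
# Hypothetical entries — adapt to your own domain, then validate the JSON after editing.
# good_questions.json: questions the assistant should answer, e.g.
#   "How do I export an mmpose model to ONNX?"
# bad_questions.json: chitchat it should reject, e.g.
#   "What's for lunch today?"
python3 -m json.tool resource/good_questions.json
python3 -m json.tool resource/bad_questions.json
```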

@@ -72,7 +74,7 @@ x_api_key = "${YOUR-X-API-KEY}"

Please ensure that the GPU memory is over 20GB (such as 3090 or above). If the memory is low, please modify it according to the FAQ.

The first run will automatically download the configuration of internlm2-7B and text2vec-large-chinese, please ensure network connectivity.
The first run will automatically download the configured internlm2-7B model.

* **Non-docker users**. If you **don't** use docker environment, you can start all services at once.
README_zh.md — 24 changes: 13 additions & 11 deletions
@@ -3,8 +3,8 @@

<small> 简体中文 | [English](README.md) </small>

[![GitHub license](https://img.shields.io/badge/license-BSD--3--Clause-brightgreen.svg)](./LICENSE)
![CI](https://img.shields.io/github/actions/workflow/status/internml/huixiangdou/lint.yml?branch=master)
[![GitHub license](https://img.shields.io/badge/license-BSD--3--Clause-brightgreen.svg?style=plastic)](./LICENSE)
![CI](https://img.shields.io/github/actions/workflow/status/internml/huixiangdou/lint.yml?branch=master&style=plastic)
</div>

“茴香豆”是一个基于 LLM 的领域特定知识助手。特点:
@@ -18,11 +18,11 @@

以下是运行茴香豆的硬件需求。建议遵循部署流程,从基础版开始,逐渐体验高级特性。

| 版本 | 硬件需求 | 备注 |
| :-: | :-: | :-: |
| 基础版 | 20G GPU 显存,如 3090 及以上 | 能够回答领域知识的基础问题,零成本运行 |
| 高级版 | 40G 显存,如 A100 | 能够回答源码级问题,零成本运行 |
| 魔改版 | 4G 显存,如 3050/2080ti | 用 openai API 替代本地 LLM,需要基础开发能力,运行需要费用 |
| 版本 | 硬件需求 | 备注 | 已验证设备 |
| :-: | :-: | :-: | :-: |
| 基础版 | 20GB | 能够回答领域知识的基础问题,零成本运行 | ![](https://img.shields.io/badge/linux%203090%2024G-passed-blue?style=for-the-badge) |
| 高级版 | 40GB | 能够回答源码级问题,零成本运行 | ![](https://img.shields.io/badge/linux%20A100%2080G-passed-blue?style=for-the-badge) |
| 魔改版 | 4GB | 用 openai API 替代本地 LLM,需要基础开发能力,运行需要费用 | ![](https://img.shields.io/badge/linux%201660ti%206GB-passed-blue?style=for-the-badge) |

# 🔥 运行

@@ -32,15 +32,17 @@
```shell
# 下载聊天话题
mkdir repodir
git clone https://github.com/openmmlab/mmpose --depth=1 repodir/mmpose
git clone https://github.com/open-mmlab/mmpose --depth=1 repodir/mmpose
git clone https://github.com/internlm/lmdeploy --depth=1 repodir/lmdeploy

# 建立特征库
cd HuixiangDou && mkdir workdir # 创建工作目录
python3 -m pip install -r requirements.txt # 安装依赖,python3.11 需要 `conda install conda-forge::faiss-gpu`
python3 service/feature_store.py repodir workdir # 把 repodir 的特征保存到 workdir
python3 service/feature_store.py # 把 repodir 的特征保存到 workdir
```
运行结束后,茴香豆能够区分应该处理哪些用户话题,哪些闲聊应该拒绝。请编辑 [good_questions](./resource/good_questions.json)[bad_questions](./resource/bad_questions.json),尝试自己的领域知识(医疗,金融,电力等)。
首次运行将自动下载配置中的 [text2vec-large-chinese](https://huggingface.co/GanymedeNil/text2vec-large-chinese),如果自动下载失败,可以手动下载到本地,然后在 `config.ini` 设置模型路径。

结束后,茴香豆能够区分应该处理哪些用户话题,哪些闲聊应该拒绝。请编辑 [good_questions](./resource/good_questions.json)[bad_questions](./resource/bad_questions.json),尝试自己的领域知识(医疗,金融,电力等)。

```shell
# 接受技术话题
```

@@ -69,7 +71,7 @@ x_api_key = "${YOUR-X-API-KEY}"

请保证 GPU 显存超过 20GB(如 3090 及以上),若显存较低请按 FAQ 修改。

首次运行将自动下载配置中的 internlm2-7B 和 text2vec-large-chinese,请保证网络畅通。
首次运行将自动下载配置中的 internlm2-7B,请保证网络畅通。

* **非 docker 用户**。如果你**不**使用 docker 环境,可以一次启动所有服务。
service/feature_store.py — 7 changes: 5 additions & 2 deletions
@@ -338,6 +338,8 @@ def preprocess(self, repo_dir: str, work_dir: str):
if file.endswith('.md') and 'mdb' not in file:
mds.append(os.path.join(root, file))

if len(mds) < 1:
raise Exception(f'cannot search any markdown file, please check usage: python3 {__file__} workdir repodir')
# copy each file to ./finetune-data/ with new name
for _file in mds:
tmp = _file.replace("/", "_")
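The new guard raises as soon as no markdown file is found under the repository directory. A quick sanity check along these lines (a sketch, not part of the commit) can confirm the clone step actually produced markdown before features are built:

```shell
# Run from the directory that contains repodir; an empty result means feature_store.py will raise.
find repodir -name '*.md' | head
```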
@@ -398,10 +400,11 @@ def parse_args():
def parse_args():
parser = argparse.ArgumentParser(
description='Feature store for processing directories.')
parser.add_argument('work_dir', type=str, help='Working directory.')
parser.add_argument('--work_dir', type=str, default='workdir', help='Working directory.')
parser.add_argument(
'repo_dir',
'--repo_dir',
type=str,
default='repodir',
help='Root directory where the repositories are located.')
parser.add_argument(
'--good_questions',
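With `work_dir` and `repo_dir` turned into optional flags with defaults, the simplified README command and the explicit form should be equivalent; a sketch based on the defaults added in this commit (the `/data/...` paths are placeholders):

```shell
# Relies on the new defaults: --work_dir=workdir, --repo_dir=repodir
python3 service/feature_store.py

# Explicit form, e.g. when the repositories live somewhere else
python3 service/feature_store.py --repo_dir /data/repodir --work_dir /data/workdir
```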
