diff --git a/.github/workflows/docs.yaml b/.github/workflows/docs.yaml new file mode 100644 index 00000000..0cd34770 --- /dev/null +++ b/.github/workflows/docs.yaml @@ -0,0 +1,30 @@ +name: ci +on: + push: + branches: + - main + +permissions: + contents: write + +jobs: + deploy: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - name: Configure Git Credentials + run: | + git config user.name github-actions[bot] + git config user.email 41898282+github-actions[bot]@users.noreply.github.com + - uses: actions/setup-python@v4 + with: + python-version: 3.x + - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV + - uses: actions/cache@v3 + with: + key: mkdocs-material-${{ env.cache_id }} + path: .cache + restore-keys: | + mkdocs-material- + - run: pip install mkdocs-material + - run: mkdocs gh-deploy --force diff --git a/README.md b/README.md index 3b2c92c1..014521ce 100644 --- a/README.md +++ b/README.md @@ -1,66 +1,19 @@ # Fish Speech -**Documentation is under construction, English is not fully supported yet.** - -[中文文档](README.zh.md) - This codebase is released under BSD-3-Clause License, and all models are released under CC-BY-NC-SA-4.0 License. Please refer to [LICENSE](LICENSE) for more details. -## Disclaimer -We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws. - -## Requirements -- GPU memory: 2GB (for inference), 24GB (for finetuning) -- System: Linux (full functionality), Windows (inference only, flash-attn is not supported, torch.compile is not supported) - -Therefore, we strongly recommend to use WSL2 or docker to run the codebase for Windows users. 
-
-## Setup
-```bash
-# Basic environment setup
-conda create -n fish-speech python=3.10
-conda activate fish-speech
-conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
-
-# Install flash-attn (for linux)
-pip3 install ninja && MAX_JOBS=4 pip3 install flash-attn --no-build-isolation
-
-# Install fish-speech
-pip3 install -e .
-```
-
-## Inference (CLI)
-Download required `vqgan` and `text2semantic` model from our huggingface repo.
-
-```bash
-wget https://huggingface.co/fishaudio/speech-lm-v1/raw/main/vqgan-v1.pth -O checkpoints/vqgan-v1.pth
-wget https://huggingface.co/fishaudio/speech-lm-v1/blob/main/text2semantic-400m-v0.2-4k.pth -O checkpoints/text2semantic-400m-v0.2-4k.pth
-```
-
-Generate semantic tokens from text:
-```bash
-python tools/llama/generate.py \
-    --text "Hello" \
-    --num-samples 2 \
-    --compile
-```
-
-You may want to use `--compile` to fuse cuda kernels faster inference (~25 tokens/sec -> ~300 tokens/sec).
+此代码库根据 BSD-3-Clause 许可证发布, 所有模型根据 CC-BY-NC-SA-4.0 许可证发布。请参阅 [LICENSE](LICENSE) 了解更多细节。
-Generate vocals from semantic tokens:
-```bash
-python tools/vqgan/inference.py -i codes_0.npy
-```
+## Disclaimer / 免责声明
+We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
+我们不对代码库的任何非法使用承担任何责任。请参阅您当地有关 DMCA (数字千年版权法案) 及其他相关法律的规定。
-## Rust Data Server
-Since loading and shuffle the dataset is very slow and memory consuming, we use a rust server to load and shuffle the dataset. 
The server is based on GRPC and can be installed by
+## Documents / 文档
+- [English](https://speech.fish.audio/en/)
+- [中文](https://speech.fish.audio/zh/)
-
-```bash
-cd data_server
-cargo build --release
-```
-## Credits
+## Credits / 鸣谢
 - [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
 - [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
 - [GPT VITS](https://github.com/innnky/gpt-vits)
diff --git a/figs/diagram.png b/docs/assets/figs/diagram.png
similarity index 100%
rename from figs/diagram.png
rename to docs/assets/figs/diagram.png
diff --git a/docs/en/index.md b/docs/en/index.md
new file mode 100644
index 00000000..7df7585f
--- /dev/null
+++ b/docs/en/index.md
@@ -0,0 +1,3 @@
+# Welcome to Fish Speech
+
+The English documentation is under construction.
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 00000000..4f3fee9f
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,4 @@
+---
+template: redirect.html
+location: /zh/
+---
diff --git a/docs/requirements.txt b/docs/requirements.txt
new file mode 100644
index 00000000..4c8f017d
--- /dev/null
+++ b/docs/requirements.txt
@@ -0,0 +1 @@
+mkdocs-material
diff --git a/docs/zh/index.md b/docs/zh/index.md
new file mode 100644
index 00000000..a8cea7b8
--- /dev/null
+++ b/docs/zh/index.md
@@ -0,0 +1,85 @@
+# 介绍
+
+此代码库根据 BSD-3-Clause 许可证发布, 所有模型根据 CC-BY-NC-SA-4.0 许可证发布。请参阅 [LICENSE](LICENSE) 了解更多细节。
+
+ +
+
+## 免责声明
+我们不对代码库的任何非法使用承担任何责任。请参阅您当地有关 DMCA (数字千年版权法案) 及其他相关法律的规定。
+
+## 要求
+- GPU 内存: 2GB (用于推理), 16GB (用于微调)
+- 系统: Linux (全部功能), Windows (仅推理, 不支持 `flash-attn`, 不支持 `torch.compile`)
+
+因此, 我们强烈建议 Windows 用户使用 WSL2 或 docker 来运行代码库。
+
+## 设置
+```bash
+# 基本环境设置
+conda create -n fish-speech python=3.10
+conda activate fish-speech
+conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
+
+# 安装 flash-attn (适用于 Linux)
+pip3 install ninja && MAX_JOBS=4 pip3 install flash-attn --no-build-isolation
+
+# 安装 fish-speech
+pip3 install -e .
+```
+
+## 推理 (命令行)
+
+从我们的 huggingface 仓库下载所需的 `vqgan` 和 `text2semantic` 模型。
+
+```bash
+wget https://huggingface.co/fishaudio/speech-lm-v1/resolve/main/vqgan-v1.pth -O checkpoints/vqgan-v1.pth
+wget https://huggingface.co/fishaudio/speech-lm-v1/resolve/main/text2semantic-400m-v0.2-4k.pth -O checkpoints/text2semantic-400m-v0.2-4k.pth
+```
+
+### 1. [可选] 从语音生成 prompt:
+```bash
+python tools/vqgan/inference.py -i paimon.wav --checkpoint-path checkpoints/vqgan-v1.pth
+```
+
+你应该能得到一个 `fake.npy` 文件。
+
+### 2. 从文本生成语义 token:
+```bash
+python tools/llama/generate.py \
+    --text "要转换的文本" \
+    --prompt-text "你的参考文本" \
+    --prompt-tokens "fake.npy" \
+    --checkpoint-path "checkpoints/text2semantic-400m-v0.2-4k.pth" \
+    --num-samples 2 \
+    --compile
+```
+
+该命令会在工作目录下创建 `codes_N` 文件, 其中 N 是从 0 开始的整数。
+您可能希望使用 `--compile` 来融合 CUDA 内核以实现更快的推理 (~30 个 token/秒 -> ~500 个 token/秒)。
+
+### 3. 从语义 token 生成人声:
+```bash
+python tools/vqgan/inference.py -i codes_0.npy --checkpoint-path checkpoints/vqgan-v1.pth
+```
+
+## Rust 数据服务器
+由于加载和打乱数据集非常缓慢且占用内存, 因此我们使用 rust 服务器来加载和打乱数据。该服务器基于 gRPC, 可以通过以下方式安装:
+
+```bash
+cd data_server
+cargo build --release
+```
+
+## 更新日志
+
+- 2023/12/17: 更新了 `text2semantic` 模型, 支持无音素模式。
+- 2023/12/13: 测试版发布, 包含 VQGAN 模型和一个基于 LLAMA 的语言模型 (只支持音素)。
+
+## 致谢
+- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
+- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
+- [GPT VITS](https://github.com/innnky/gpt-vits)
+- [MQTTS](https://github.com/b04901014/MQTTS)
+- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
diff --git a/mkdocs.yml b/mkdocs.yml
new file mode 100644
index 00000000..7c3c0089
--- /dev/null
+++ b/mkdocs.yml
@@ -0,0 +1,50 @@
+site_name: Fish Speech
+repo_url: https://github.com/fishaudio/fish-speech
+
+theme:
+  name: material
+  language: en
+  font:
+    code: Roboto Mono
+  features:
+    - navigation.instant
+    - navigation.instant.prefetch
+    - navigation.tracking
+    - search.suggest
+    - search.highlight
+    - search.share
+
+  palette:
+    # Palette toggle for automatic mode
+    - media: "(prefers-color-scheme)"
+      toggle:
+        icon: material/brightness-auto
+        name: Switch to light mode
+
+    # Palette toggle for light mode
+    - media: "(prefers-color-scheme: light)"
+      scheme: default
+      toggle:
+        icon: material/brightness-7
+        name: Switch to dark mode
+      primary: black
+
+    # Palette toggle for dark mode
+    - media: "(prefers-color-scheme: dark)"
+      scheme: slate
+      toggle:
+        icon: material/brightness-4
+        name: Switch to light mode
+      primary: black
+
+extra:
+  homepage: https://speech.fish.audio
+  version:
+    provider: mike
+  alternate:
+    - name: English
+      link: /en/
+      lang: en
+    - name: 中文
+      link: /zh/
+      lang: zh
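A note on the inference handoff documented in the new `docs/zh/index.md`: step 2 (`tools/llama/generate.py`) writes each sample's semantic tokens to a `codes_N.npy` file, and step 3 (`tools/vqgan/inference.py`) reloads that file for decoding to audio. The sketch below illustrates only that `.npy` file interface; the array shape, dtype, and codebook size are hypothetical placeholders chosen for the example, not the real model's dimensions.

```python
import numpy as np

# Hypothetical semantic-token matrix: 2 codebooks x 100 frames,
# with token ids in [0, 1024). Real dimensions depend on the model config.
codes = np.random.randint(0, 1024, size=(2, 100), dtype=np.int64)

# Step 2 (generate.py) saves one file per sample: codes_0.npy, codes_1.npy, ...
np.save("codes_0.npy", codes)

# Step 3 (vqgan/inference.py) reloads the tokens before decoding them to audio.
loaded = np.load("codes_0.npy")
assert loaded.shape == codes.shape and (loaded == codes).all()
```

With `--num-samples 2` as in the docs, step 2 would produce both `codes_0.npy` and `codes_1.npy`, each decoded separately in step 3.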