[docs]:Add docs about fish agent. (#654)
* [docs]Add docs of Fish Agent.

* [docs]:Fix some issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [docs]Add Chinese docs for Fish Agent

* [docs]fix some issue

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Whale-Dolphin and pre-commit-ci[bot] authored Nov 5, 2024
1 parent ec2c5b7 commit aaca85b
Showing 10 changed files with 230 additions and 66 deletions.
14 changes: 14 additions & 0 deletions README.md
@@ -34,8 +34,13 @@
This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please refer to [LICENSE](LICENSE) for more details.

---
## Fish Agent
We are excited to announce that we have open-sourced our self-developed agent demo. You can try it online at [demo](https://fish.audio/demo/live) for instant English chat, or run English and Chinese chat locally by following the [docs](https://speech.fish.audio/start_agent/).

Please note that the content is released under a **CC BY-NC-SA 4.0 licence**. The demo is an early alpha version: inference speed still needs to be optimised, and there are many bugs waiting to be fixed. If you find a bug or want to fix one, we would be very happy to receive an issue or a pull request.

## Features
### Fish Speech

1. **Zero-shot & Few-shot TTS:** Input a 10 to 30-second vocal sample to generate high-quality TTS output. **For detailed guidelines, see [Voice Cloning Best Practices](https://docs.fish.audio/text-to-speech/voice-clone-best-practices).**

@@ -53,6 +58,13 @@ This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please

8. **Deploy-Friendly:** Easily set up an inference server with native support for Linux, Windows and MacOS, minimizing speed loss.

### Fish Agent
1. **Completely End to End:** Integrates the ASR and TTS parts automatically, with no need to plug in other models, i.e., truly end-to-end rather than three-stage (ASR+LLM+TTS).

2. **Timbre Control:** Can use reference audio to control the speech timbre.

3. **Emotional:** The model can generate speech with strong emotion.

## Disclaimer

We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
@@ -61,6 +73,8 @@ We do not hold any responsibility for any illegal usage of the codebase. Please

[Fish Audio](https://fish.audio)

[Fish Agent](https://fish.audio/demo/live)

## Quick Start for Local Inference

[inference.ipynb](/inference.ipynb)
55 changes: 0 additions & 55 deletions Start_Agent.md

This file was deleted.

Binary file added docs/assets/figs/agent_gradio.png
Binary file added docs/assets/figs/logo-circle.png
77 changes: 77 additions & 0 deletions docs/en/start_agent.md
@@ -0,0 +1,77 @@
# Start Agent

## Requirements

- GPU memory: at least 8GB (with quantization); 16GB or more is recommended.
- Disk usage: 10GB
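
The thresholds above can be turned into a quick pre-flight check. The following is only a minimal sketch: the function names are hypothetical, the GPU size is passed in by hand rather than queried from the driver, and the free-disk check uses the standard-library `shutil.disk_usage`.

```python
import shutil

MIN_GPU_GB = 8           # minimum stated above (with quantization)
RECOMMENDED_GPU_GB = 16  # recommended amount stated above
REQUIRED_DISK_GB = 10    # disk usage stated above


def classify_gpu_memory(gpu_gb: float) -> str:
    """Map a GPU memory size in GB to a coarse verdict (hypothetical helper)."""
    if gpu_gb < MIN_GPU_GB:
        return "insufficient"
    if gpu_gb < RECOMMENDED_GPU_GB:
        return "minimum (quantization needed)"
    return "recommended"


def has_enough_disk(path: str = ".", required_gb: float = REQUIRED_DISK_GB) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb


if __name__ == "__main__":
    print(classify_gpu_memory(12))
    print(has_enough_disk("."))
```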

## Download Model

You can get the model by:

```bash
huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
```

Put the files in the `checkpoints` folder.

You also need the fish-speech model, which you can download by following the instructions in [inference](inference.md).

The `checkpoints` folder will then contain two subfolders: `checkpoints/fish-speech-1.4` and `checkpoints/fish-agent-v0.1-3b`.
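
Before launching anything, it can help to confirm that both model folders are in place. This is a small sketch, not part of the repository: `missing_checkpoints` is a hypothetical helper that only checks that each expected folder exists and is not empty, without validating the model files themselves.

```python
from pathlib import Path

# Folder names taken from the paths mentioned in this document.
EXPECTED_DIRS = ("fish-speech-1.4", "fish-agent-v0.1-3b")


def missing_checkpoints(root: str = "checkpoints") -> list[str]:
    """Return the expected sub-folders that are absent or empty under `root`."""
    missing = []
    for name in EXPECTED_DIRS:
        folder = Path(root) / name
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_checkpoints()
    if problems:
        print("Missing or empty checkpoint folders:", ", ".join(problems))
    else:
        print("Both checkpoint folders are present.")
```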

## Prepare the Environment

If you already have a Fish-Speech environment, you can use it directly after installing one extra package:
```bash
pip install cachetools
```

!!! note
    Use a Python version below 3.12 so that `--compile` works.

If you do not have a Fish-Speech environment yet, build one with the commands below:

```bash
sudo apt-get install portaudio19-dev

pip install -e .[stable]
```

## Launch the Agent Demo

To build fish-agent, run the command below from the repository's main folder:

```bash
python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
```

The `--compile` flag only works with Python < 3.12 and greatly speeds up token generation.

Keep in mind that compilation does not happen right away.
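
Because `--compile` depends on the interpreter version, a launcher script could add the flag only when it is supported. The sketch below assumes nothing beyond the command shown above; `build_api_command` is a hypothetical helper that just assembles the argument list and does not start the server.

```python
import sys


def build_api_command(version_info=sys.version_info) -> list[str]:
    """Assemble the API launch command, adding --compile only on Python < 3.12."""
    cmd = [
        sys.executable, "-m", "tools.api",
        "--llama-checkpoint-path", "checkpoints/fish-agent-v0.1-3b/",
        "--mode", "agent",
    ]
    # The flag is documented as unsupported on Python 3.12 and later.
    if tuple(version_info[:2]) < (3, 12):
        cmd.append("--compile")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_api_command()))
```

The resulting list can be passed to `subprocess.run` from the repository's main folder.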

Then open another terminal and use the command:

```bash
python -m tools.e2e_webui
```

This will create a Gradio WebUI on the device.

The first time you use the model, it spends a short time compiling (if `--compile` is enabled), so please be patient.

## Gradio Webui
<p align="center">
<img src="../assets/figs/agent_gradio.png" width="75%">
</p>

Have a good time!

## Performance

In our tests, a 4060 laptop can only just run the model and is heavily strained, reaching only about 8 tokens/s. A 4090 reaches around 95 tokens/s with compile enabled, which is the setup we recommend.
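
To put the two throughput figures in perspective, the short sketch below converts them into generation time for a reply of a given length. The 500-token reply is a made-up example length, and the calculation ignores compile time and any other overhead.

```python
# Throughput figures quoted above, in tokens per second.
RATES = {"4060 laptop": 8.0, "4090 (compiled)": 95.0}


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate `num_tokens` at a steady rate, ignoring all overhead."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    reply_tokens = 500  # hypothetical reply length, chosen only for illustration
    for gpu, rate in RATES.items():
        print(f"{gpu}: {generation_seconds(reply_tokens, rate):.1f} s")
    speedup = RATES["4090 (compiled)"] / RATES["4060 laptop"]
    print(f"Relative speed: {speedup:.1f}x")
```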

# About Agent

The demo is an early alpha version: inference speed still needs to be optimised, and there are many bugs waiting to be fixed. If you find a bug or want to fix one, we would be very happy to receive an issue or a pull request.
2 changes: 1 addition & 1 deletion docs/ko/index.md
@@ -1,4 +1,4 @@
# Introduction
# 소개

<div>
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
2 changes: 1 addition & 1 deletion docs/zh/index.md
@@ -12,7 +12,7 @@
</a>
</div>

!!! warning
!!! warning "警告"
我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律法规. <br/>
此代码库与所有模型根据 CC-BY-NC-SA-4.0 许可证发布.

83 changes: 83 additions & 0 deletions docs/zh/start_agent.md
@@ -0,0 +1,83 @@
# Start Agent

## Requirements

- GPU memory: at least 8GB (with quantization); 16GB or more is recommended
- Disk usage: 10GB

## Download Model

Run the following command to get the model:

```bash
huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
```

If you are on a network in mainland China, first run:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```
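
The mirror setting can also be applied from a script instead of the shell. This is a minimal sketch under the assumption that the download is driven through `subprocess`; `download_command_and_env` is a hypothetical helper that only prepares the command and environment and does not contact the network itself.

```python
import os


def download_command_and_env(use_mirror: bool, base_env=None):
    """Build the huggingface-cli command and an environment for running it."""
    env = dict(os.environ if base_env is None else base_env)
    if use_mirror:
        # Same endpoint as the export line above.
        env["HF_ENDPOINT"] = "https://hf-mirror.com"
    cmd = [
        "huggingface-cli", "download", "fishaudio/fish-agent-v0.1-3b",
        "--local-dir", "checkpoints/fish-agent-v0.1-3b",
    ]
    return cmd, env


if __name__ == "__main__":
    cmd, env = download_command_and_env(use_mirror=True)
    print(" ".join(cmd))
    print("HF_ENDPOINT =", env["HF_ENDPOINT"])
```

Passing the pair to `subprocess.run(cmd, env=env, check=True)` would then perform the download.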

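Put them in a folder named 'checkpoints'.

You also need the fish-speech model; see [inference](inference.md) for how to get it.

When you are done, your checkpoints folder will contain two subfolders: `checkpoints/fish-speech-1.4` and `checkpoints/fish-agent-v0.1-3b`

## Environment Prepare

If you already have a Fish-Speech environment, you can use it directly after installing the package below: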

```bash
pip install cachetools
```

!!! note
    Use a Python version below 3.12 so that compile is available.

If you do not have a Fish-Speech environment, run the commands below to build one:

```bash
sudo apt-get install portaudio19-dev

pip install -e .[stable]
```

## Launch the Agent

Use the following command to build fish-agent:

```bash
python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
```

`--compile` can only be used with Python versions below 3.12; it greatly improves generation speed.

Note that compilation takes some time.

Then open another terminal and run:

```bash
python -m tools.e2e_webui
```

This will create a Gradio WebUI on the device.

During the first round of conversation, the model needs some time to compile, so please be patient.

## Gradio Webui

<p align="center">
<img src="../assets/figs/agent_gradio.png" width="75%">
</p>

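Have a good time!

## Performance

In our test environment, a 4060 laptop GPU can only just run the model, at roughly 8 tokens/s. A 4090 GPU reaches about 95 tokens/s after compilation. We recommend a GPU of at least the 4080 class for a good experience.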

# About Agent

The model is still in the testing stage. If you find a problem, please open an issue or a pull request; we greatly appreciate it.
33 changes: 33 additions & 0 deletions mkdocs.yml
@@ -12,6 +12,7 @@ copyright: Copyright &copy; 2023-2024 by Fish Audio

theme:
  name: material
  favicon: assets/figs/logo-circle.png
  language: en
  features:
    - content.action.edit
@@ -54,6 +55,13 @@ theme:
  font:
    code: Roboto Mono

nav:
  - Introduction: index.md
  - Finetune: finetune.md
  - Inference: inference.md
  - Start Agent: start_agent.md
  - Samples: samples.md

# Plugins
plugins:
  - search:
@@ -63,6 +71,7 @@ plugins:
        - zh
        - ja
        - pt
        - ko
  - i18n:
      docs_structure: folder
      languages:
@@ -73,12 +82,36 @@ plugins:
        - locale: zh
          name: 简体中文
          build: true
          nav:
            - 介绍: zh/index.md
            - 微调: zh/finetune.md
            - 推理: zh/inference.md
            - 启动Agent: zh/start_agent.md
            - 例子: zh/samples.md
        - locale: ja
          name: 日本語
          build: true
          nav:
            - Fish Speech の紹介: ja/index.md
            - 微調整: ja/finetune.md
            - 推論: ja/inference.md
            - サンプル: ja/samples.md
        - locale: pt
          name: Português (Brasil)
          build: true
          nav:
            - Introdução: pt/index.md
            - Ajuste Fino: pt/finetune.md
            - Inferência: pt/inference.md
            - Amostras: pt/samples.md
        - locale: ko
          name: 한국어
          build: true
          nav:
            - 소개: ko/index.md
            - 파인튜닝: ko/finetune.md
            - 추론: ko/inference.md
            - 샘플: ko/samples.md

markdown_extensions:
  - pymdownx.highlight:
30 changes: 21 additions & 9 deletions tools/e2e_webui.py
@@ -138,16 +138,28 @@ def create_demo():
type="messages",
)

# notes = gr.Markdown(
# """
# # Fish Agent
# 1. 此Demo为Fish Audio自研端到端语言模型Fish Agent 3B版本.
# 2. 你可以在我们的官方仓库找到代码以及权重,但是相关内容全部基于 CC BY-NC-SA 4.0 许可证发布.
# 3. Demo为早期灰度测试版本,推理速度尚待优化.
# # 特色
# 1. 该模型自动集成ASR与TTS部分,不需要外挂其它模型,即真正的端到端,而非三段式(ASR+LLM+TTS).
# 2. 模型可以使用reference audio控制说话音色.
# 3. 可以生成具有较强情感与韵律的音频.
# """
# )
notes = gr.Markdown(
"""
# Fish Agent
1. 此Demo为Fish Audio自研端到端语言模型Fish Agent 3B版本.
2. 你可以在我们的官方仓库找到代码以及权重,但是相关内容全部基于 CC BY-NC-SA 4.0 许可证发布.
3. Demo为早期灰度测试版本,推理速度尚待优化.
# 特色
1. 该模型自动集成ASR与TTS部分,不需要外挂其它模型,即真正的端到端,而非三段式(ASR+LLM+TTS).
2. 模型可以使用reference audio控制说话音色.
3. 可以生成具有较强情感与韵律的音频.
# Fish Agent
1. This demo is the 3B version of Fish Agent, Fish Audio's self-developed end-to-end language model.
2. You can find the code and weights in our official repos on [GitHub](https://github.com/fishaudio/fish-speech) and [Hugging Face](https://huggingface.co/fishaudio/fish-agent-v0.1-3b), but the content is released under a CC BY-NC-SA 4.0 licence.
3. The demo is an early alpha version, and the inference speed still needs to be optimised.
# Features
1. The model automatically integrates ASR and TTS parts, no need to plug-in other models, i.e., true end-to-end, not three-stage (ASR+LLM+TTS).
2. The model can use reference audio to control the speech timbre.
3. The model can generate speech with strong emotion.
"""
)

@@ -160,7 +172,7 @@ def create_demo():
)
sys_text_input = gr.Textbox(
label="What is your assistant's role?",
value='您是由 Fish Audio 设计的语音助手,提供端到端的语音交互,实现无缝用户体验。首先转录用户的语音,然后使用以下格式回答:"Question: [用户语音]\n\nResponse: [你的回答]\n"。',
value="You are a voice assistant created by Fish Audio, offering end-to-end voice interaction for a seamless user experience. You are required to first transcribe the user's speech, then answer it in the following format: 'Question: [USER_SPEECH]\n\nAnswer: [YOUR_RESPONSE]\n'. You are required to use the following voice in this conversation.",
type="text",
)
audio_input = gr.Audio(
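
The new system prompt asks the model to reply as `Question: [USER_SPEECH]`, a blank line, then `Answer: [YOUR_RESPONSE]`. A client that wants the transcription and the answer separately could split a reply along those markers. The sketch below is an illustration only: `parse_agent_reply` is a hypothetical helper that is not part of `tools/e2e_webui.py`, and it assumes the model actually follows the requested format, which is not guaranteed.

```python
import re

# Matches the layout requested in the system prompt:
# "Question: ...", a blank line, then "Answer: ...".
RESPONSE_PATTERN = re.compile(
    r"Question:\s*(?P<question>.*?)\n\nAnswer:\s*(?P<answer>.*)",
    re.DOTALL,
)


def parse_agent_reply(text: str):
    """Split a reply into (question, answer); return None if the format is absent."""
    match = RESPONSE_PATTERN.search(text)
    if match is None:
        return None
    return match.group("question").strip(), match.group("answer").strip()


if __name__ == "__main__":
    reply = "Question: What time is it?\n\nAnswer: It is noon.\n"
    print(parse_agent_reply(reply))
```

Returning `None` for unmatched text lets the caller fall back to showing the raw reply when the model ignores the format.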
