[docs]:Add docs about fish agent. (#654)

* [docs]Add docs of Fish Agent.

* [docs]:Fix some issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [docs]Add Chinese docs for Fish Agent

* [docs]fix some issue

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Whale and Dolphin, 1 year ago
Parent
Commit aaca85b3da
10 changed files with 230 additions and 66 deletions
  1. README.md (+14, -0)
  2. Start_Agent.md (+0, -55)
  3. docs/assets/figs/agent_gradio.png (binary)
  4. docs/assets/figs/logo-circle.png (binary)
  5. docs/en/start_agent.md (+77, -0)
  6. docs/ko/index.md (+1, -1)
  7. docs/zh/index.md (+1, -1)
  8. docs/zh/start_agent.md (+83, -0)
  9. mkdocs.yml (+33, -0)
  10. tools/e2e_webui.py (+21, -9)

+ 14 - 0
README.md

@@ -34,8 +34,13 @@
 This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please refer to [LICENSE](LICENSE) for more details.
 
 ---
+## Fish Agent
+We are very excited to announce that we have open-sourced our in-house agent demo. You can try it online at [demo](https://fish.audio/demo/live) for instant English chat, or run it locally for English and Chinese chat by following the [docs](https://speech.fish.audio/start_agent/).
+
+Please note that the content is released under a **CC BY-NC-SA 4.0 licence**. The demo is an early alpha version: inference speed still needs to be optimised, and there are many bugs waiting to be fixed. If you find a bug or want to fix one, we'd be very happy to receive an issue or a pull request.
 
 ## Features
+### Fish Speech
 
 1. **Zero-shot & Few-shot TTS:** Input a 10 to 30-second vocal sample to generate high-quality TTS output. **For detailed guidelines, see [Voice Cloning Best Practices](https://docs.fish.audio/text-to-speech/voice-clone-best-practices).**
 
@@ -53,6 +58,13 @@ This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please
 
 8. **Deploy-Friendly:** Easily set up an inference server with native support for Linux, Windows and MacOS, minimizing speed loss.
 
+### Fish Agent
+1. **Completely End-to-End:** ASR and TTS are integrated into the model, so no external models need to be plugged in: it is truly end-to-end rather than three-stage (ASR+LLM+TTS).
+
+2. **Timbre Control:** Reference audio can be used to control the speech timbre.
+
+3. **Emotional:** The model can generate speech with strong emotion.
+
 ## Disclaimer
 
 We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
@@ -61,6 +73,8 @@ We do not hold any responsibility for any illegal usage of the codebase. Please
 
 [Fish Audio](https://fish.audio)
 
+[Fish Agent](https://fish.audio/demo/live)
+
 ## Quick Start for Local Inference
 
 [inference.ipynb](/inference.ipynb)

+ 0 - 55
Start_Agent.md

@@ -1,55 +0,0 @@
-# How To Start?
-
-### Download Model
-
-You can get the model by:
-
-```bash
-huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
-```
-
-Put them in the 'checkpoints' folder.
-
-You also need the VQGAN weight in the fish-speech-1.4 repo.
-
-So there will be 2 folder in the checkpoints.
-
-The ``checkpoints/fish-speech-1.4`` and ``checkpoints/fish-agent-v0.1-3b``
-
-### Environment Prepare
-
-If you haven't install the environment of Fish-speech, please use:
-
-```bash
-pip install -e .[stable]
-```
-
-### Launch The Agent Demo.
-
-Please use the command below under the main folder:
-
-```bash
-python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
-```
-
-The ``--compile`` args only support Python < 3.12 , which will greatly speed up the token generation.
-
-It won't compile at once (remember).
-
-Then please use the command:
-
-```bash
-python -m tools.e2e_webui
-```
-
-This will create a Gradio WebUI on the device.
-
-When you first use the model, it will come to compile (if the ``--compile`` is True) for a short time, so please wait with patience.
-
-Have a good time!
-
-# About Agent
-
-This model is currently undergoing testing. We welcome suggestions and assistance in improving it.
-
-We are considering refining the tutorial and incorporating it into the main documentation after the testing phase is complete.

BIN
docs/assets/figs/agent_gradio.png


BIN
docs/assets/figs/logo-circle.png


+ 77 - 0
docs/en/start_agent.md

@@ -0,0 +1,77 @@
+# Start Agent
+
+## Requirements
+
+- GPU memory: at least 8GB (with quantization); 16GB or more is recommended.
+- Disk usage: 10GB
+
+## Download Model
+
+You can get the model by:
+
+```bash
+huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
+```
+
+Put them in the 'checkpoints' folder.
+
+You also need the fish-speech model, which you can download by following the instructions in [inference](inference.md).
+
+You should then have two folders under `checkpoints`: `checkpoints/fish-speech-1.4` and `checkpoints/fish-agent-v0.1-3b`.
+
+## Environment Setup
+
+If you already have a Fish-Speech environment, you only need to install one additional package:
+```bash
+pip install cachetools
+```
+
+!!! note
+    Use a Python version below 3.12 if you want to use `--compile`.
+
+If you don't have a Fish-Speech environment yet, use the commands below to build one:
+
+```bash
+sudo apt-get install portaudio19-dev
+
+pip install -e .[stable]
+```
+
+## Launch the Agent Demo
+
+To start fish-agent, run the command below from the main folder:
+
+```bash
+python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
+```
+
+The `--compile` argument is only supported on Python < 3.12; it greatly speeds up token generation.
+
+Keep in mind that compilation takes some time.
+
+Then open another terminal and use the command:
+
+```bash
+python -m tools.e2e_webui
+```
+
+This will create a Gradio WebUI on the device.
+
+The first time you use the model, it will spend a short time compiling (if `--compile` is enabled), so please be patient.
+
+## Gradio Webui
+<p align="center">
+   <img src="../assets/figs/agent_gradio.png" width="75%">
+</p>
+
+Have a good time!
+
+## Performance
+
+In our tests, a laptop with a 4060 barely runs the model, reaching only about 8 tokens/s. A 4090 reaches around 95 tokens/s with `--compile`, which is the setup we recommend.
+
+# About Agent
+
+The demo is an early alpha version: inference speed still needs to be optimised, and there are many bugs waiting to be fixed. If you find a bug or want to fix one, we'd be very happy to receive an issue or a pull request.
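
The two prerequisites described in `docs/en/start_agent.md` (both checkpoint folders present, and Python below 3.12 when `--compile` is used) can be checked before launching. The following is a minimal sketch, not part of the commit: the helper names `compile_supported` and `missing_checkpoints` are hypothetical, and the folder names are simply the ones the guide lists.

```python
import sys
from pathlib import Path

# Folder names as listed in the guide; adjust if your layout differs.
REQUIRED_CHECKPOINTS = ("fish-speech-1.4", "fish-agent-v0.1-3b")


def compile_supported(version_info=sys.version_info):
    """Return True when the interpreter is older than Python 3.12."""
    return tuple(version_info[:2]) < (3, 12)


def missing_checkpoints(root="checkpoints"):
    """List the required checkpoint folders that are absent under root."""
    base = Path(root)
    return [name for name in REQUIRED_CHECKPOINTS if not (base / name).is_dir()]
```

Calling `missing_checkpoints()` from the repository root returns an empty list once both downloads are in place; if `compile_supported()` returns `False`, the guide implies launching without `--compile`.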

+ 1 - 1
docs/ko/index.md

@@ -1,4 +1,4 @@
-# Introduction
+# 소개
 
 <div>
 <a target="_blank" href="https://discord.gg/Es5qTB9BcN">

+ 1 - 1
docs/zh/index.md

@@ -12,7 +12,7 @@
 </a>
 </div>
 
-!!! warning
+!!! warning "警告"
     我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律法规. <br/>
     此代码库与所有模型根据 CC-BY-NC-SA-4.0 许可证发布.
 

+ 83 - 0
docs/zh/start_agent.md

@@ -0,0 +1,83 @@
+# 启动 Agent
+
+## 要求
+
+- GPU 显存: 至少 8GB(在量化的条件下),推荐 16GB 及以上
+- 硬盘使用量: 10GB
+
+## 下载模型
+
+你可以执行下面的语句来获取模型:
+
+```bash
+huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
+```
+
+如果你处于国内网络,首先执行:
+
+```bash
+export HF_ENDPOINT=https://hf-mirror.com
+```
+
+把他们放进名为 'checkpoints' 的文件夹内。
+
+你同样需要 fish-speech 的模型,关于如何获取 fish-speech 模型请查看[inference](inference.md)。
+
+完成后你的 checkpoints 文件夹中会有两个子文件夹:`checkpoints/fish-speech-1.4` 和 `checkpoints/fish-agent-v0.1-3b`。
+
+## Environment Prepare
+
+如果你已经有了 Fish-Speech 环境,你可以在安装下面的包的前提下直接使用:
+
+```bash
+pip install cachetools
+```
+
+!!! note
+    请使用小于 3.12 的 Python 版本使 compile 可用
+
+如果你没有 Fish-Speech 环境,请执行下面的语句来构造你的环境:
+
+```bash
+sudo apt-get install portaudio19-dev
+
+pip install -e .[stable]
+```
+
+## 链接 Agent.
+
+你需要使用以下指令来构建 fish-agent
+
+```bash
+python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
+```
+
+`--compile`只能在小于 3.12 版本的 Python 使用,这个功能可以极大程度上提高生成速度。
+
+你需要注意 compile 需要进行一段时间。
+
+然后启动另一个终端并执行:
+
+```bash
+python -m tools.e2e_webui
+```
+
+这会在设备上创建一个 Gradio WebUI。
+
+每当进行第一轮对话的时候,模型需要 compile 一段时间,请耐心等待
+
+## Gradio Webui
+
+<p align="center">
+   <img src="../assets/figs/agent_gradio.png" width="75%">
+</p>
+
+玩得开心!
+
+## Performance
+
+在我们的测试环境下,4060 laptop GPU 只能刚刚运行该模型,只有大概 8 tokens/s。4090 GPU 可以在编译后达到 95 tokens/s,我们推荐使用至少 4080 以上级别的 GPU 来达到较好体验。
+
+# About Agent
+
+该模型仍处于测试阶段。如果你发现了问题,请给我们提 issue 或者 pull request,我们非常感谢。

+ 33 - 0
mkdocs.yml

@@ -12,6 +12,7 @@ copyright: Copyright &copy; 2023-2024 by Fish Audio
 
 theme:
   name: material
+  favicon: assets/figs/logo-circle.png
   language: en
   features:
     - content.action.edit
@@ -54,6 +55,13 @@ theme:
       font:
         code: Roboto Mono
 
+nav:
+  - Introduction: index.md
+  - Finetune: finetune.md
+  - Inference: inference.md
+  - Start Agent: start_agent.md
+  - Samples: samples.md
+
 # Plugins
 plugins:
   - search:
@@ -63,6 +71,7 @@ plugins:
         - zh
         - ja
         - pt
+        - ko
   - i18n:
       docs_structure: folder
       languages:
@@ -73,12 +82,36 @@ plugins:
         - locale: zh
           name: 简体中文
           build: true
+          nav:
+            - 介绍: zh/index.md
+            - 微调: zh/finetune.md
+            - 推理: zh/inference.md
+            - 启动Agent: zh/start_agent.md
+            - 例子: zh/samples.md
         - locale: ja
           name: 日本語
           build: true
+          nav:
+            - Fish Speech の紹介: ja/index.md
+            - 微調整: ja/finetune.md
+            - 推論: ja/inference.md
+            - サンプル: ja/samples.md
         - locale: pt
           name: Português (Brasil)
           build: true
+          nav:
+            - Introdução: pt/index.md
+            - Ajuste Fino: pt/finetune.md
+            - Inferência: pt/inference.md
+            - Amostras: pt/samples.md
+        - locale: ko
+          name: 한국어
+          build: true
+          nav:
+            - 소개: ko/index.md
+            - 파인튜닝: ko/finetune.md
+            - 추론: ko/inference.md
+            - 샘플: ko/samples.md
 
 markdown_extensions:
   - pymdownx.highlight:
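
A `nav` entry that points to a file missing from the docs folder is easy to overlook in review. The following is a rough sketch of a check for that, not part of the commit: `missing_nav_targets` is a hypothetical helper, and it assumes each path resolves relative to the `docs/` folder, which may not match how the i18n plugin actually resolves per-locale paths.

```python
from pathlib import Path


def missing_nav_targets(nav, docs_dir="docs"):
    """Return (title, path) pairs whose Markdown file does not exist under docs_dir.

    `nav` maps a page title to a path relative to docs_dir (an assumption of
    this sketch; it does not parse mkdocs.yml itself).
    """
    root = Path(docs_dir)
    return [(title, rel) for title, rel in nav.items() if not (root / rel).is_file()]
```

Passing the title-to-path pairs of one locale's `nav` block returns the entries that would need a closer look.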

+ 21 - 9
tools/e2e_webui.py

@@ -138,16 +138,28 @@ def create_demo():
                     type="messages",
                 )
 
+                # notes = gr.Markdown(
+                #     """
+                # # Fish Agent
+                # 1. 此Demo为Fish Audio自研端到端语言模型Fish Agent 3B版本.
+                # 2. 你可以在我们的官方仓库找到代码以及权重,但是相关内容全部基于 CC BY-NC-SA 4.0 许可证发布.
+                # 3. Demo为早期灰度测试版本,推理速度尚待优化.
+                # # 特色
+                # 1. 该模型自动集成ASR与TTS部分,不需要外挂其它模型,即真正的端到端,而非三段式(ASR+LLM+TTS).
+                # 2. 模型可以使用reference audio控制说话音色.
+                # 3. 可以生成具有较强情感与韵律的音频.
+                # """
+                # )
                 notes = gr.Markdown(
                     """
-                # Fish Agent
-                1. 此Demo为Fish Audio自研端到端语言模型Fish Agent 3B版本.
-                2. 你可以在我们的官方仓库找到代码以及权重,但是相关内容全部基于 CC BY-NC-SA 4.0 许可证发布.
-                3. Demo为早期灰度测试版本,推理速度尚待优化.
-                # 特色
-                1. 该模型自动集成ASR与TTS部分,不需要外挂其它模型,即真正的端到端,而非三段式(ASR+LLM+TTS).
-                2. 模型可以使用reference audio控制说话音色.
-                3. 可以生成具有较强情感与韵律的音频.
+                    # Fish Agent
+                    1. This demo runs Fish Agent 3B, an end-to-end language model developed in-house by Fish Audio.
+                    2. You can find the code and weights in our official repos on [GitHub](https://github.com/fishaudio/fish-speech) and [Hugging Face](https://huggingface.co/fishaudio/fish-agent-v0.1-3b); note that all content is released under a CC BY-NC-SA 4.0 licence.
+                    3. The demo is an early alpha version; the inference speed still needs to be optimised.
+                    # Features
+                    1. The model integrates ASR and TTS, so no external models need to be plugged in: it is truly end-to-end, not three-stage (ASR+LLM+TTS).
+                    2. The model can use reference audio to control the speech timbre.
+                    3. The model can generate speech with strong emotion.
                 """
                 )
 
@@ -160,7 +172,7 @@ def create_demo():
                 )
                 sys_text_input = gr.Textbox(
                     label="What is your assistant's role?",
-                    value='您是由 Fish Audio 设计的语音助手,提供端到端的语音交互,实现无缝用户体验。首先转录用户的语音,然后使用以下格式回答:"Question: [用户语音]\n\nResponse: [你的回答]\n"。',
+                    value="You are a voice assistant created by Fish Audio, offering end-to-end voice interaction for a seamless user experience. You are required to first transcribe the user's speech, then answer it in the following format: 'Question: [USER_SPEECH]\n\nAnswer: [YOUR_RESPONSE]\n'. You are required to use the following voice in this conversation.",
                     type="text",
                 )
                 audio_input = gr.Audio(
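
The new default system prompt asks the model to reply as `Question: [USER_SPEECH]\n\nAnswer: [YOUR_RESPONSE]\n`. A minimal sketch of splitting such a reply back into its two parts might look like this. `split_reply` is a hypothetical helper, not something this diff shows the web UI doing, and it assumes the model follows the requested format.

```python
import re

# Mirrors the reply format requested in the system prompt.
_REPLY_PATTERN = re.compile(
    r"Question:\s*(?P<question>.*?)\n\nAnswer:\s*(?P<answer>.*)", re.DOTALL
)


def split_reply(text):
    """Split a reply into (question, answer); return None if the format is not followed."""
    match = _REPLY_PATTERN.search(text)
    if match is None:
        return None
    return match.group("question").strip(), match.group("answer").strip()
```

Returning `None` for replies that do not follow the format leaves it to the caller to fall back to showing the raw text.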