
[docs]:Add docs about fish agent. (#654)

* [docs]Add docs of Fish Agent.

* [docs]:Fix some issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [docs]Add Chinese docs for Fish Agent

* [docs]fix some issue

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Whale and Dolphin 1 year ago
parent
commit
aaca85b3da

+ 14 - 0
README.md

@@ -34,8 +34,13 @@
 This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please refer to [LICENSE](LICENSE) for more details.
 
 ---
+## Fish Agent
+We are very excited to announce that we have open-sourced our in-house agent demo. You can try it online at [demo](https://fish.audio/demo/live) for instant English chat, or run English and Chinese chat locally by following the [docs](https://speech.fish.audio/start_agent/).
+
+Note that the content is released under a **CC BY-NC-SA 4.0 licence**, and that the demo is an early alpha test version: the inference speed still needs to be optimised, and there are many bugs waiting to be fixed. If you have found a bug or want to fix one, we would be very happy to receive an issue or a pull request.
 
 ## Features
+### Fish Speech
 
 1. **Zero-shot & Few-shot TTS:** Input a 10 to 30-second vocal sample to generate high-quality TTS output. **For detailed guidelines, see [Voice Cloning Best Practices](https://docs.fish.audio/text-to-speech/voice-clone-best-practices).**
 
@@ -53,6 +58,13 @@ This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please
 
 8. **Deploy-Friendly:** Easily set up an inference server with native support for Linux, Windows and MacOS, minimizing speed loss.
 
+### Fish Agent
+1. **Completely End to End:** Automatically integrates the ASR and TTS stages, with no need to plug in other models; it is truly end-to-end, not three-stage (ASR+LLM+TTS).
+
+2. **Timbre Control:** Can use reference audio to control the speech timbre.
+
+3. **Emotional:** The model can generate speech with strong emotion.
+
 ## Disclaimer
 
 We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
@@ -61,6 +73,8 @@ We do not hold any responsibility for any illegal usage of the codebase. Please
 
 [Fish Audio](https://fish.audio)
 
+[Fish Agent](https://fish.audio/demo/live)
+
 ## Quick Start for Local Inference
 
 [inference.ipynb](/inference.ipynb)

+ 0 - 55
Start_Agent.md

@@ -1,55 +0,0 @@
-# How To Start?
-
-### Download Model
-
-You can get the model by:
-
-```bash
-huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
-```
-
-Put them in the 'checkpoints' folder.
-
-You also need the VQGAN weight in the fish-speech-1.4 repo.
-
-So there will be 2 folder in the checkpoints.
-
-The ``checkpoints/fish-speech-1.4`` and ``checkpoints/fish-agent-v0.1-3b``
-
-### Environment Prepare
-
-If you haven't install the environment of Fish-speech, please use:
-
-```bash
-pip install -e .[stable]
-```
-
-### Launch The Agent Demo.
-
-Please use the command below under the main folder:
-
-```bash
-python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
-```
-
-The ``--compile`` args only support Python < 3.12 , which will greatly speed up the token generation.
-
-It won't compile at once (remember).
-
-Then please use the command:
-
-```bash
-python -m tools.e2e_webui
-```
-
-This will create a Gradio WebUI on the device.
-
-When you first use the model, it will come to compile (if the ``--compile`` is True) for a short time, so please wait with patience.
-
-Have a good time!
-
-# About Agent
-
-This model is currently undergoing testing. We welcome suggestions and assistance in improving it.
-
-We are considering refining the tutorial and incorporating it into the main documentation after the testing phase is complete.

BIN
docs/assets/figs/agent_gradio.png


BIN
docs/assets/figs/logo-circle.png


+ 77 - 0
docs/en/start_agent.md

@@ -0,0 +1,77 @@
+# Start Agent
+
+## Requirements
+
+- GPU memory: At least 8GB (with quantization); 16GB or more is recommended.
+- Disk usage: 10GB
+
+## Download Model
+
+You can get the model by:
+
+```bash
+huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
+```
+
+Put them in the `checkpoints` folder.
+
+You also need the fish-speech model, which you can download by following the instructions in [inference](inference.md).
+
+So there will be two folders in `checkpoints`:
+`checkpoints/fish-speech-1.4` and `checkpoints/fish-agent-v0.1-3b`.
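As a quick sanity check after downloading (a hypothetical snippet, not part of the official instructions), you can verify that both checkpoint folders exist:

```shell
# Check for the two checkpoint folders described above.
for d in checkpoints/fish-speech-1.4 checkpoints/fish-agent-v0.1-3b; do
  if [ -d "$d" ]; then
    echo "ok: $d"
  else
    echo "missing: $d"
  fi
done
```

If either line prints `missing`, re-run the corresponding download step.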
+
+## Environment Prepare
+
+If you already have a Fish-speech environment, you can use it directly after installing one extra package:
+```bash
+pip install cachetools
+```
+
+!!! note
+    Please use a Python version below 3.12 so that `--compile` works.
+
+If you don't have one, please use the commands below to build your environment:
+
+```bash
+sudo apt-get install portaudio19-dev
+
+pip install -e .[stable]
+```
+
+## Launch the Agent Demo
+
+To start fish-agent, run the command below from the project root:
+
+```bash
+python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
+```
+
+The `--compile` flag only supports Python < 3.12, and it greatly speeds up token generation.
+
+Note that compilation does not happen immediately at startup.
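Since `--compile` only works on Python < 3.12, a launcher can gate the flag on the interpreter version. This is a hypothetical helper script, not part of the repo; the actual launch line is commented out so the snippet is safe to run on its own:

```shell
# Pass --compile only when the interpreter supports it (Python < 3.12).
if python3 -c 'import sys; raise SystemExit(0 if sys.version_info < (3, 12) else 1)'; then
  COMPILE_FLAG="--compile"
else
  COMPILE_FLAG=""
fi
echo "extra args: $COMPILE_FLAG"
# python3 -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent $COMPILE_FLAG
```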
+
+Then open another terminal and use the command:
+
+```bash
+python -m tools.e2e_webui
+```
+
+This will create a Gradio WebUI on the device.
+
+When you first use the model, it will take a short time to compile (if `--compile` is enabled), so please wait patiently.
+
+## Gradio Webui
+<p align="center">
+   <img src="../assets/figs/agent_gradio.png" width="75%">
+</p>
+
+Have a good time!
+
+## Performance
+
+In our tests, a laptop 4060 can only just run the model, at about 8 tokens/s. A 4090 reaches around 95 tokens/s with compilation, which is what we recommend.
+
+## About Agent
+
+The demo is an early alpha test version: the inference speed still needs to be optimised, and there are many bugs waiting to be fixed. If you have found a bug or want to fix one, we would be very happy to receive an issue or a pull request.

+ 1 - 1
docs/ko/index.md

@@ -1,4 +1,4 @@
-# Introduction
+# 소개
 
 <div>
 <a target="_blank" href="https://discord.gg/Es5qTB9BcN">

+ 1 - 1
docs/zh/index.md

@@ -12,7 +12,7 @@
 </a>
 </div>
 
-!!! warning
+!!! warning "警告"
     我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律法规. <br/>
     此代码库与所有模型根据 CC-BY-NC-SA-4.0 许可证发布.
 

+ 83 - 0
docs/zh/start_agent.md

@@ -0,0 +1,83 @@
+# 启动 Agent
+
+## 要求
+
+- GPU 显存: 至少 8GB(在量化的条件下),推荐 16GB 及以上
+- 硬盘使用量: 10GB
+
+## 下载模型
+
+你可以执行下面的语句来获取模型:
+
+```bash
+huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
+```
+
+如果你处于国内网络,首先执行:
+
+```bash
+export HF_ENDPOINT=https://hf-mirror.com
+```
+
+把他们放进名为 'checkpoints' 的文件夹内。
+
+你同样需要 fish-speech 的模型,关于如何获取 fish-speech 模型请查看[inference](inference.md)。
+
+完成后你的 checkpoints 文件夹中会有两个子文件夹:`checkpoints/fish-speech-1.4` 和 `checkpoints/fish-agent-v0.1-3b`。
+
+## 环境准备
+
+如果你已经有了 Fish-Speech 环境,你可以在安装下面的包的前提下直接使用:
+
+```bash
+pip install cachetools
+```
+
+!!! note
+    请使用小于 3.12 的 Python 版本以使 compile 可用。
+
+如果你没有 Fish-Speech 环境,请执行下面的语句来构造你的环境:
+
+```bash
+sudo apt-get install portaudio19-dev
+
+pip install -e .[stable]
+```
+
+## 启动 Agent Demo
+
+请在项目主目录下使用以下指令来启动 fish-agent:
+
+```bash
+python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
+```
+
+`--compile`只能在小于 3.12 版本的 Python 使用,这个功能可以极大程度上提高生成速度。
+
+需要注意,compile 并不会立刻进行。
+
+然后启动另一个终端并执行:
+
+```bash
+python -m tools.e2e_webui
+```
+
+这会在设备上创建一个 Gradio WebUI。
+
+每当进行第一轮对话的时候,模型需要 compile 一段时间,请耐心等待。
+
+## Gradio Webui
+
+<p align="center">
+   <img src="../assets/figs/agent_gradio.png" width="75%">
+</p>
+
+玩得开心!
+
+## 性能
+
+在我们的测试环境下,4060 laptop GPU 只能勉强运行该模型,速度约为 8 tokens/s。4090 GPU 在 compile 后可以达到约 95 tokens/s,我们推荐使用 4080 及以上级别的 GPU 以获得较好体验。
+
+## 关于 Agent
+
+该模型仍处于测试阶段。如果你发现了问题,请给我们提 issue 或者 pull request,我们将非常感谢。

+ 33 - 0
mkdocs.yml

@@ -12,6 +12,7 @@ copyright: Copyright &copy; 2023-2024 by Fish Audio
 
 theme:
   name: material
+  favicon: assets/figs/logo-circle.png
   language: en
   features:
     - content.action.edit
@@ -54,6 +55,13 @@ theme:
       font:
         code: Roboto Mono
 
+nav:
+  - Introduction: index.md
+  - Finetune: finetune.md
+  - Inference: inference.md
+  - Start Agent: start_agent.md
+  - Samples: samples.md
+
 # Plugins
 plugins:
   - search:
@@ -63,6 +71,7 @@ plugins:
         - zh
         - ja
         - pt
+        - ko
   - i18n:
       docs_structure: folder
       languages:
@@ -73,12 +82,36 @@ plugins:
         - locale: zh
           name: 简体中文
           build: true
+          nav:
+            - 介绍: zh/index.md
+            - 微调: zh/finetune.md
+            - 推理: zh/inference.md
+            - 启动 Agent: zh/start_agent.md
+            - 例子: zh/samples.md
         - locale: ja
           name: 日本語
           build: true
+          nav:
+            - Fish Speech の紹介: ja/index.md
+            - 微調整: ja/finetune.md
+            - 推論: ja/inference.md
+            - サンプル: ja/samples.md
         - locale: pt
           name: Português (Brasil)
           build: true
+          nav:
+            - Introdução: pt/index.md
+            - Ajuste Fino: pt/finetune.md
+            - Inferência: pt/inference.md
+            - Amostras: pt/samples.md
+        - locale: ko
+          name: 한국어
+          build: true
+          nav:
+            - 소개: ko/index.md
+            - 파인튜닝: ko/finetune.md
+            - 추론: ko/inference.md
+            - 샘플: ko/samples.md
 
 markdown_extensions:
   - pymdownx.highlight:

+ 21 - 9
tools/e2e_webui.py

@@ -138,16 +138,28 @@ def create_demo():
                     type="messages",
                 )
 
+                # notes = gr.Markdown(
+                #     """
+                # # Fish Agent
+                # 1. 此Demo为Fish Audio自研端到端语言模型Fish Agent 3B版本.
+                # 2. 你可以在我们的官方仓库找到代码以及权重,但是相关内容全部基于 CC BY-NC-SA 4.0 许可证发布.
+                # 3. Demo为早期灰度测试版本,推理速度尚待优化.
+                # # 特色
+                # 1. 该模型自动集成ASR与TTS部分,不需要外挂其它模型,即真正的端到端,而非三段式(ASR+LLM+TTS).
+                # 2. 模型可以使用reference audio控制说话音色.
+                # 3. 可以生成具有较强情感与韵律的音频.
+                # """
+                # )
                 notes = gr.Markdown(
                     """
-                # Fish Agent
-                1. 此Demo为Fish Audio自研端到端语言模型Fish Agent 3B版本.
-                2. 你可以在我们的官方仓库找到代码以及权重,但是相关内容全部基于 CC BY-NC-SA 4.0 许可证发布.
-                3. Demo为早期灰度测试版本,推理速度尚待优化.
-                # 特色
-                1. 该模型自动集成ASR与TTS部分,不需要外挂其它模型,即真正的端到端,而非三段式(ASR+LLM+TTS).
-                2. 模型可以使用reference audio控制说话音色.
-                3. 可以生成具有较强情感与韵律的音频.
+                    # Fish Agent
+                    1. This demo is Fish Audio's self-researched end-to-end language model, Fish Agent version 3B.
+                    2. You can find the code and weights in our official repos on [GitHub](https://github.com/fishaudio/fish-speech) and [Hugging Face](https://huggingface.co/fishaudio/fish-agent-v0.1-3b), but the content is released under a CC BY-NC-SA 4.0 licence.
+                    3. The demo is an early alpha test version; the inference speed still needs to be optimised.
+                    # Features
+                    1. The model automatically integrates the ASR and TTS stages, with no need to plug in other models; it is truly end-to-end, not three-stage (ASR+LLM+TTS).
+                    2. The model can use reference audio to control the speech timbre.
+                    3. The model can generate speech with strong emotion.
                 """
                 )
 
@@ -160,7 +172,7 @@ def create_demo():
                 )
                 sys_text_input = gr.Textbox(
                     label="What is your assistant's role?",
-                    value='您是由 Fish Audio 设计的语音助手,提供端到端的语音交互,实现无缝用户体验。首先转录用户的语音,然后使用以下格式回答:"Question: [用户语音]\n\nResponse: [你的回答]\n"。',
+                    value="You are a voice assistant created by Fish Audio, offering end-to-end voice interaction for a seamless user experience. You are required to first transcribe the user's speech, then answer it in the following format: 'Question: [USER_SPEECH]\n\nAnswer: [YOUR_RESPONSE]\n'. You are required to use the following voice in this conversation.",
                     type="text",
                 )
                 audio_input = gr.Audio(