1 year ago · aaca85b3da
--- a/README.md
+++ b/README.md
@@ -34,8 +34,13 @@
 
															 This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please refer to [LICENSE](LICENSE) for more details.
														
 
															 ---
														
 
															+## Fish Agent
														
 
															+We are very excited to announce that we have open-sourced our self-developed agent demo. You can try it online at [demo](https://fish.audio/demo/live) for instant English chat, or run English and Chinese chat locally by following the [docs](https://speech.fish.audio/start_agent/).
														
 
															+
														
 
															+Please note that the content is released under a **CC BY-NC-SA 4.0 licence**. The demo is an early alpha test version: the inference speed still needs to be optimised, and there are a lot of bugs waiting to be fixed. If you find a bug or want to fix one, we'd be very happy to receive an issue or a pull request.
														
 
															 ## Features
														
 
															+### Fish Speech
														
 
															 1. **Zero-shot & Few-shot TTS:** Input a 10 to 30-second vocal sample to generate high-quality TTS output. **For detailed guidelines, see [Voice Cloning Best Practices](https://docs.fish.audio/text-to-speech/voice-clone-best-practices).**
														
@@ -53,6 +58,13 @@ This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please
 
															 8. **Deploy-Friendly:** Easily set up an inference server with native support for Linux, Windows and MacOS, minimizing speed loss.
														
 
															+### Fish Agent
														
 
															+1. **Completely End to End:** Integrates the ASR and TTS parts automatically, with no need to plug in other models, i.e., true end-to-end, not three-stage (ASR+LLM+TTS).
														
 
															+
														
 
															+2. **Timbre Control:** Can use reference audio to control the speech timbre. 
														
 
															+
														
 
															+3. **Emotional:** The model can generate speech with strong emotion.
														
 
															+
														
 
															 ## Disclaimer
														
 
															 We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
														
@@ -61,6 +73,8 @@ We do not hold any responsibility for any illegal usage of the codebase. Please
 
															 [Fish Audio](https://fish.audio)
														
 
															+[Fish Agent](https://fish.audio/demo/live)
														
 
															+
														
 
															 ## Quick Start for Local Inference
														
 
															 [inference.ipynb](/inference.ipynb)
														
--- a/Start_Agent.md
+++ b/Start_Agent.md
@@ -1,55 +0,0 @@
 
															-# How To Start?
														
 
															-
														
 
															-### Download Model
														
 
															-
														
 
															-You can get the model by:
														
 
															-
														
 
															-```bash
														
 
															-huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
														
 
															-```
														
 
															-
														
 
															-Put them in the 'checkpoints' folder.
														
 
															-
														
 
															-You also need the VQGAN weight in the fish-speech-1.4 repo.
														
 
															-
														
 
															-So there will be 2 folder in the checkpoints.
														
 
															-
														
 
															-The ``checkpoints/fish-speech-1.4`` and ``checkpoints/fish-agent-v0.1-3b``
														
 
															-
														
 
															-### Environment Prepare
														
 
															-
														
 
															-If you haven't install the environment of Fish-speech, please use:
														
 
															-
														
 
															-```bash
														
 
															-pip install -e .[stable]
														
 
															-```
														
 
															-
														
 
															-### Launch The Agent Demo.
														
 
															-
														
 
															-Please use the command below under the main folder:
														
 
															-
														
 
															-```bash
														
 
															-python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
														
 
															-```
														
 
															-
														
 
															-The ``--compile`` args only support Python < 3.12 , which will greatly speed up the token generation.
														
 
															-
														
 
															-It won't compile at once (remember).
														
 
															-
														
 
															-Then please use the command:
														
 
															-
														
 
															-```bash
														
 
															-python -m tools.e2e_webui
														
 
															-```
														
 
															-
														
 
															-This will create a Gradio WebUI on the device.
														
 
															-
														
 
															-When you first use the model, it will come to compile (if the ``--compile`` is True) for a short time, so please wait with patience.
														
 
															-
														
 
															-Have a good time!
														
 
															-
														
 
															-# About Agent
														
 
															-
														
 
															-This model is currently undergoing testing. We welcome suggestions and assistance in improving it.
														
 
															-
														
 
															-We are considering refining the tutorial and incorporating it into the main documentation after the testing phase is complete.
														
--- a/docs/assets/figs/agent_gradio.png
+++ b/docs/assets/figs/agent_gradio.png
--- a/docs/assets/figs/logo-circle.png
+++ b/docs/assets/figs/logo-circle.png
--- a/docs/en/start_agent.md
+++ b/docs/en/start_agent.md
@@ -0,0 +1,77 @@
 
															+# Start Agent
														
 
															+
														
 
															+## Requirements
														
 
															+
														
 
															+- GPU memory: At least 8GB (with quantization); 16GB or more is recommended.
														
 
															+- Disk usage: 10GB
														
 
															+
														
 
															+## Download Model
														
 
															+
														
 
															+You can get the model by:
														
 
															+
														
 
															+```bash
														
 
															+huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
														
 
															+```
														
 
															+
														
 
															+Put them in the 'checkpoints' folder.
														
 
															+
														
 
															+You also need the fish-speech model, which you can download by following the instructions in [inference](inference.md).
														
 
															+
														
 
															+So there will be 2 folders in `checkpoints`:
														
 
															+
														
 
															+`checkpoints/fish-speech-1.4` and `checkpoints/fish-agent-v0.1-3b`.
														
 
															+
														
 
															+## Environment Preparation
														
 
															+
														
 
															+If you already have the Fish-Speech environment, you can use it directly after installing one additional package:
														
 
															+```bash
														
 
															+pip install cachetools
														
 
															+```
														
 
															+
														
 
															+!!! note
														
 
															+    Please use a Python version below 3.12 so that `--compile` is available.
														
 
															+
														
 
															+If you don't have the environment yet, please use the commands below to build it:
														
 
															+
														
 
															+```bash
														
 
															+sudo apt-get install portaudio19-dev
														
 
															+
														
 
															+pip install -e .[stable]
														
 
															+```
														
 
															+
														
 
															+## Launch The Agent Demo
														
 
															+
														
 
															+To build fish-agent, please use the command below under the main folder:
														
 
															+
														
 
															+```bash
														
 
															+python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
														
 
															+```
														
 
															+
														
 
															+The `--compile` argument is only supported on Python < 3.12; it greatly speeds up token generation.
														
 
															+
														
 
															+Keep in mind that compilation is not instant; it takes a while.
														
 
															+
														
 
															+Then open another terminal and use the command:
														
 
															+
														
 
															+```bash
														
 
															+python -m tools.e2e_webui
														
 
															+```
														
 
															+
														
 
															+This will create a Gradio WebUI on the device.
														
 
															+
														
 
															+The first time you use the model, it will spend a short time compiling (if `--compile` is enabled), so please be patient.
														
 
															+
														
 
															+## Gradio WebUI
														
 
															+<p align="center">
														
 
															+   <img src="../assets/figs/agent_gradio.png" width="75%">
														
 
															+</p>
														
 
															+
														
 
															+Have a good time!
														
 
															+
														
 
															+## Performance
														
 
															+
														
 
															+In our tests, a 4060 laptop can only just run the model, at about 8 tokens/s. A 4090 reaches around 95 tokens/s with `--compile` enabled, which is what we recommend.
														
 
															+
														
 
															+# About Agent
														
 
															+
														
 
															+The demo is an early alpha test version: the inference speed still needs to be optimised, and there are a lot of bugs waiting to be fixed. If you find a bug or want to fix one, we'd be very happy to receive an issue or a pull request.
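
The English guide added in this file makes two preconditions easy to miss: both checkpoint folders must exist, and `--compile` is only supported on Python below 3.12. The following is a minimal sketch of a preflight helper, not part of the repository. The function names `missing_checkpoints` and `build_launch_command` are hypothetical; the folder names and command-line flags are the ones quoted in the guide.

```python
import sys
from pathlib import Path

# Folder names taken from the guide; both must sit under the checkpoints root.
REQUIRED_CHECKPOINTS = ("fish-speech-1.4", "fish-agent-v0.1-3b")


def missing_checkpoints(root: Path) -> list[str]:
    """Return the required checkpoint folders that are absent under root."""
    return [name for name in REQUIRED_CHECKPOINTS if not (root / name).is_dir()]


def build_launch_command(version: tuple[int, int], root: Path = Path("checkpoints")) -> list[str]:
    """Assemble the tools.api command, adding --compile only below Python 3.12."""
    command = [
        sys.executable, "-m", "tools.api",
        "--llama-checkpoint-path", f"{root.as_posix()}/fish-agent-v0.1-3b/",
        "--mode", "agent",
    ]
    if version < (3, 12):
        command.append("--compile")
    return command


if __name__ == "__main__":
    missing = missing_checkpoints(Path("checkpoints"))
    if missing:
        print("Missing checkpoint folders:", ", ".join(missing))
    # Print the command instead of running it, so the helper has no side effects.
    print(" ".join(build_launch_command(sys.version_info[:2])))
```

Printing the command rather than executing it keeps the helper safe to run anywhere; pass the resulting list to `subprocess.run` if you want it to start the server.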
														
--- a/docs/ko/index.md
+++ b/docs/ko/index.md
@@ -1,4 +1,4 @@
 
															-# Introduction
														
 
															+# 소개
														
 
															 <div>
														
 
															 <a target="_blank" href="https://discord.gg/Es5qTB9BcN">
														
--- a/docs/zh/index.md
+++ b/docs/zh/index.md
@@ -12,7 +12,7 @@
 
															 </a>
														
 
															 </div>
														
 
															-!!! warning
														
 
															+!!! warning "警告"
														
 
															     我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律法规. <br/>
														
 
															     此代码库与所有模型根据 CC-BY-NC-SA-4.0 许可证发布.
														
--- a/docs/zh/start_agent.md
+++ b/docs/zh/start_agent.md
@@ -0,0 +1,83 @@
 
															+# 启动 Agent
														
 
															+
														
 
															+## 要求
														
 
															+
														
 
															+- GPU 显存: 至少 8GB（在量化的条件下），推荐 16GB 及以上
														
 
															+- 硬盘使用量: 10GB
														
 
															+
														
 
															+## 下载模型
														
 
															+
														
 
															+你可以执行下面的语句来获取模型:
														
 
															+
														
 
															+```bash
														
 
															+huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
														
 
															+```
														
 
															+
														
 
															+如果你处于国内网络，首先执行:
														
 
															+
														
 
															+```bash
														
 
															+export HF_ENDPOINT=https://hf-mirror.com
														
 
															+```
														
 
															+
														
 
															+把他们放进名为 'checkpoints' 的文件夹内。
														
 
															+
														
 
															+你同样需要 fish-speech 的模型，关于如何获取 fish-speech 模型请查看[inference](inference.md)。
														
 
															+
														
 
															+完成后你的 checkpoints 文件夹中会有两个子文件夹：`checkpoints/fish-speech-1.4` 和 `checkpoints/fish-agent-v0.1-3b`。
														
 
															+
														
 
															+## Environment Prepare
														
 
															+
														
 
															+如果你已经有了 Fish-Speech 环境，你可以在安装下面的包的前提下直接使用：
														
 
															+
														
 
															+```bash
														
 
															+pip install cachetools
														
 
															+```
														
 
															+
														
 
															+!!! note
														
 
															+    请使用小于 3.12 的 Python 版本使 compile 可用
														
 
															+
														
 
															+如果你没有 Fish-Speech 环境，请执行下面的语句来构造你的环境：
														
 
															+
														
 
															+```bash
														
 
															+sudo apt-get install portaudio19-dev
														
 
															+
														
 
															+pip install -e .[stable]
														
 
															+```
														
 
															+
														
 
															+## 启动 Agent Demo
														
 
															+
														
 
															+你需要使用以下指令来构建 fish-agent
														
 
															+
														
 
															+```bash
														
 
															+python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
														
 
															+```
														
 
															+
														
 
															+`--compile`只能在小于 3.12 版本的 Python 使用，这个功能可以极大程度上提高生成速度。
														
 
															+
														
 
															+你需要注意 compile 需要进行一段时间。
														
 
															+
														
 
															+然后启动另一个终端并执行:
														
 
															+
														
 
															+```bash
														
 
															+python -m tools.e2e_webui
														
 
															+```
														
 
															+
														
 
															+这会在设备上创建一个 Gradio WebUI。
														
 
															+
														
 
															+每当进行第一轮对话的时候，模型需要 compile 一段时间，请耐心等待
														
 
															+
														
 
															+## Gradio Webui
														
 
															+
														
 
															+<p align="center">
														
 
															+   <img src="../assets/figs/agent_gradio.png" width="75%">
														
 
															+</p>
														
 
															+
														
 
															+玩得开心！
														
 
															+
														
 
															+## Performance
														
 
															+
														
 
															+在我们的测试环境下，4060 laptop GPU 只能刚刚运行该模型，只有大概 8 tokens/s。4090 GPU 可以在编译后达到 95 tokens/s，我们推荐使用至少 4080 以上级别的 GPU 来达到较好体验。
														
 
															+
														
 
															+# About Agent
														
 
															+
														
 
															+该模型仍处于测试阶段。如果你发现了问题，请给我们提 issue 或者 pull request，我们非常感谢。
														
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -12,6 +12,7 @@ copyright: Copyright &copy; 2023-2024 by Fish Audio
 
															 theme:
														
 
															   name: material
														
 
															+  favicon: assets/figs/logo-circle.png
														
 
															   language: en
														
 
															   features:
														
 
															     - content.action.edit
														
@@ -54,6 +55,13 @@ theme:
 
															       font:
														
 
															         code: Roboto Mono
														
 
															+nav:
														
 
															+  - Introduction: index.md
														
 
															+  - Finetune: finetune.md
														
 
															+  - Inference: inference.md
														
 
															+  - Start Agent: start_agent.md
														
 
															+  - Samples: samples.md
														
 
															+
														
 
															 # Plugins
														
 
															 plugins:
														
 
															   - search:
														
@@ -63,6 +71,7 @@ plugins:
 
															         - zh
														
 
															         - ja
														
 
															         - pt
														
 
															+        - ko
														
 
															   - i18n:
														
 
															       docs_structure: folder
														
 
															       languages:
														
@@ -73,12 +82,36 @@ plugins:
 
															         - locale: zh
														
 
															           name: 简体中文
														
 
															           build: true
														
 
															+          nav:
														
 
															+            - 介绍: zh/index.md
														
 
															+            - 微调: zh/finetune.md
														
 
															+            - 推理: zh/inference.md
														
 
															+            - 启动Agent: zh/start_agent.md
														
 
															+            - 例子: zh/samples.md
														
 
															         - locale: ja
														
 
															           name: 日本語
														
 
															           build: true
														
 
															+          nav:
														
 
															+            - Fish Speech の紹介: ja/index.md
														
 
															+            - 微調整: ja/finetune.md
														
 
															+            - 推論: ja/inference.md
														
 
															+            - サンプル: ja/samples.md
														
 
															         - locale: pt
														
 
															           name: Português (Brasil)
														
 
															           build: true
														
 
															+          nav:
														
 
															+            - Introdução: pt/index.md
														
 
															+            - Ajuste Fino: pt/finetune.md
														
 
															+            - Inferência: pt/inference.md
														
 
															+            - Amostras: pt/samples.md
														
 
															+        - locale: ko
														
 
															+          name: 한국어
														
 
															+          build: true
														
 
															+          nav:
														
 
															+            - 소개: ko/index.md
														
 
															+            - 파인튜닝: ko/finetune.md
														
 
															+            - 추론: ko/inference.md
														
 
															+            - 샘플: ko/samples.md
														
 
															 markdown_extensions:
														
 
															   - pymdownx.highlight:
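
A nav entry that points at a file that does not exist is easy to miss when reviewing per-locale `nav` blocks like the ones added in this file. The following is a minimal sketch of a check, not part of the repository. It assumes the nav has already been loaded into a list of single-key dictionaries (the shape a YAML loader produces for these entries) and that every path is relative to one docs root. How paths actually resolve depends on the i18n plugin, so `docs_dir`, the function name `missing_nav_targets`, and the sample entries are all illustrative.

```python
from pathlib import Path


def missing_nav_targets(nav: list[dict[str, str]], docs_dir: Path) -> list[str]:
    """Return nav targets that do not exist as files under docs_dir."""
    missing = []
    for entry in nav:
        for target in entry.values():
            if not (docs_dir / target).is_file():
                missing.append(target)
    return missing


if __name__ == "__main__":
    # Hypothetical entries in the same shape as the locale nav blocks.
    sample_nav = [
        {"Introduction": "zh/index.md"},
        {"Start Agent": "zh/start_agent.md"},
    ]
    # "docs" as the root directory is an assumption about the project layout.
    for target in missing_nav_targets(sample_nav, Path("docs")):
        print("nav target not found:", target)
```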
														
--- a/tools/e2e_webui.py
+++ b/tools/e2e_webui.py
@@ -138,16 +138,28 @@ def create_demo():
 
															                     type="messages",
														
 
															                 )
														
 
															+                # notes = gr.Markdown(
														
 
															+                #     """
														
 
															+                # # Fish Agent
														
 
															+                # 1. 此Demo为Fish Audio自研端到端语言模型Fish Agent 3B版本.
														
 
															+                # 2. 你可以在我们的官方仓库找到代码以及权重，但是相关内容全部基于 CC BY-NC-SA 4.0 许可证发布.
														
 
															+                # 3. Demo为早期灰度测试版本，推理速度尚待优化.
														
 
															+                # # 特色
														
 
															+                # 1. 该模型自动集成ASR与TTS部分，不需要外挂其它模型，即真正的端到端，而非三段式(ASR+LLM+TTS).
														
 
															+                # 2. 模型可以使用reference audio控制说话音色.
														
 
															+                # 3. 可以生成具有较强情感与韵律的音频.
														
 
															+                # """
														
 
															+                # )
														
 
															                 notes = gr.Markdown(
														
 
															                     """
														
 
															-                # Fish Agent
														
 
															-                1. 此Demo为Fish Audio自研端到端语言模型Fish Agent 3B版本.
														
 
															-                2. 你可以在我们的官方仓库找到代码以及权重，但是相关内容全部基于 CC BY-NC-SA 4.0 许可证发布.
														
 
															-                3. Demo为早期灰度测试版本，推理速度尚待优化.
														
 
															-                # 特色
														
 
															-                1. 该模型自动集成ASR与TTS部分，不需要外挂其它模型，即真正的端到端，而非三段式(ASR+LLM+TTS).
														
 
															-                2. 模型可以使用reference audio控制说话音色.
														
 
															-                3. 可以生成具有较强情感与韵律的音频.
														
 
															+                    # Fish Agent
														
 
															+                    1. This demo is Fish Agent 3B, an end-to-end language model developed in-house by Fish Audio.
														
 
															+                    2. You can find the code and weights in our official repos on [GitHub](https://github.com/fishaudio/fish-speech) and [Hugging Face](https://huggingface.co/fishaudio/fish-agent-v0.1-3b); note that the content is released under a CC BY-NC-SA 4.0 licence.
														
 
															+                    3. The demo is an early alpha test version; the inference speed still needs to be optimised.
														
 
															+                    # Features
														
 
															+                    1. The model integrates the ASR and TTS parts automatically, with no need to plug in other models, i.e., true end-to-end, not three-stage (ASR+LLM+TTS).
														
 
															+                    2. The model can use reference audio to control the speech timbre. 
														
 
															+                    3. The model can generate speech with strong emotion.
														
 
															                 """
														
 
															                 )
														
@@ -160,7 +172,7 @@ def create_demo():
 
															                 )
														
 
															                 sys_text_input = gr.Textbox(
														
 
															                     label="What is your assistant's role?",
														
 
															-                    value='您是由 Fish Audio 设计的语音助手，提供端到端的语音交互，实现无缝用户体验。首先转录用户的语音，然后使用以下格式回答："Question: [用户语音]\n\nResponse: [你的回答]\n"。',
														
 
															+                    value="You are a voice assistant created by Fish Audio, offering end-to-end voice interaction for a seamless user experience. You are required to first transcribe the user's speech, then answer it in the following format: 'Question: [USER_SPEECH]\n\nAnswer: [YOUR_RESPONSE]\n'. You are required to use the following voice in this conversation.",
														
 
															                     type="text",
														
 
															                 )
														
 
															                 audio_input = gr.Audio(