
[docs]:Add docs about fish agent. (#654)

* [docs]Add docs of Fish Agent.

* [docs]:Fix some issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [docs]Add Chinese docs for Fish Agent

* [docs]fix some issue

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Whale and Dolphin 1 year ago
parent
commit
aaca85b3da

+ 14 - 0
README.md

@@ -34,8 +34,13 @@
 This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please refer to [LICENSE](LICENSE) for more details.
 
 ---
+## Fish Agent
+We are very excited to announce that we have open-sourced our in-house agent demo. You can try it online at [demo](https://fish.audio/demo/live) for instant English chat, or run English and Chinese chat locally by following the [docs](https://speech.fish.audio/start_agent/).
+
+Note that the content is released under a **CC BY-NC-SA 4.0 licence**, and that the demo is an early alpha test version: the inference speed still needs to be optimised, and there are many bugs waiting to be fixed. If you have found a bug or want to fix one, we would be very happy to receive an issue or a pull request.
 
 ## Features
+### Fish Speech
 
 1. **Zero-shot & Few-shot TTS:** Input a 10 to 30-second vocal sample to generate high-quality TTS output. **For detailed guidelines, see [Voice Cloning Best Practices](https://docs.fish.audio/text-to-speech/voice-clone-best-practices).**
 
@@ -53,6 +58,13 @@ This codebase and all models are released under CC-BY-NC-SA-4.0 License. Please
 
 8. **Deploy-Friendly:** Easily set up an inference server with native support for Linux, Windows and MacOS, minimizing speed loss.
 
+### Fish Agent
+1. **Completely End to End:** Automatically integrates the ASR and TTS stages, with no need to plug in other models; it is truly end-to-end, not three-stage (ASR+LLM+TTS).
+
+2. **Timbre Control:** Can use reference audio to control the speech timbre.
+
+3. **Emotional:** The model can generate speech with strong emotion.
+
 ## Disclaimer
 
 We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
@@ -61,6 +73,8 @@ We do not hold any responsibility for any illegal usage of the codebase. Please
 
 [Fish Audio](https://fish.audio)
 
+[Fish Agent](https://fish.audio/demo/live)
+
 ## Quick Start for Local Inference
 
 [inference.ipynb](/inference.ipynb)

+ 0 - 55
Start_Agent.md

@@ -1,55 +0,0 @@
-# How To Start?
-
-### Download Model
-
-You can get the model by:
-
-```bash
-huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
-```
-
-Put them in the 'checkpoints' folder.
-
-You also need the VQGAN weight in the fish-speech-1.4 repo.
-
-So there will be 2 folder in the checkpoints.
-
-The ``checkpoints/fish-speech-1.4`` and ``checkpoints/fish-agent-v0.1-3b``
-
-### Environment Prepare
-
-If you haven't install the environment of Fish-speech, please use:
-
-```bash
-pip install -e .[stable]
-```
-
-### Launch The Agent Demo.
-
-Please use the command below under the main folder:
-
-```bash
-python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
-```
-
-The ``--compile`` args only support Python < 3.12 , which will greatly speed up the token generation.
-
-It won't compile at once (remember).
-
-Then please use the command:
-
-```bash
-python -m tools.e2e_webui
-```
-
-This will create a Gradio WebUI on the device.
-
-When you first use the model, it will come to compile (if the ``--compile`` is True) for a short time, so please wait with patience.
-
-Have a good time!
-
-# About Agent
-
-This model is currently undergoing testing. We welcome suggestions and assistance in improving it.
-
-We are considering refining the tutorial and incorporating it into the main documentation after the testing phase is complete.

BIN
docs/assets/figs/agent_gradio.png


BIN
docs/assets/figs/logo-circle.png


+ 77 - 0
docs/en/start_agent.md

@@ -0,0 +1,77 @@
+# Start Agent
+
+## Requirements
+
+- GPU memory: At least 8GB (with quantization); 16GB or more is recommended.
+- Disk usage: 10GB
+
+## Download Model
+
+You can get the model by:
+
+```bash
+huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
+```
+
+Put them in the `checkpoints` folder.
+
+You also need the fish-speech model, which you can download by following the instructions in [inference](inference.md).
+
+So there will be two folders in `checkpoints`:
+`checkpoints/fish-speech-1.4` and `checkpoints/fish-agent-v0.1-3b`.
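As a quick sanity check after downloading (a hypothetical snippet, not part of the official instructions), you can verify that both checkpoint folders exist:

```shell
# Check for the two checkpoint folders described above.
for d in checkpoints/fish-speech-1.4 checkpoints/fish-agent-v0.1-3b; do
  if [ -d "$d" ]; then
    echo "ok: $d"
  else
    echo "missing: $d"
  fi
done
```

If either line prints `missing`, re-run the corresponding download step.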
+
+## Environment Prepare
+
+If you already have a Fish-speech environment, you can use it directly after installing one extra package:
+```bash
+pip install cachetools
+```
+
+!!! note
+    Please use a Python version below 3.12 so that `--compile` works.
+
+If you don't have one, please use the commands below to build your environment:
+
+```bash
+sudo apt-get install portaudio19-dev
+
+pip install -e .[stable]
+```
+
+## Launch the Agent Demo
+
+To start fish-agent, run the command below from the project root:
+
+```bash
+python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
+```
+
+The `--compile` flag only supports Python < 3.12, and it greatly speeds up token generation.
+
+Note that compilation does not happen immediately at startup.
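Since `--compile` only works on Python < 3.12, a launcher can gate the flag on the interpreter version. This is a hypothetical helper script, not part of the repo; the actual launch line is commented out so the snippet is safe to run on its own:

```shell
# Pass --compile only when the interpreter supports it (Python < 3.12).
if python3 -c 'import sys; raise SystemExit(0 if sys.version_info < (3, 12) else 1)'; then
  COMPILE_FLAG="--compile"
else
  COMPILE_FLAG=""
fi
echo "extra args: $COMPILE_FLAG"
# python3 -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent $COMPILE_FLAG
```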
+
+Then open another terminal and use the command:
+
+```bash
+python -m tools.e2e_webui
+```
+
+This will create a Gradio WebUI on the device.
+
+When you first use the model, it will take a short time to compile (if `--compile` is enabled), so please wait patiently.
+
+## Gradio Webui
+<p align="center">
+   <img src="../assets/figs/agent_gradio.png" width="75%">
+</p>
+
+Have a good time!
+
+## Performance
+
+In our tests, a laptop 4060 can only just run the model, at about 8 tokens/s. A 4090 reaches around 95 tokens/s with compilation, which is what we recommend.
+
+## About Agent
+
+The demo is an early alpha test version: the inference speed still needs to be optimised, and there are many bugs waiting to be fixed. If you have found a bug or want to fix one, we would be very happy to receive an issue or a pull request.

+ 1 - 1
docs/ko/index.md

@@ -1,4 +1,4 @@
-# Introduction
+# 소개
 
 <div>
 <a target="_blank" href="https://discord.gg/Es5qTB9BcN">

+ 1 - 1
docs/zh/index.md

@@ -12,7 +12,7 @@
 </a>
 </div>
 
-!!! warning
+!!! warning "警告"
     我们不对代码库的任何非法使用承担任何责任. 请参阅您当地关于 DMCA (数字千年法案) 和其他相关法律法规. <br/>
     此代码库与所有模型根据 CC-BY-NC-SA-4.0 许可证发布.
 

+ 83 - 0
docs/zh/start_agent.md

@@ -0,0 +1,83 @@
+# 启动 Agent
+
+## 要求
+
+- GPU 显存: 至少 8GB(在量化的条件下),推荐 16GB 及以上
+- 硬盘使用量: 10GB
+
+## 下载模型
+
+你可以执行下面的语句来获取模型:
+
+```bash
+huggingface-cli download fishaudio/fish-agent-v0.1-3b --local-dir checkpoints/fish-agent-v0.1-3b
+```
+
+如果你处于国内网络,首先执行:
+
+```bash
+export HF_ENDPOINT=https://hf-mirror.com
+```
+
+把他们放进名为 'checkpoints' 的文件夹内。
+
+你同样需要 fish-speech 的模型,关于如何获取 fish-speech 模型请查看[inference](inference.md)。
+
+完成后你的 checkpoints 文件夹中会有两个子文件夹:`checkpoints/fish-speech-1.4` 和 `checkpoints/fish-agent-v0.1-3b`。
+
+## 环境准备
+
+如果你已经有了 Fish-Speech 环境,你可以在安装下面的包的前提下直接使用:
+
+```bash
+pip install cachetools
+```
+
+!!! note
+    请使用小于 3.12 的 Python 版本以使 compile 可用。
+
+如果你没有 Fish-Speech 环境,请执行下面的语句来构造你的环境:
+
+```bash
+sudo apt-get install portaudio19-dev
+
+pip install -e .[stable]
+```
+
+## 启动 Agent Demo
+
+请在项目主目录下使用以下指令来启动 fish-agent:
+
+```bash
+python -m tools.api --llama-checkpoint-path checkpoints/fish-agent-v0.1-3b/ --mode agent --compile
+```
+
+`--compile`只能在小于 3.12 版本的 Python 使用,这个功能可以极大程度上提高生成速度。
+
+需要注意,compile 并不会立刻进行。
+
+然后启动另一个终端并执行:
+
+```bash
+python -m tools.e2e_webui
+```
+
+这会在设备上创建一个 Gradio WebUI。
+
+每当进行第一轮对话的时候,模型需要 compile 一段时间,请耐心等待。
+
+## Gradio Webui
+
+<p align="center">
+   <img src="../assets/figs/agent_gradio.png" width="75%">
+</p>
+
+玩得开心!
+
+## 性能
+
+在我们的测试环境下,4060 laptop GPU 只能勉强运行该模型,速度约为 8 tokens/s。4090 GPU 在 compile 后可以达到约 95 tokens/s,我们推荐使用 4080 及以上级别的 GPU 以获得较好体验。
+
+## 关于 Agent
+
+该模型仍处于测试阶段。如果你发现了问题,请给我们提 issue 或者 pull request,我们将非常感谢。

+ 33 - 0
mkdocs.yml

@@ -12,6 +12,7 @@ copyright: Copyright &copy; 2023-2024 by Fish Audio
 
 theme:
   name: material
+  favicon: assets/figs/logo-circle.png
   language: en
   features:
     - content.action.edit
@@ -54,6 +55,13 @@ theme:
       font:
         code: Roboto Mono
 
+nav:
+  - Introduction: index.md
+  - Finetune: finetune.md
+  - Inference: inference.md
+  - Start Agent: start_agent.md
+  - Samples: samples.md
+
 # Plugins
 plugins:
   - search:
@@ -63,6 +71,7 @@ plugins:
         - zh
         - ja
         - pt
+        - ko
   - i18n:
       docs_structure: folder
       languages:
@@ -73,12 +82,36 @@ plugins:
         - locale: zh
           name: 简体中文
           build: true
+          nav:
+            - 介绍: zh/index.md
+            - 微调: zh/finetune.md
+            - 推理: zh/inference.md
+            - 启动 Agent: zh/start_agent.md
+            - 例子: zh/samples.md
         - locale: ja
           name: 日本語
           build: true
+          nav:
+            - Fish Speech の紹介: ja/index.md
+            - 微調整: ja/finetune.md
+            - 推論: ja/inference.md
+            - サンプル: ja/samples.md
         - locale: pt
           name: Português (Brasil)
           build: true
+          nav:
+            - Introdução: pt/index.md
+            - Ajuste Fino: pt/finetune.md
+            - Inferência: pt/inference.md
+            - Amostras: pt/samples.md
+        - locale: ko
+          name: 한국어
+          build: true
+          nav:
+            - 소개: ko/index.md
+            - 파인튜닝: ko/finetune.md
+            - 추론: ko/inference.md
+            - 샘플: ko/samples.md
 
 markdown_extensions:
   - pymdownx.highlight:

+ 21 - 9
tools/e2e_webui.py

@@ -138,16 +138,28 @@ def create_demo():
                     type="messages",
                 )
 
+                # notes = gr.Markdown(
+                #     """
+                # # Fish Agent
+                # 1. 此Demo为Fish Audio自研端到端语言模型Fish Agent 3B版本.
+                # 2. 你可以在我们的官方仓库找到代码以及权重,但是相关内容全部基于 CC BY-NC-SA 4.0 许可证发布.
+                # 3. Demo为早期灰度测试版本,推理速度尚待优化.
+                # # 特色
+                # 1. 该模型自动集成ASR与TTS部分,不需要外挂其它模型,即真正的端到端,而非三段式(ASR+LLM+TTS).
+                # 2. 模型可以使用reference audio控制说话音色.
+                # 3. 可以生成具有较强情感与韵律的音频.
+                # """
+                # )
                 notes = gr.Markdown(
                     """
-                # Fish Agent
-                1. 此Demo为Fish Audio自研端到端语言模型Fish Agent 3B版本.
-                2. 你可以在我们的官方仓库找到代码以及权重,但是相关内容全部基于 CC BY-NC-SA 4.0 许可证发布.
-                3. Demo为早期灰度测试版本,推理速度尚待优化.
-                # 特色
-                1. 该模型自动集成ASR与TTS部分,不需要外挂其它模型,即真正的端到端,而非三段式(ASR+LLM+TTS).
-                2. 模型可以使用reference audio控制说话音色.
-                3. 可以生成具有较强情感与韵律的音频.
+                    # Fish Agent
+                    1. This demo is Fish Audio's self-researched end-to-end language model, Fish Agent version 3B.
+                    2. You can find the code and weights in our official repos on [GitHub](https://github.com/fishaudio/fish-speech) and [Hugging Face](https://huggingface.co/fishaudio/fish-agent-v0.1-3b), but the content is released under a CC BY-NC-SA 4.0 licence.
+                    3. The demo is an early alpha test version; the inference speed still needs to be optimised.
+                    # Features
+                    1. The model automatically integrates the ASR and TTS stages, with no need to plug in other models; it is truly end-to-end, not three-stage (ASR+LLM+TTS).
+                    2. The model can use reference audio to control the speech timbre.
+                    3. The model can generate speech with strong emotion.
                 """
                 )
 
@@ -160,7 +172,7 @@ def create_demo():
                 )
                 sys_text_input = gr.Textbox(
                     label="What is your assistant's role?",
-                    value='您是由 Fish Audio 设计的语音助手,提供端到端的语音交互,实现无缝用户体验。首先转录用户的语音,然后使用以下格式回答:"Question: [用户语音]\n\nResponse: [你的回答]\n"。',
+                    value="You are a voice assistant created by Fish Audio, offering end-to-end voice interaction for a seamless user experience. You are required to first transcribe the user's speech, then answer it in the following format: 'Question: [USER_SPEECH]\n\nAnswer: [YOUR_RESPONSE]\n'. You are required to use the following voice in this conversation.",
                     type="text",
                 )
                 audio_input = gr.Audio(