
Big update on V1.5 (#920)

* Fix Filename.

* Remove old inference interface.

* Remove Chinese Text Normalization.

* Remove Windows bat scripts.

* Update docs (Chinese for example).

* Remove Windows non-professional tutorials and update demo video link.

* Remove ASR tools.

* Remove old tokenizer builder.

* Update docs.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
PoTaTo 1 year ago
Parent
Commit
626df17a0e
42 changed files with 35 additions and 2912 deletions
  1. README.md (+1 -1)
  2. docs/README.ja.md (+2 -6)
  3. docs/README.ko.md (+2 -2)
  4. docs/README.pt-BR.md (+2 -6)
  5. docs/README.zh.md (+3 -1)
  6. docs/en/index.md (+5 -48)
  7. docs/ja/index.md (+4 -46)
  8. docs/ko/index.md (+4 -48)
  9. docs/pt/index.md (+5 -44)
  10. docs/zh/index.md (+5 -43)
  11. docs/zh/inference.md (+0 -9)
  12. fish_speech/inference_engine/__init__.py (+1 -6)
  13. fish_speech/inference_engine/utils.py (+0 -10)
  14. fish_speech/text/chn_text_norm/.gitignore (+0 -114)
  15. fish_speech/text/chn_text_norm/README.md (+0 -36)
  16. fish_speech/text/chn_text_norm/__init__.py (+0 -0)
  17. fish_speech/text/chn_text_norm/basic_class.py (+0 -172)
  18. fish_speech/text/chn_text_norm/basic_constant.py (+0 -30)
  19. fish_speech/text/chn_text_norm/basic_util.py (+0 -342)
  20. fish_speech/text/chn_text_norm/cardinal.py (+0 -32)
  21. fish_speech/text/chn_text_norm/date.py (+0 -75)
  22. fish_speech/text/chn_text_norm/digit.py (+0 -32)
  23. fish_speech/text/chn_text_norm/fraction.py (+0 -35)
  24. fish_speech/text/chn_text_norm/money.py (+0 -43)
  25. fish_speech/text/chn_text_norm/percentage.py (+0 -33)
  26. fish_speech/text/chn_text_norm/telephone.py (+0 -51)
  27. fish_speech/text/chn_text_norm/text.py (+0 -177)
  28. install_env.bat (+0 -184)
  29. run_cmd.bat (+0 -50)
  30. start.bat (+0 -97)
  31. tools/api_client.py (+0 -2)
  32. tools/export_onnx.py (+0 -0)
  33. tools/llama/generate.py (+0 -17)
  34. tools/llama/rebuild_tokenizer.py (+0 -57)
  35. tools/sensevoice/README.md (+0 -59)
  36. tools/sensevoice/__init__.py (+0 -0)
  37. tools/sensevoice/auto_model.py (+0 -573)
  38. tools/sensevoice/fun_asr.py (+0 -332)
  39. tools/sensevoice/vad_utils.py (+0 -61)
  40. tools/vqgan/inference.py (+0 -17)
  41. tools/webui/__init__.py (+1 -19)
  42. tools/webui/inference.py (+0 -2)

+ 1 - 1
README.md

@@ -84,7 +84,7 @@ We do not hold any responsibility for any illegal usage of the codebase. Please
 
 ## Videos
 
-#### V1.4 Demo Video: [Youtube](https://www.youtube.com/watch?v=Ghc8cJdQyKQ)
+#### V1.5 Demo Video: [Watch the video on X (Twitter).](https://x.com/FishAudio/status/1864370933496205728)
 
 ## Documents
 

+ 2 - 6
docs/README.ja.md

@@ -30,7 +30,7 @@
     </a>
 </div>
 
-このコードベースとすべてのモデルは、CC-BY-NC-SA-4.0 ライセンスの下でリリースされています。詳細については、[LICENSE](LICENSE)を参照してください。
+このコードリポジトリはApache 2.0ライセンスの下で公開されており、モデルはCC-BY-NC-SA-4.0ライセンスの下で公開されています。詳細については[LICENSE](../LICENSE)をご参照ください。
 
 ---
 
@@ -59,11 +59,7 @@
 
 ## ビデオ
 
-#### V1.4 デモビデオ: https://www.bilibili.com/video/BV1pu46eVEk7
-
-#### V1.2 デモビデオ: https://www.bilibili.com/video/BV1wz421B71D
-
-#### V1.1 デモビデオ: https://www.bilibili.com/video/BV1zJ4m1K7cj
+#### V1.5 デモビデオ: [Watch the video on X (Twitter).](https://x.com/FishAudio/status/1864370933496205728)
 
 ## ドキュメント
 

+ 2 - 2
docs/README.ko.md

@@ -30,7 +30,7 @@
     </a>
 </div>
 
-이 코드베이스와 모든 모델은 CC-BY-NC-SA-4.0 라이선스에 따라 배포됩니다. 자세한 내용은 [LICENSE](LICENSE)를 참조하시길 바랍니다.
+이 코드 저장소는 Apache 2.0 라이선스 하에 배포되며, 모델은 CC-BY-NC-SA-4.0 라이선스 하에 배포됩니다. 자세한 내용은 [LICENSE](../LICENSE)를 참조하십시오.
 
 ---
 
@@ -66,7 +66,7 @@
 
 ## 영상
 
-#### V1.4 데모 영상: [Youtube](https://www.youtube.com/watch?v=Ghc8cJdQyKQ)
+#### V1.5 데모 영상: [Watch the video on X (Twitter).](https://x.com/FishAudio/status/1864370933496205728)
 
 ## 문서
 

+ 2 - 6
docs/README.pt-BR.md

@@ -31,7 +31,7 @@
     </a>
 </div>
 
-Este código-fonte e os modelos são publicados sob a licença CC-BY-NC-SA-4.0. Consulte [LICENSE](LICENSE) para mais detalhes.
+Este repositório de código é disponibilizado sob a licença Apache 2.0, e o modelo sob a licença CC-BY-NC-SA-4.0. Consulte [LICENSE](../LICENSE) para mais detalhes.
 
 ---
 
@@ -67,11 +67,7 @@ Não nos responsabilizamos por qualquer uso ilegal do código-fonte. Consulte as
 
 ## Vídeos
 
-#### 1.4 Introdução: https://www.bilibili.com/video/BV1pu46eVEk7
-
-#### 1.2 Introdução: https://www.bilibili.com/video/BV1wz421B71D
-
-#### 1.1 Apresentação Técnica: https://www.bilibili.com/video/BV1zJ4m1K7cj
+#### 1.5 Introdução: [Watch the video on X (Twitter).](https://x.com/FishAudio/status/1864370933496205728)
 
 ## Documentação
 

+ 3 - 1
docs/README.zh.md

@@ -33,7 +33,7 @@
 
 </div>
 
-此代码库模型根据 CC-BY-NC-SA-4.0 许可证发布。请参阅 [LICENSE](LICENSE) 了解更多细节.
+此代码库根据 Apache 2.0 许可证发布,模型根据 CC-BY-NC-SA-4.0 许可证发布。请参阅 [LICENSE](../LICENSE) 了解更多细节.
 
 ---
 
@@ -62,6 +62,8 @@
 
 ## 视频
 
+#### 1.5 介绍: https://www.bilibili.com/video/BV1EKiDYBE4o
+
 #### 1.4 介绍: https://www.bilibili.com/video/BV1pu46eVEk7
 
 #### 1.2 介绍: https://www.bilibili.com/video/BV1wz421B71D

+ 5 - 48
docs/en/index.md

@@ -27,6 +27,10 @@
 
 ## Windows Setup
 
+!!! info "Attention"
+
+    We strongly suggest non-professional windows users use our official GUI to run the project. [GUI is here](https://github.com/AnyaCoder/fish-speech-gui).
+
 Professional Windows users may consider using WSL2 or Docker to run the codebase.
 
 ```bash
@@ -44,54 +48,6 @@ pip3 install -e .
 pip install https://github.com/AnyaCoder/fish-speech/releases/download/v0.1.0/triton_windows-0.1.0-py3-none-any.whl
 ```
 
-Non-professional Windows users can consider the following basic methods to run the project without a Linux environment (with model compilation capabilities, i.e., `torch.compile`):
-
-1. Extract the project package.
-2. Click `install_env.bat` to install the environment.
-3. If you want to enable compilation acceleration, follow this step:
-    1. Download the LLVM compiler from the following links:
-        - [LLVM-17.0.6 (Official Site Download)](https://huggingface.co/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
-        - [LLVM-17.0.6 (Mirror Site Download)](https://hf-mirror.com/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
-        - After downloading `LLVM-17.0.6-win64.exe`, double-click to install, select an appropriate installation location, and most importantly, check the `Add Path to Current User` option to add the environment variable.
-        - Confirm that the installation is complete.
-    2. Download and install the Microsoft Visual C++ Redistributable to solve potential .dll missing issues:
-        - [MSVC++ 14.40.33810.0 Download](https://aka.ms/vs/17/release/vc_redist.x64.exe)
-    3. Download and install Visual Studio Community Edition to get MSVC++ build tools and resolve LLVM's header file dependencies:
-        - [Visual Studio Download](https://visualstudio.microsoft.com/zh-hans/downloads/)
-        - After installing Visual Studio Installer, download Visual Studio Community 2022.
-        - As shown below, click the `Modify` button and find the `Desktop development with C++` option to select and download.
-    4. Download and install [CUDA Toolkit 12.x](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64)
-4. Double-click `start.bat` to open the training inference WebUI management interface. If needed, you can modify the `API_FLAGS` as prompted below.
-
-!!! info "Optional"
-
-	Want to start the inference WebUI?
-
-    Edit the `API_FLAGS.txt` file in the project root directory and modify the first three lines as follows:
-    ```
-     --infer
-     # --api
-     # --listen ...
-     ...
-    ```
-
-!!! info "Optional"
-
-	Want to start the API server?
-
-    Edit the `API_FLAGS.txt` file in the project root directory and modify the first three lines as follows:
-
-    ```
-    # --infer
-    --api
-    --listen ...
-    ...
-    ```
-
-!!! info "Optional"
-
-	Double-click `run_cmd.bat` to enter the conda/python command line environment of this project.
-
 ## Linux Setup
 
 See [pyproject.toml](../../pyproject.toml) for details.
@@ -195,6 +151,7 @@ pip install -e .[stable]
 
 ## Changelog
 
+- 2024/12/03: Updated Fish-Speech to 1.5 version, supports more languages, and reaches SOTA in the Open-Source field.
 - 2024/09/10: Updated Fish-Speech to 1.4 version, with an increase in dataset size and a change in the quantizer's n_groups from 4 to 8.
 - 2024/07/02: Updated Fish-Speech to 1.2 version, remove VITS Decoder, and greatly enhanced zero-shot ability.
 - 2024/05/10: Updated Fish-Speech to 1.1 version, implement VITS decoder to reduce WER and improve timbre similarity.

+ 4 - 46
docs/ja/index.md

@@ -27,6 +27,9 @@
 
 ## Windowsセットアップ
 
+!!! info "注意"
+    Windowsの専門ユーザー以外の方には、GUIを使用してプロジェクトを実行することを強くお勧めします。[GUIはこちら](https://github.com/AnyaCoder/fish-speech-gui).
+
 プロフェッショナルなWindowsユーザーは、WSL2またはDockerを使用してコードベースを実行することを検討してください。
 
 ```bash
@@ -44,52 +47,6 @@ pip3 install -e .
 pip install https://github.com/AnyaCoder/fish-speech/releases/download/v0.1.0/triton_windows-0.1.0-py3-none-any.whl
 ```
 
-非プロフェッショナルなWindowsユーザーは、Linux環境なしでプロジェクトを実行するための以下の基本的な方法を検討できます(モデルコンパイル機能、つまり`torch.compile`を使用可能):
-
-1. プロジェクトパッケージを解凍する。
-2. `install_env.bat`をクリックして環境をインストールする。
-3. コンパイルアクセラレーションを有効にしたい場合は、次のステップに従ってください:
-    1. 以下のリンクからLLVMコンパイラをダウンロード:
-        - [LLVM-17.0.6(公式サイトのダウンロード)](https://huggingface.co/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
-        - [LLVM-17.0.6(ミラーサイトのダウンロード)](https://hf-mirror.com/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
-        - `LLVM-17.0.6-win64.exe`をダウンロードした後、ダブルクリックしてインストールし、適切なインストール場所を選択し、最も重要なのは`Add Path to Current User`オプションを選択して環境変数を追加することです。
-        - インストールが完了したことを確認する。
-    2. 欠落している .dll の問題を解決するため、Microsoft Visual C++ Redistributable をダウンロードしてインストールする:
-        - [MSVC++ 14.40.33810.0 ダウンロード](https://aka.ms/vs/17/release/vc_redist.x64.exe)
-    3. Visual Studio Community Editionをダウンロードして、MSVC++ビルドツールを取得し、LLVMのヘッダーファイルの依存関係を解決する:
-        - [Visual Studio ダウンロード](https://visualstudio.microsoft.com/ja/downloads/)
-        - Visual Studio Installerをインストールした後、Visual Studio Community 2022をダウンロード。
-        - 下記のように、`Modify`ボタンをクリックし、`C++によるデスクトップ開発`オプションを選択してダウンロード。
-        - <img src="../assets/figs/VS_1.jpg"/>
-    4. [CUDA Toolkit 12.x](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64)をダウンロードしてインストールする。
-4. `start.bat`をダブルクリックして、トレーニング推論WebUI管理インターフェースを開きます。必要に応じて、以下に示すように`API_FLAGS`を修正できます。
-
-
-!!! info "オプション"
-    推論WebUIを起動しますか?
-    プロジェクトのルートディレクトリにある `API_FLAGS.txt` ファイルを編集し、最初の3行を次のように変更します:
-    ```
-    --infer
-    # --api
-    # --listen ...
-    ...
-    ```
-
-!!! info "オプション"
-    APIサーバーを起動しますか?
-    プロジェクトのルートディレクトリにある `API_FLAGS.txt` ファイルを編集し、最初の3行を次のように変更します:
-    ```
-    # --infer
-    --api
-    --listen ...
-    ...
-    ```
-
-!!! info "オプション"
-    `run_cmd.bat` をダブルクリックして、このプロジェクトの conda/python コマンドライン環境に入ります。
-
-
-
 ## Linux セットアップ
 
 詳細については、[pyproject.toml](../../pyproject.toml)  を参照してください。
@@ -192,6 +149,7 @@ pip install -e .[stable]
 
 ## 変更履歴
 
+- 2024/12/03: Fish-Speech を 1.5にアップデートし、より多くの言語をサポートするようになりました。オープンソース領域ではSOTA(最先端)となっています。
 - 2024/09/10: Fish-Speech を Ver.1.4 に更新し、データセットのサイズを増加させ、quantizer n_groups を 4 から 8 に変更しました。
 - 2024/07/02: Fish-Speech を Ver.1.2 に更新し、VITS デコーダーを削除し、ゼロショット能力を大幅に強化しました。
 - 2024/05/10: Fish-Speech を Ver.1.1 に更新し、VITS デコーダーを実装して WER を減少させ、音色の類似性を向上させました。

+ 4 - 48
docs/ko/index.md

@@ -27,6 +27,9 @@
 
 ## Windows 설정
 
+!!! info "주의"
+    Windows 전문가가 아닌 사용자는 GUI를 통해 프로젝트를 실행할 것을 강력히 권장합니다. [GUI는 여기에서](https://github.com/AnyaCoder/fish-speech-gui) 확인하세요.
+
 고급 Windows 사용자는 WSL2 또는 Docker를 사용하여 코드베이스를 실행하는 것을 고려할 수 있습니다.
 
 ```bash
@@ -44,54 +47,6 @@ pip3 install -e .
 pip install https://github.com/AnyaCoder/fish-speech/releases/download/v0.1.0/triton_windows-0.1.0-py3-none-any.whl
 ```
 
-비전문 Windows 사용자는 Linux 환경 없이 프로젝트를 실행할 수 있는 다음 기본 방법을 고려할 수 있습니다 (모델 컴파일 기능 포함, 즉 `torch.compile`):
-
-1. 프로젝트 패키지 추출.
-2. `install_env.bat`을 클릭하여 환경 설치.
-3. 컴파일 가속을 활성화하려면 아래 단계를 따르세요:
-    1. LLVM 컴파일러 다운로드:
-        - [LLVM-17.0.6 (공식 사이트)](https://huggingface.co/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
-        - [LLVM-17.0.6 (미러 사이트)](https://hf-mirror.com/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
-        - `LLVM-17.0.6-win64.exe`를 다운로드 후 더블클릭하여 설치하고, 설치 경로 선택 시 `Add Path to Current User` 옵션을 체크하여 환경 변수를 추가합니다.
-        - 설치가 완료되었는지 확인합니다.
-    2. Microsoft Visual C++ 재배포 가능 패키지를 다운로드하여 .dll 누락 문제 해결:
-        - [MSVC++ 14.40.33810.0 다운로드](https://aka.ms/vs/17/release/vc_redist.x64.exe)
-    3. Visual Studio Community Edition을 다운로드하여 LLVM의 헤더 파일 의존성을 해결:
-        - [Visual Studio 다운로드](https://visualstudio.microsoft.com/zh-hans/downloads/)
-        - Visual Studio Installer를 설치한 후 Visual Studio Community 2022를 다운로드.
-        - `Desktop development with C++` 옵션을 선택하여 설치.
-    4. [CUDA Toolkit 12.x](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64) 다운로드 및 설치.
-4. `start.bat`을 더블 클릭하여 훈련 추론 WebUI 관리 인터페이스를 엽니다. 필요한 경우 아래 지침에 따라 `API_FLAGS`를 수정할 수 있습니다.
-
-!!! info "Optional"
-
-	추론을 위해 WebUI를 사용하고자 하시나요?
-
-    프로젝트 루트 디렉토리의 `API_FLAGS.txt` 파일을 편집하고 첫 세 줄을 아래와 같이 수정하세요:
-    ```
-     --infer
-     # --api
-     # --listen ...
-     ...
-    ```
-
-!!! info "Optional"
-
-	API 서버를 시작하고 싶으신가요?
-
-    프로젝트 루트 디렉토리의 `API_FLAGS.txt` 파일을 편집하고 첫 세 줄을 아래와 같이 수정하세요:
-
-    ```
-    # --infer
-    --api
-    --listen ...
-    ...
-    ```
-
-!!! info "Optional"
-
-	`run_cmd.bat`을 더블 클릭하여 이 프로젝트의 conda/python 명령줄 환경에 진입할 수 있습니다.
-
 ## Linux 설정
 
 [pyproject.toml](../../pyproject.toml)에서 자세한 내용을 확인하세요.
@@ -193,6 +148,7 @@ pip install -e .[stable]
 
 ## 변경 사항
 
+- 2024/12/03: Fish-Speech를 1.5 로 업데이트하여 더 많은 언어를 지원하게 되었으며, 오픈소스 영역에서 SOTA(최첨단 기술)에 속합니다.
 - 2024/09/10: Fish-Speech 1.4 버전으로 업데이트, 데이터셋 크기 증가 및 양자화기의 n_groups를 4에서 8로 변경.
 - 2024/07/02: Fish-Speech 1.2 버전으로 업데이트, VITS 디코더 제거 및 제로샷 능력 크게 향상.
 - 2024/05/10: Fish-Speech 1.1 버전으로 업데이트, WER 감소 및 음색 유사성을 개선하기 위해 VITS 디코더 구현.

+ 5 - 44
docs/pt/index.md

@@ -27,6 +27,9 @@
 
 ## Configuração do Windows
 
+!!! info "Aviso"
+    Recomendamos fortemente que usuários que não sejam especialistas em Windows usem a GUI para executar o projeto. [A GUI está aqui](https://github.com/AnyaCoder/fish-speech-gui).
+    
 Usuários profissionais do Windows podem considerar o uso do WSL2 ou Docker para executar a base de código.
 
 ```bash
@@ -44,50 +47,6 @@ pip3 install -e .
 pip install https://github.com/AnyaCoder/fish-speech/releases/download/v0.1.0/triton_windows-0.1.0-py3-none-any.whl
 ```
 
-Usuários não profissionais do Windows podem considerar os seguintes métodos básicos para executar o projeto sem um ambiente Linux (com capacidades de compilação de modelo, ou seja, `torch.compile`):
-
-1. Extraia o pacote do projeto.
-2. Clique em `install_env.bat` para instalar o ambiente.
-3. Se você quiser ativar a aceleração de compilação, siga estas etapas:
-    1. Baixe o compilador LLVM nos seguintes links:
-        - [LLVM-17.0.6 (Download do site oficial)](https://huggingface.co/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
-        - [LLVM-17.0.6 (Download do site espelho)](https://hf-mirror.com/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
-        - Após baixar o `LLVM-17.0.6-win64.exe`, clique duas vezes para instalar, selecione um local de instalação apropriado e, o mais importante, marque a opção `Add Path to Current User` para adicionar a variável de ambiente.
-        - Confirme que a instalação foi concluída.
-    2. Baixe e instale o Microsoft Visual C++ Redistributable para resolver possíveis problemas de arquivos .dll ausentes:
-        - [Download do MSVC++ 14.40.33810.0](https://aka.ms/vs/17/release/vc_redist.x64.exe)
-    3. Baixe e instale o Visual Studio Community Edition para obter as ferramentas de compilação do MSVC++ e resolver as dependências dos arquivos de cabeçalho do LLVM:
-        - [Download do Visual Studio](https://visualstudio.microsoft.com/pt-br/downloads/)
-        - Após instalar o Visual Studio Installer, baixe o Visual Studio Community 2022.
-        - Conforme mostrado abaixo, clique no botão `Modificar`, encontre a opção `Desenvolvimento de área de trabalho com C++` e selecione para fazer o download.
-    4. Baixe e instale o [CUDA Toolkit 12.x](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64)
-4. Clique duas vezes em `start.bat` para abrir a interface de gerenciamento WebUI de inferência de treinamento. Se necessário, você pode modificar as `API_FLAGS` conforme mostrado abaixo.
-
-!!! info "Opcional"
-    Você quer iniciar o WebUI de inferência?
-    Edite o arquivo `API_FLAGS.txt` no diretório raiz do projeto e modifique as três primeiras linhas como segue:
-    ```
-    --infer
-    # --api
-    # --listen ...
-    ...
-    ```
-
-!!! info "Opcional"
-    Você quer iniciar o servidor de API?
-    Edite o arquivo `API_FLAGS.txt` no diretório raiz do projeto e modifique as três primeiras linhas como segue:
-
-    ```
-    # --infer
-    --api
-    --listen ...
-    ...
-    ```
-
-!!! info "Opcional"
-    Clique duas vezes em `run_cmd.bat` para entrar no ambiente de linha de comando conda/python deste projeto.
-
-
 ## Configuração para Linux
 
 Para mais detalhes, consulte [pyproject.toml](../../pyproject.toml).
@@ -188,6 +147,8 @@ pip install -e .[stable]
     Se estiver implantando em um servidor, substitua localhost pelo IP do seu servidor.
 
 ## Histórico de Alterações
+
+- 12/03/2024: Atualização do Fish-Speech para 1.5, com suporte para mais idiomas, sendo considerado SOTA (estado da arte) no campo de código aberto.
 - 10/09/2024: Fish-Speech atualizado para a versão 1.4, aumentado o tamanho do conjunto de dados, quantizer n_groups 4 -> 8.
 - 02/07/2024: Fish-Speech atualizado para a versão 1.2, removido o Decodificador VITS e aprimorado consideravelmente a capacidade de zero-shot.
 - 10/05/2024: Fish-Speech atualizado para a versão 1.1, implementado o decodificador VITS para reduzir a WER e melhorar a similaridade de timbre.

+ 5 - 43
docs/zh/index.md

@@ -27,6 +27,10 @@
 
 ## Windows 配置
 
+!!! info "注意"
+    我们强烈建议非Windows专业用户使用GUI运行该项目。[GUI在这里](https://github.com/AnyaCoder/fish-speech-gui).
+
+
 Windows 专业用户可以考虑 WSL2 或 docker 来运行代码库。
 
 ```bash
@@ -44,49 +48,6 @@ pip3 install -e .
 pip install https://github.com/AnyaCoder/fish-speech/releases/download/v0.1.0/triton_windows-0.1.0-py3-none-any.whl
 ```
 
-Windows 非专业用户可考虑以下为免 Linux 环境的基础运行方法(附带模型编译功能,即 `torch.compile`):
-
-1. 解压项目压缩包。
-2. 点击 `install_env.bat` 安装环境。
-3. 若需要开启编译加速则执行这一步:
-    1. 使用如下链接下载 LLVM 编译器。
-        - [LLVM-17.0.6(原站站点下载)](https://huggingface.co/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
-        - [LLVM-17.0.6(镜像站点下载)](https://hf-mirror.com/fishaudio/fish-speech-1/resolve/main/LLVM-17.0.6-win64.exe?download=true)
-        - 下载完 `LLVM-17.0.6-win64.exe` 后,双击进行安装,选择合适的安装位置,最重要的是勾选 `Add Path to Current User` 添加环境变量。
-        - 确认安装完成。
-    2. 下载安装 Microsoft Visual C++ 可再发行程序包,解决潜在 .dll 丢失问题。
-        - [MSVC++ 14.40.33810.0 下载](https://aka.ms/vs/17/release/vc_redist.x64.exe)
-    3. 下载安装 Visual Studio 社区版以获取 MSVC++ 编译工具, 解决 LLVM 的头文件依赖问题。
-        - [Visual Studio 下载](https://visualstudio.microsoft.com/zh-hans/downloads/)
-        - 安装好 Visual Studio Installer 之后,下载 Visual Studio Community 2022
-        - 如下图点击`修改`按钮,找到`使用C++的桌面开发`项,勾选下载
-    4. 下载安装 [CUDA Toolkit 12.x](https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Windows&target_arch=x86_64)
-4. 双击 `start.bat` 打开训练推理 WebUI 管理界面. 如有需要,可照下列提示修改`API_FLAGS`.
-
-!!! info "可选"
-
-    想启动 推理 WebUI 界面?编辑项目根目录下的 `API_FLAGS.txt`, 前三行修改成如下格式:
-    ```
-    --infer
-    # --api
-    # --listen ...
-    ...
-    ```
-
-!!! info "可选"
-
-    想启动 API 服务器?编辑项目根目录下的 `API_FLAGS.txt`, 前三行修改成如下格式:
-    ```
-    # --infer
-    --api
-    --listen ...
-    ...
-    ```
-
-!!! info "可选"
-
-    双击 `run_cmd.bat` 进入本项目的 conda/python 命令行环境
-
 ## Linux 配置
 
 有关详细信息,请参见 [pyproject.toml](../../pyproject.toml)。
@@ -196,6 +157,7 @@ pip install -e .[stable]
 
 ## 更新日志
 
+- 2024/12/03: 更新了 Fish-Speech 到 1.5,增加更多支持语言,在开源领域属于SOTA.
 - 2024/09/10: 更新了 Fish-Speech 到 1.4, 增加了数据集大小, quantizer n_groups 4 -> 8.
 - 2024/07/02: 更新了 Fish-Speech 到 1.2 版本,移除 VITS Decoder,同时极大幅度提升 zero-shot 能力.
 - 2024/05/10: 更新了 Fish-Speech 到 1.1 版本,引入了 VITS Decoder 来降低口胡和提高音色相似度.

+ 0 - 9
docs/zh/inference.md

@@ -29,9 +29,6 @@ HF_ENDPOINT=https://hf-mirror.com huggingface-cli download fishaudio/fish-speech
 !!! note
     如果你打算让模型随机选择音色, 你可以跳过这一步.
 
-!!! warning "未来版本警告"
-    我们保留了从原来路径(tools/vqgan/infernce.py)访问的接口,但是这个接口可能在之后几个版本被删除,请尽快更改你的代码。
-
 ```bash
 python fish_speech/models/vqgan/inference.py \
     -i "paimon.wav" \
@@ -42,9 +39,6 @@ python fish_speech/models/vqgan/inference.py \
 
 ### 2. 从文本生成语义 token:
 
-!!! warning "未来版本警告"
-    我们保留了从原来路径(tools/llama/generate.py)访问的接口,但是这个接口可能在之后几个版本被删除,请尽快更改你的代码。
-
 ```bash
 python fish_speech/models/text2semantic/inference.py \
     --text "要转换的文本" \
@@ -68,9 +62,6 @@ python fish_speech/models/text2semantic/inference.py \
 
 #### VQGAN 解码
 
-!!! warning "未来版本警告"
-    我们保留了从原来路径(tools/vqgan/infernce.py)访问的接口,但是这个接口可能在之后几个版本被删除,请尽快更改你的代码。
-
 ```bash
 python fish_speech/models/vqgan/inference.py \
     -i "codes_0.npy" \

+ 1 - 6
fish_speech/inference_engine/__init__.py

@@ -15,7 +15,6 @@ from fish_speech.models.text2semantic.inference import (
     WrappedGenerateResponse,
 )
 from fish_speech.models.vqgan.modules.firefly import FireflyArchitecture
-from fish_speech.text.chn_text_norm.text import Text as ChnNormedText
 from fish_speech.utils import autocast_exclude_mps, set_seed
 from fish_speech.utils.schema import ServeTTSRequest
 
@@ -150,11 +149,7 @@ class TTSInferenceEngine(ReferenceLoader, VQManager):
         request = dict(
             device=self.decoder_model.device,
             max_new_tokens=req.max_new_tokens,
-            text=(
-                req.text
-                if not req.normalize
-                else ChnNormedText(raw_text=req.text).normalize()
-            ),
+            text=req.text,
             top_p=req.top_p,
             repetition_penalty=req.repetition_penalty,
             temperature=req.temperature,
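
With Chinese text normalization removed, the engine now forwards `req.text` to the generation queue verbatim; the request's `normalize` flag no longer triggers any server-side rewriting. A minimal caller-side sketch — assuming `ServeTTSRequest` can be constructed from `text` alone, and with `my_normalize` as a hypothetical stand-in for whatever normalizer you choose:

```python
from fish_speech.utils.schema import ServeTTSRequest


def my_normalize(raw: str) -> str:
    # Hypothetical placeholder: plug in your own text normalizer here,
    # since the engine no longer normalizes text for you.
    return raw.strip()


# Normalize before building the request; the engine passes text through as-is.
req = ServeTTSRequest(text=my_normalize("今天是2024年12月3日。"))
```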

+ 0 - 10
fish_speech/inference_engine/utils.py

@@ -5,8 +5,6 @@ from typing import Literal, Optional, Tuple
 
 import numpy as np
 
-from fish_speech.text.chn_text_norm.text import Text as ChnNormedText
-
 
 @dataclass
 class InferenceResult:
@@ -15,14 +13,6 @@ class InferenceResult:
     error: Optional[Exception]
 
 
-def normalize_text(user_input: str, use_normalization: bool) -> str:
-    """Normalize user input text if needed."""
-    if use_normalization:
-        return ChnNormedText(raw_text=user_input).normalize()
-    else:
-        return user_input
-
-
 def wav_chunk_header(
     sample_rate: int = 44100, bit_depth: int = 16, channels: int = 1
 ) -> bytes:
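
The `wav_chunk_header` helper survives with the signature shown above. For reference, a minimal sketch of such a header builder using the standard-library `wave` module — an assumption about the implementation, not a quote of it — which writes a RIFF/WAVE header into an in-memory buffer and returns the bytes for use as the leading chunk of a streamed PCM response:

```python
import io
import wave


def wav_chunk_header(
    sample_rate: int = 44100, bit_depth: int = 16, channels: int = 1
) -> bytes:
    # Write an empty WAV file into memory; closing the writer emits a
    # valid RIFF/WAVE header (with zero-length data fields) that can
    # prefix raw PCM frames streamed to the client afterwards.
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(bit_depth // 8)  # bytes per sample
        wav_file.setframerate(sample_rate)
    return buffer.getvalue()
```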

+ 0 - 114
fish_speech/text/chn_text_norm/.gitignore

@@ -1,114 +0,0 @@
-# Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
-
-# C extensions
-*.so
-
-# Distribution / packaging
-.Python
-build/
-develop-eggs/
-dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-*.egg-info/
-.installed.cfg
-*.egg
-MANIFEST
-
-# PyInstaller
-#  Usually these files are written by a python script from a template
-#  before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
-
-# Unit test / coverage reports
-htmlcov/
-.tox/
-.coverage
-.coverage.*
-.cache
-nosetests.xml
-coverage.xml
-*.cover
-.hypothesis/
-.pytest_cache/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
-
-# Sphinx documentation
-docs/_build/
-
-# PyBuilder
-target/
-
-# Jupyter Notebook
-.ipynb_checkpoints
-
-# pyenv
-.python-version
-
-# celery beat schedule file
-celerybeat-schedule
-
-# SageMath parsed files
-*.sage.py
-
-# Environments
-.env
-.venv
-env/
-venv/
-ENV/
-env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
-
-# mypy
-.mypy_cache/
-
-# JetBrains PyCharm
-.idea
-
-# Customize
-references
-url.txt
-
-# Git
-.git

+ 0 - 36
fish_speech/text/chn_text_norm/README.md

@@ -1,36 +0,0 @@
-# This account is no longer in use, see [Atomicoo](https://github.com/atomicoo) for my latest works.
-
-# Chn Text Norm
-
-this is a repository for chinese text normalization (no longer maintained).
-
-## Quick Start ##
-
-### Git Clone Repo ###
-
-git clone this repo to the root directory of your project which need to use it.
-
-    cd /path/to/proj
-    git clone https://github.com/Joee1995/chn-text-norm.git
-
-after that, your doc tree should be:
-```
-proj                     # root of your project
-|--- chn_text_norm       # this chn-text-norm tool
-     |--- text.py
-     |--- ...
-|--- text_normalize.py   # your text normalization code
-|--- ...
-```
-
-### How to Use ? ###
-
-    # text_normalize.py
-    from chn_text_norm.text import *
-    
-    raw_text = 'your raw text'
-    text = Text(raw_text=raw_text).normalize()
-
-### How to add quantums ###
-
-打开test.py,然后你就知道怎么做了。

+ 0 - 0
fish_speech/text/chn_text_norm/__init__.py


+ 0 - 172
fish_speech/text/chn_text_norm/basic_class.py

@@ -1,172 +0,0 @@
-# -*- coding: utf-8 -*-
-"""基本类
-中文字符类
-中文数字/数位类
-中文数字类
-中文数位类
-中文数字系统类
-中文数学符号类
-*中文其他符号类
-"""
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-02"
-
-from fish_speech.text.chn_text_norm.basic_constant import NUMBERING_TYPES
-
-
-class ChineseChar(object):
-    """
-    中文字符
-    每个字符对应简体和繁体,
-    e.g. 简体 = '负', 繁体 = '負'
-    转换时可转换为简体或繁体
-    """
-
-    def __init__(self, simplified, traditional):
-        self.simplified = simplified
-        self.traditional = traditional
-        self.__repr__ = self.__str__
-
-    def __str__(self):
-        return self.simplified or self.traditional or None
-
-    def __repr__(self):
-        return self.__str__()
-
-
-class ChineseNumberUnit(ChineseChar):
-    """
-    中文数字/数位字符
-    每个字符除繁简体外还有一个额外的大写字符
-    e.g. '陆' 和 '陸'
-    """
-
-    def __init__(self, power, simplified, traditional, big_s, big_t):
-        super(ChineseNumberUnit, self).__init__(simplified, traditional)
-        self.power = power
-        self.big_s = big_s
-        self.big_t = big_t
-
-    def __str__(self):
-        return "10^{}".format(self.power)
-
-    @classmethod
-    def create(cls, index, value, numbering_type=NUMBERING_TYPES[1], small_unit=False):
-
-        if small_unit:
-            return ChineseNumberUnit(
-                power=index + 1,
-                simplified=value[0],
-                traditional=value[1],
-                big_s=value[1],
-                big_t=value[1],
-            )
-        elif numbering_type == NUMBERING_TYPES[0]:
-            return ChineseNumberUnit(
-                power=index + 8,
-                simplified=value[0],
-                traditional=value[1],
-                big_s=value[0],
-                big_t=value[1],
-            )
-        elif numbering_type == NUMBERING_TYPES[1]:
-            return ChineseNumberUnit(
-                power=(index + 2) * 4,
-                simplified=value[0],
-                traditional=value[1],
-                big_s=value[0],
-                big_t=value[1],
-            )
-        elif numbering_type == NUMBERING_TYPES[2]:
-            return ChineseNumberUnit(
-                power=pow(2, index + 3),
-                simplified=value[0],
-                traditional=value[1],
-                big_s=value[0],
-                big_t=value[1],
-            )
-        else:
-            raise ValueError(
-                "Counting type should be in {0} ({1} provided).".format(
-                    NUMBERING_TYPES, numbering_type
-                )
-            )
-
-
-class ChineseNumberDigit(ChineseChar):
-    """
-    中文数字字符
-    """
-
-    def __init__(
-        self, value, simplified, traditional, big_s, big_t, alt_s=None, alt_t=None
-    ):
-        super(ChineseNumberDigit, self).__init__(simplified, traditional)
-        self.value = value
-        self.big_s = big_s
-        self.big_t = big_t
-        self.alt_s = alt_s
-        self.alt_t = alt_t
-
-    def __str__(self):
-        return str(self.value)
-
-    @classmethod
-    def create(cls, i, v):
-        return ChineseNumberDigit(i, v[0], v[1], v[2], v[3])
-
-
-class ChineseMath(ChineseChar):
-    """
-    中文数位字符
-    """
-
-    def __init__(self, simplified, traditional, symbol, expression=None):
-        super(ChineseMath, self).__init__(simplified, traditional)
-        self.symbol = symbol
-        self.expression = expression
-        self.big_s = simplified
-        self.big_t = traditional
-
-
-CC, CNU, CND, CM = ChineseChar, ChineseNumberUnit, ChineseNumberDigit, ChineseMath
-
-
-class NumberSystem(object):
-    """
-    中文数字系统
-    """
-
-    pass
-
-
-class MathSymbol(object):
-    """
-    用于中文数字系统的数学符号 (繁/简体), e.g.
-    positive = ['正', '正']
-    negative = ['负', '負']
-    point = ['点', '點']
-    """
-
-    def __init__(self, positive, negative, point):
-        self.positive = positive
-        self.negative = negative
-        self.point = point
-
-    def __iter__(self):
-        for v in self.__dict__.values():
-            yield v
-
-
-# class OtherSymbol(object):
-#     """
-#     其他符号
-#     """
-#
-#     def __init__(self, sil):
-#         self.sil = sil
-#
-#     def __iter__(self):
-#         for v in self.__dict__.values():
-#             yield v

+ 0 - 30
fish_speech/text/chn_text_norm/basic_constant.py

@@ -1,30 +0,0 @@
-# -*- coding: utf-8 -*-
-"""基本常量
-中文数字/数位/符号字符常量
-"""
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-02"
-
-CHINESE_DIGIS = "零一二三四五六七八九"
-BIG_CHINESE_DIGIS_SIMPLIFIED = "零壹贰叁肆伍陆柒捌玖"
-BIG_CHINESE_DIGIS_TRADITIONAL = "零壹貳參肆伍陸柒捌玖"
-SMALLER_BIG_CHINESE_UNITS_SIMPLIFIED = "十百千万"
-SMALLER_BIG_CHINESE_UNITS_TRADITIONAL = "拾佰仟萬"
-LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED = "亿兆京垓秭穰沟涧正载"
-LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL = "億兆京垓秭穰溝澗正載"
-SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED = "十百千万"
-SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL = "拾佰仟萬"
-
-ZERO_ALT = "〇"
-ONE_ALT = "幺"
-TWO_ALTS = ["两", "兩"]
-
-POSITIVE = ["正", "正"]
-NEGATIVE = ["负", "負"]
-POINT = ["点", "點"]
-# PLUS = [u'加', u'加']
-# SIL = [u'杠', u'槓']
-
-# 中文数字系统类型
-NUMBERING_TYPES = ["low", "mid", "high"]

+ 0 - 342
fish_speech/text/chn_text_norm/basic_util.py

@@ -1,342 +0,0 @@
-# -*- coding: utf-8 -*-
-"""基本方法
-创建中文数字系统 方法
-中文字符串 <=> 数字串 方法
-数字串 <=> 中文字符串 方法
-"""
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-02"
-
-from fish_speech.text.chn_text_norm.basic_class import *
-from fish_speech.text.chn_text_norm.basic_constant import *
-
-
-def create_system(numbering_type=NUMBERING_TYPES[1]):
-    """
-    根据数字系统类型返回创建相应的数字系统,默认为 mid
-    NUMBERING_TYPES = ['low', 'mid', 'high']: 中文数字系统类型
-        low:  '兆' = '亿' * '十' = $10^{9}$,  '京' = '兆' * '十', etc.
-        mid:  '兆' = '亿' * '万' = $10^{12}$, '京' = '兆' * '万', etc.
-        high: '兆' = '亿' * '亿' = $10^{16}$, '京' = '兆' * '兆', etc.
-    返回对应的数字系统
-    """
-
-    # chinese number units of '亿' and larger
-    all_larger_units = zip(
-        LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED,
-        LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL,
-    )
-    larger_units = [
-        CNU.create(i, v, numbering_type, False) for i, v in enumerate(all_larger_units)
-    ]
-    # chinese number units of '十, 百, 千, 万'
-    all_smaller_units = zip(
-        SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED,
-        SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL,
-    )
-    smaller_units = [
-        CNU.create(i, v, small_unit=True) for i, v in enumerate(all_smaller_units)
-    ]
-    # digis
-    chinese_digis = zip(
-        CHINESE_DIGIS,
-        CHINESE_DIGIS,
-        BIG_CHINESE_DIGIS_SIMPLIFIED,
-        BIG_CHINESE_DIGIS_TRADITIONAL,
-    )
-    digits = [CND.create(i, v) for i, v in enumerate(chinese_digis)]
-    digits[0].alt_s, digits[0].alt_t = ZERO_ALT, ZERO_ALT
-    digits[1].alt_s, digits[1].alt_t = ONE_ALT, ONE_ALT
-    digits[2].alt_s, digits[2].alt_t = TWO_ALTS[0], TWO_ALTS[1]
-
-    # symbols
-    positive_cn = CM(POSITIVE[0], POSITIVE[1], "+", lambda x: x)
-    negative_cn = CM(NEGATIVE[0], NEGATIVE[1], "-", lambda x: -x)
-    point_cn = CM(POINT[0], POINT[1], ".", lambda x, y: float(str(x) + "." + str(y)))
-    # sil_cn = CM(SIL[0], SIL[1], '-', lambda x, y: float(str(x) + '-' + str(y)))
-    system = NumberSystem()
-    system.units = smaller_units + larger_units
-    system.digits = digits
-    system.math = MathSymbol(positive_cn, negative_cn, point_cn)
-    # system.symbols = OtherSymbol(sil_cn)
-    return system
-
-
-def chn2num(chinese_string, numbering_type=NUMBERING_TYPES[1]):
-
-    def get_symbol(char, system):
-        for u in system.units:
-            if char in [u.traditional, u.simplified, u.big_s, u.big_t]:
-                return u
-        for d in system.digits:
-            if char in [
-                d.traditional,
-                d.simplified,
-                d.big_s,
-                d.big_t,
-                d.alt_s,
-                d.alt_t,
-            ]:
-                return d
-        for m in system.math:
-            if char in [m.traditional, m.simplified]:
-                return m
-
-    def string2symbols(chinese_string, system):
-        int_string, dec_string = chinese_string, ""
-        for p in [system.math.point.simplified, system.math.point.traditional]:
-            if p in chinese_string:
-                int_string, dec_string = chinese_string.split(p)
-                break
-        return [get_symbol(c, system) for c in int_string], [
-            get_symbol(c, system) for c in dec_string
-        ]
-
-    def correct_symbols(integer_symbols, system):
-        """
-        一百八 to 一百八十
-        一亿一千三百万 to 一亿 一千万 三百万
-        """
-
-        if integer_symbols and isinstance(integer_symbols[0], CNU):
-            if integer_symbols[0].power == 1:
-                integer_symbols = [system.digits[1]] + integer_symbols
-
-        if len(integer_symbols) > 1:
-            if isinstance(integer_symbols[-1], CND) and isinstance(
-                integer_symbols[-2], CNU
-            ):
-                integer_symbols.append(
-                    CNU(integer_symbols[-2].power - 1, None, None, None, None)
-                )
-
-        result = []
-        unit_count = 0
-        for s in integer_symbols:
-            if isinstance(s, CND):
-                result.append(s)
-                unit_count = 0
-            elif isinstance(s, CNU):
-                current_unit = CNU(s.power, None, None, None, None)
-                unit_count += 1
-
-            if unit_count == 1:
-                result.append(current_unit)
-            elif unit_count > 1:
-                for i in range(len(result)):
-                    if (
-                        isinstance(result[-i - 1], CNU)
-                        and result[-i - 1].power < current_unit.power
-                    ):
-                        result[-i - 1] = CNU(
-                            result[-i - 1].power + current_unit.power,
-                            None,
-                            None,
-                            None,
-                            None,
-                        )
-        return result
-
-    def compute_value(integer_symbols):
-        """
-        Compute the value.
-        When current unit is larger than previous unit, current unit * all previous units will be used as all previous units.
-        e.g. '两千万' = 2000 * 10000 not 2000 + 10000
-        """
-        value = [0]
-        last_power = 0
-        for s in integer_symbols:
-            if isinstance(s, CND):
-                value[-1] = s.value
-            elif isinstance(s, CNU):
-                value[-1] *= pow(10, s.power)
-                if s.power > last_power:
-                    value[:-1] = list(map(lambda v: v * pow(10, s.power), value[:-1]))
-                    last_power = s.power
-                value.append(0)
-        return sum(value)
-
-    system = create_system(numbering_type)
-    int_part, dec_part = string2symbols(chinese_string, system)
-    int_part = correct_symbols(int_part, system)
-    int_str = str(compute_value(int_part))
-    dec_str = "".join([str(d.value) for d in dec_part])
-    if dec_part:
-        return "{0}.{1}".format(int_str, dec_str)
-    else:
-        return int_str
-
-
-def num2chn(
-    number_string,
-    numbering_type=NUMBERING_TYPES[1],
-    big=False,
-    traditional=False,
-    alt_zero=False,
-    alt_one=False,
-    alt_two=True,
-    use_zeros=True,
-    use_units=True,
-):
-
-    def get_value(value_string, use_zeros=True):
-
-        striped_string = value_string.lstrip("0")
-
-        # record nothing if all zeros
-        if not striped_string:
-            return []
-
-        # record one digits
-        elif len(striped_string) == 1:
-            if use_zeros and len(value_string) != len(striped_string):
-                return [system.digits[0], system.digits[int(striped_string)]]
-            else:
-                return [system.digits[int(striped_string)]]
-
-        # recursively record multiple digits
-        else:
-            result_unit = next(
-                u for u in reversed(system.units) if u.power < len(striped_string)
-            )
-            result_string = value_string[: -result_unit.power]
-            return (
-                get_value(result_string)
-                + [result_unit]
-                + get_value(striped_string[-result_unit.power :])
-            )
-
-    system = create_system(numbering_type)
-
-    int_dec = number_string.split(".")
-    if len(int_dec) == 1:
-        int_string = int_dec[0]
-        dec_string = ""
-    elif len(int_dec) == 2:
-        int_string = int_dec[0]
-        dec_string = int_dec[1]
-    else:
-        raise ValueError(
-            "invalid input num string with more than one dot: {}".format(number_string)
-        )
-
-    if use_units and len(int_string) > 1:
-        result_symbols = get_value(int_string)
-    else:
-        result_symbols = [system.digits[int(c)] for c in int_string]
-    dec_symbols = [system.digits[int(c)] for c in dec_string]
-    if dec_string:
-        result_symbols += [system.math.point] + dec_symbols
-
-    if alt_two:
-        liang = CND(
-            2,
-            system.digits[2].alt_s,
-            system.digits[2].alt_t,
-            system.digits[2].big_s,
-            system.digits[2].big_t,
-        )
-        for i, v in enumerate(result_symbols):
-            if isinstance(v, CND) and v.value == 2:
-                next_symbol = (
-                    result_symbols[i + 1] if i < len(result_symbols) - 1 else None
-                )
-                previous_symbol = result_symbols[i - 1] if i > 0 else None
-                if isinstance(next_symbol, CNU) and isinstance(
-                    previous_symbol, (CNU, type(None))
-                ):
-                    if next_symbol.power != 1 and (
-                        (previous_symbol is None) or (previous_symbol.power != 1)
-                    ):
-                        result_symbols[i] = liang
-
-    # if big is True, '两' will not be used and `alt_two` has no impact on output
-    if big:
-        attr_name = "big_"
-        if traditional:
-            attr_name += "t"
-        else:
-            attr_name += "s"
-    else:
-        if traditional:
-            attr_name = "traditional"
-        else:
-            attr_name = "simplified"
-
-    result = "".join([getattr(s, attr_name) for s in result_symbols])
-
-    # if not use_zeros:
-    #     result = result.strip(getattr(system.digits[0], attr_name))
-
-    if alt_zero:
-        result = result.replace(
-            getattr(system.digits[0], attr_name), system.digits[0].alt_s
-        )
-
-    if alt_one:
-        result = result.replace(
-            getattr(system.digits[1], attr_name), system.digits[1].alt_s
-        )
-
-    for i, p in enumerate(POINT):
-        if result.startswith(p):
-            return CHINESE_DIGIS[0] + result
-
-    # ^10, 11, .., 19
-    if (
-        len(result) >= 2
-        and result[1]
-        in [
-            SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED[0],
-            SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL[0],
-        ]
-        and result[0]
-        in [
-            CHINESE_DIGIS[1],
-            BIG_CHINESE_DIGIS_SIMPLIFIED[1],
-            BIG_CHINESE_DIGIS_TRADITIONAL[1],
-        ]
-    ):
-        result = result[1:]
-
-    return result
-
-
-if __name__ == "__main__":
-
-    # 测试程序
-    all_chinese_number_string = (
-        CHINESE_DIGIS
-        + BIG_CHINESE_DIGIS_SIMPLIFIED
-        + BIG_CHINESE_DIGIS_TRADITIONAL
-        + LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED
-        + LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL
-        + SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED
-        + SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL
-        + ZERO_ALT
-        + ONE_ALT
-        + "".join(TWO_ALTS + POSITIVE + NEGATIVE + POINT)
-    )
-
-    print("num:", chn2num("一万零四百零三点八零五"))
-    print("num:", chn2num("一亿六点三"))
-    print("num:", chn2num("一亿零六点三"))
-    print("num:", chn2num("两千零一亿六点三"))
-    # print('num:', chn2num('一零零八六'))
-    print("txt:", num2chn("10260.03", alt_zero=True))
-    print("txt:", num2chn("20037.090", numbering_type="low", traditional=True))
-    print("txt:", num2chn("100860001.77", numbering_type="high", big=True))
-    print(
-        "txt:",
-        num2chn(
-            "059523810880",
-            alt_one=True,
-            alt_two=False,
-            use_lzeros=True,
-            use_rzeros=True,
-            use_units=False,
-        ),
-    )
-
-    print(all_chinese_number_string)

+ 0 - 32
fish_speech/text/chn_text_norm/cardinal.py

@@ -1,32 +0,0 @@
-# -*- coding: utf-8 -*-
-"""CARDINAL类 (包含小数DECIMAL类)
-纯数 <=> 中文字符串 方法
-中文字符串 <=> 纯数 方法
-"""
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-03"
-
-from fish_speech.text.chn_text_norm.basic_util import *
-
-
-class Cardinal:
-    """
-    CARDINAL类
-    """
-
-    def __init__(self, cardinal=None, chntext=None):
-        self.cardinal = cardinal
-        self.chntext = chntext
-
-    def chntext2cardinal(self):
-        return chn2num(self.chntext)
-
-    def cardinal2chntext(self):
-        return num2chn(self.cardinal)
-
-
-if __name__ == "__main__":
-
-    # 测试程序
-    print(Cardinal(cardinal="21357.230").cardinal2chntext())

+ 0 - 75
fish_speech/text/chn_text_norm/date.py

@@ -1,75 +0,0 @@
-# -*- coding: utf-8 -*-
-"""DATE类
-日期 <=> 中文字符串 方法
-中文字符串 <=> 日期 方法
-"""
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-07"
-
-from fish_speech.text.chn_text_norm.cardinal import Cardinal
-from fish_speech.text.chn_text_norm.digit import Digit
-
-
-class Date:
-    """
-    DATE类
-    """
-
-    def __init__(self, date=None, chntext=None):
-        self.date = date
-        self.chntext = chntext
-
-    # def chntext2date(self):
-    #     chntext = self.chntext
-    #     try:
-    #         year, other = chntext.strip().split('年', maxsplit=1)
-    #         year = Digit(chntext=year).digit2chntext() + '年'
-    #     except ValueError:
-    #         other = chntext
-    #         year = ''
-    #     if other:
-    #         try:
-    #             month, day = other.strip().split('月', maxsplit=1)
-    #             month = Cardinal(chntext=month).chntext2cardinal() + '月'
-    #         except ValueError:
-    #             day = chntext
-    #             month = ''
-    #         if day:
-    #             day = Cardinal(chntext=day[:-1]).chntext2cardinal() + day[-1]
-    #     else:
-    #         month = ''
-    #         day = ''
-    #     date = year + month + day
-    #     self.date = date
-    #     return self.date
-
-    def date2chntext(self):
-        date = self.date
-        try:
-            year, other = date.strip().split("年", maxsplit=1)
-            year = Digit(digit=year).digit2chntext() + "年"
-        except ValueError:
-            other = date
-            year = ""
-        if other:
-            try:
-                month, day = other.strip().split("月", maxsplit=1)
-                month = Cardinal(cardinal=month).cardinal2chntext() + "月"
-            except ValueError:
-                day = date
-                month = ""
-            if day:
-                day = Cardinal(cardinal=day[:-1]).cardinal2chntext() + day[-1]
-        else:
-            month = ""
-            day = ""
-        chntext = year + month + day
-        self.chntext = chntext
-        return self.chntext
-
-
-if __name__ == "__main__":
-
-    # 测试
-    print(Date(date="09年3月16日").date2chntext())

+ 0 - 32
fish_speech/text/chn_text_norm/digit.py

@@ -1,32 +0,0 @@
-# -*- coding: utf-8 -*-
-"""DIGIT类
-数字串 <=> 中文字符串 方法
-中文字符串 <=> 数字串 方法
-"""
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-03"
-
-from fish_speech.text.chn_text_norm.basic_util import *
-
-
-class Digit:
-    """
-    DIGIT类
-    """
-
-    def __init__(self, digit=None, chntext=None):
-        self.digit = digit
-        self.chntext = chntext
-
-    # def chntext2digit(self):
-    #     return chn2num(self.chntext)
-
-    def digit2chntext(self):
-        return num2chn(self.digit, alt_two=False, use_units=False)
-
-
-if __name__ == "__main__":
-
-    # 测试程序
-    print(Digit(digit="2016").digit2chntext())

+ 0 - 35
fish_speech/text/chn_text_norm/fraction.py

@@ -1,35 +0,0 @@
-# -*- coding: utf-8 -*-
-"""FRACTION类
-分数 <=> 中文字符串 方法
-中文字符串 <=> 分数 方法
-"""
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-03"
-
-from fish_speech.text.chn_text_norm.basic_util import *
-
-
-class Fraction:
-    """
-    FRACTION类
-    """
-
-    def __init__(self, fraction=None, chntext=None):
-        self.fraction = fraction
-        self.chntext = chntext
-
-    def chntext2fraction(self):
-        denominator, numerator = self.chntext.split("分之")
-        return chn2num(numerator) + "/" + chn2num(denominator)
-
-    def fraction2chntext(self):
-        numerator, denominator = self.fraction.split("/")
-        return num2chn(denominator) + "分之" + num2chn(numerator)
-
-
-if __name__ == "__main__":
-
-    # 测试程序
-    print(Fraction(fraction="2135/7230").fraction2chntext())
-    print(Fraction(chntext="五百八十一分之三百六十九").chntext2fraction())

+ 0 - 43
fish_speech/text/chn_text_norm/money.py

@@ -1,43 +0,0 @@
-# -*- coding: utf-8 -*-
-"""MONEY类
-金钱 <=> 中文字符串 方法
-中文字符串 <=> 金钱 方法
-"""
-import re
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-08"
-
-from fish_speech.text.chn_text_norm.cardinal import Cardinal
-
-
-class Money:
-    """
-    MONEY类
-    """
-
-    def __init__(self, money=None, chntext=None):
-        self.money = money
-        self.chntext = chntext
-
-    # def chntext2money(self):
-    #     return self.money
-
-    def money2chntext(self):
-        money = self.money
-        pattern = re.compile(r"(\d+(\.\d+)?)")
-        matchers = pattern.findall(money)
-        if matchers:
-            for matcher in matchers:
-                money = money.replace(
-                    matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext()
-                )
-        self.chntext = money
-        return self.chntext
-
-
-if __name__ == "__main__":
-
-    # 测试
-    print(Money(money="21.5万元").money2chntext())
-    print(Money(money="230块5毛").money2chntext())

+ 0 - 33
fish_speech/text/chn_text_norm/percentage.py

@@ -1,33 +0,0 @@
-# -*- coding: utf-8 -*-
-"""PERCENTAGE类
-百分数 <=> 中文字符串 方法
-中文字符串 <=> 百分数 方法
-"""
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-06"
-
-from fish_speech.text.chn_text_norm.basic_util import *
-
-
-class Percentage:
-    """
-    PERCENTAGE类
-    """
-
-    def __init__(self, percentage=None, chntext=None):
-        self.percentage = percentage
-        self.chntext = chntext
-
-    def chntext2percentage(self):
-        return chn2num(self.chntext.strip().strip("百分之")) + "%"
-
-    def percentage2chntext(self):
-        return "百分之" + num2chn(self.percentage.strip().strip("%"))
-
-
-if __name__ == "__main__":
-
-    # 测试程序
-    print(Percentage(chntext="百分之五十六点零三").chntext2percentage())
-    print(Percentage(percentage="65.3%").percentage2chntext())

+ 0 - 51
fish_speech/text/chn_text_norm/telephone.py

@@ -1,51 +0,0 @@
-# -*- coding: utf-8 -*-
-"""TELEPHONE类
-电话号码 <=> 中文字符串 方法
-中文字符串 <=> 电话号码 方法
-"""
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-03"
-
-from fish_speech.text.chn_text_norm.basic_util import *
-
-
-class TelePhone:
-    """
-    TELEPHONE类
-    """
-
-    def __init__(self, telephone=None, raw_chntext=None, chntext=None):
-        self.telephone = telephone
-        self.raw_chntext = raw_chntext
-        self.chntext = chntext
-
-    # def chntext2telephone(self):
-    #     sil_parts = self.raw_chntext.split('<SIL>')
-    #     self.telephone = '-'.join([
-    #         str(chn2num(p)) for p in sil_parts
-    #     ])
-    #     return self.telephone
-
-    def telephone2chntext(self, fixed=False):
-
-        if fixed:
-            sil_parts = self.telephone.split("-")
-            self.raw_chntext = "<SIL>".join(
-                [num2chn(part, alt_two=False, use_units=False) for part in sil_parts]
-            )
-            self.chntext = self.raw_chntext.replace("<SIL>", "")
-        else:
-            sp_parts = self.telephone.strip("+").split()
-            self.raw_chntext = "<SP>".join(
-                [num2chn(part, alt_two=False, use_units=False) for part in sp_parts]
-            )
-            self.chntext = self.raw_chntext.replace("<SP>", "")
-        return self.chntext
-
-
-if __name__ == "__main__":
-
-    # 测试程序
-    print(TelePhone(telephone="0595-23980880").telephone2chntext())
-    # print(TelePhone(raw_chntext='零五九五杠二三八六五零九八').chntext2telephone())

+ 0 - 177
fish_speech/text/chn_text_norm/text.py

@@ -1,177 +0,0 @@
-# -*- coding: utf-8 -*-
-"""
-TEXT类
-"""
-
-__author__ = "Zhiyang Zhou <zyzhou@stu.xmu.edu.cn>"
-__data__ = "2019-05-03"
-
-import re
-
-from fish_speech.text.chn_text_norm.cardinal import Cardinal
-from fish_speech.text.chn_text_norm.date import Date
-from fish_speech.text.chn_text_norm.digit import Digit
-from fish_speech.text.chn_text_norm.fraction import Fraction
-from fish_speech.text.chn_text_norm.money import Money
-from fish_speech.text.chn_text_norm.percentage import Percentage
-from fish_speech.text.chn_text_norm.telephone import TelePhone
-
-CURRENCY_NAMES = (
-    "(人民币|美元|日元|英镑|欧元|马克|法郎|加拿大元|澳元|港币|先令|芬兰马克|爱尔兰镑|"
-    "里拉|荷兰盾|埃斯库多|比塞塔|印尼盾|林吉特|新西兰元|比索|卢布|新加坡元|韩元|泰铢)"
-)
-CURRENCY_UNITS = "((亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|)元|(亿|千万|百万|万|千|百|)块|角|毛|分)"
-COM_QUANTIFIERS = (
-    "(匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|担|颗|壳|窠|曲|墙|群|腔|"
-    "砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|"
-    "针|线|管|名|位|身|堂|课|本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|"
-    "毫|厘|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|撮|勺|合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|"
-    "盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|"
-    "纪|岁|世|更|夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块|人|抽)"
-)
-
-
-class Text:
-    """
-    Text类
-    """
-
-    def __init__(self, raw_text, norm_text=None):
-        self.raw_text = "^" + raw_text + "$"
-        self.norm_text = norm_text
-
-    def _particular(self):
-        text = self.norm_text
-        pattern = re.compile(r"(([a-zA-Z]+)二([a-zA-Z]+))")
-        matchers = pattern.findall(text)
-        if matchers:
-            # print('particular')
-            for matcher in matchers:
-                text = text.replace(matcher[0], matcher[1] + "2" + matcher[2], 1)
-        self.norm_text = text
-        return self.norm_text
-
-    def normalize(self):
-        text = self.raw_text
-
-        # 规范化日期
-        pattern = re.compile(
-            r"\D+((([089]\d|(19|20)\d{2})年)?(\d{1,2}月(\d{1,2}[日号])?)?)"
-        )
-        matchers = pattern.findall(text)
-        if matchers:
-            # print('date')
-            for matcher in matchers:
-                text = text.replace(matcher[0], Date(date=matcher[0]).date2chntext(), 1)
-
-        # 规范化金钱
-        pattern = re.compile(
-            r"\D+((\d+(\.\d+)?)[多余几]?"
-            + CURRENCY_UNITS
-            + "(\d"
-            + CURRENCY_UNITS
-            + "?)?)"
-        )
-        matchers = pattern.findall(text)
-        if matchers:
-            # print('money')
-            for matcher in matchers:
-                text = text.replace(
-                    matcher[0], Money(money=matcher[0]).money2chntext(), 1
-                )
-
-        # 规范化固话/手机号码
-        # 手机
-        # http://www.jihaoba.com/news/show/13680
-        # 移动:139、138、137、136、135、134、159、158、157、150、151、152、188、187、182、183、184、178、198
-        # 联通:130、131、132、156、155、186、185、176
-        # 电信:133、153、189、180、181、177
-        pattern = re.compile(r"\D((\+?86 ?)?1([38]\d|5[0-35-9]|7[678]|9[89])\d{8})\D")
-        matchers = pattern.findall(text)
-        if matchers:
-            # print('telephone')
-            for matcher in matchers:
-                text = text.replace(
-                    matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(), 1
-                )
-        # 固话
-        pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D")
-        matchers = pattern.findall(text)
-        if matchers:
-            # print('fixed telephone')
-            for matcher in matchers:
-                text = text.replace(
-                    matcher[0],
-                    TelePhone(telephone=matcher[0]).telephone2chntext(fixed=True),
-                    1,
-                )
-
-        # 规范化分数
-        pattern = re.compile(r"(\d+/\d+)")
-        matchers = pattern.findall(text)
-        if matchers:
-            # print('fraction')
-            for matcher in matchers:
-                text = text.replace(
-                    matcher, Fraction(fraction=matcher).fraction2chntext(), 1
-                )
-
-        # 规范化百分数
-        text = text.replace("%", "%")
-        pattern = re.compile(r"(\d+(\.\d+)?%)")
-        matchers = pattern.findall(text)
-        if matchers:
-            # print('percentage')
-            for matcher in matchers:
-                text = text.replace(
-                    matcher[0],
-                    Percentage(percentage=matcher[0]).percentage2chntext(),
-                    1,
-                )
-
-        # 规范化纯数+量词
-        pattern = re.compile(r"(\d+(\.\d+)?)[多余几]?" + COM_QUANTIFIERS)
-        matchers = pattern.findall(text)
-        if matchers:
-            # print('cardinal+quantifier')
-            for matcher in matchers:
-                text = text.replace(
-                    matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1
-                )
-
-        # 规范化数字编号
-        pattern = re.compile(r"(\d{4,32})")
-        matchers = pattern.findall(text)
-        if matchers:
-            # print('digit')
-            for matcher in matchers:
-                text = text.replace(matcher, Digit(digit=matcher).digit2chntext(), 1)
-
-        # 规范化纯数
-        pattern = re.compile(r"(\d+(\.\d+)?)")
-        matchers = pattern.findall(text)
-        if matchers:
-            # print('cardinal')
-            for matcher in matchers:
-                text = text.replace(
-                    matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1
-                )
-
-        self.norm_text = text
-        self._particular()
-
-        return self.norm_text.lstrip("^").rstrip("$")
-
-
-if __name__ == "__main__":
-
-    # 测试程序
-    print(Text(raw_text="固话:0595-23865596或23880880。").normalize())
-    print(Text(raw_text="手机:+86 19859213959或15659451527。").normalize())
-    print(Text(raw_text="分数:32477/76391。").normalize())
-    print(Text(raw_text="百分数:80.03%。").normalize())
-    print(Text(raw_text="编号:31520181154418。").normalize())
-    print(Text(raw_text="纯数:2983.07克或12345.60米。").normalize())
-    print(Text(raw_text="日期:1999年2月20日或09年3月15号。").normalize())
-    print(Text(raw_text="金钱:12块5,34.5元,20.1万").normalize())
-    print(Text(raw_text="特殊:O2O或B2C。").normalize())

+ 0 - 184
install_env.bat

@@ -1,184 +0,0 @@
-@echo off
-chcp 65001
-
-set USE_MIRROR=true
-echo "USE_MIRROR: %USE_MIRROR%"
-setlocal enabledelayedexpansion
-
-cd /D "%~dp0"
-
-set PATH="%PATH%";%SystemRoot%\system32
-
-echo %PATH%
-
-
-echo "%CD%"| findstr /R /C:"[!#\$%&()\*+,;<=>?@\[\]\^`{|}~\u4E00-\u9FFF ] " >nul && (
-    echo.
-    echo The current path contains special characters. Please move fish-speech to a path without special characters before running. && (
-        goto end
-    )
-)
-
-
-set TMP=%CD%\fishenv
-set TEMP=%CD%\fishenv
-
-(call conda deactivate && call conda deactivate && call conda deactivate) 2>nul
-
-set INSTALL_DIR=%cd%\fishenv
-set CONDA_ROOT_PREFIX=%cd%\fishenv\conda
-set INSTALL_ENV_DIR=%cd%\fishenv\env
-set PIP_CMD=%cd%\fishenv\env\python -m pip
-set PYTHON_CMD=%cd%\fishenv\env\python
-set API_FLAG_PATH=%~dp0API_FLAGS.txt
-set MINICONDA_DOWNLOAD_URL=https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Windows-x86_64.exe
-if "!USE_MIRROR!" == "true" (
-    set MINICONDA_DOWNLOAD_URL=https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py310_23.3.1-0-Windows-x86_64.exe
-)
-set MINICONDA_CHECKSUM=307194e1f12bbeb52b083634e89cc67db4f7980bd542254b43d3309eaf7cb358
-set conda_exists=F
-
-call "%CONDA_ROOT_PREFIX%\_conda.exe" --version >nul 2>&1
-if "%ERRORLEVEL%" EQU "0" set conda_exists=T
-
-if "%conda_exists%" == "F" (
-    echo.
-    echo Downloading Miniconda...
-    mkdir "%INSTALL_DIR%" 2>nul
-    call curl -Lk "%MINICONDA_DOWNLOAD_URL%" > "%INSTALL_DIR%\miniconda_installer.exe"
-    if errorlevel 1 (
-        echo.
-        echo Failed to download miniconda.
-        goto end
-    )
-    for /f %%a in ('
-        certutil -hashfile "%INSTALL_DIR%\miniconda_installer.exe" sha256
-        ^| find /i /v " "
-        ^| find /i "%MINICONDA_CHECKSUM%"
-    ') do (
-        set "hash=%%a"
-    )
-    if not defined hash (
-        echo.
-        echo Miniconda hash mismatched!
-        del "%INSTALL_DIR%\miniconda_installer.exe"
-        goto end
-    ) else (
-        echo.
-        echo Miniconda hash matched successfully.
-    )
-    echo Downloaded "%CONDA_ROOT_PREFIX%"
-    start /wait "" "%INSTALL_DIR%\miniconda_installer.exe" /InstallationType=JustMe /NoShortcuts=1 /AddToPath=0 /RegisterPython=0 /NoRegistry=1 /S /D=%CONDA_ROOT_PREFIX%
-
-    call "%CONDA_ROOT_PREFIX%\_conda.exe" --version
-    if errorlevel 1 (
-        echo.
-        echo Cannot install Miniconda.
-        goto end
-    ) else (
-        echo.
-        echo Miniconda installed successfully.
-    )
-
-    del "%INSTALL_DIR%\miniconda_installer.exe"
-)
-
-
-if not exist "%INSTALL_ENV_DIR%" (
-    echo.
-    echo Creating Conda Environment...
-    if "!USE_MIRROR!" == "true" (
-        call "%CONDA_ROOT_PREFIX%\_conda.exe" create --no-shortcuts -y -k --prefix "%INSTALL_ENV_DIR%" -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ python=3.10
-    ) else (
-        call "%CONDA_ROOT_PREFIX%\_conda.exe" create --no-shortcuts -y -k --prefix "%INSTALL_ENV_DIR%" python=3.10
-    )
-
-    if errorlevel 1 (
-        echo.
-        echo Failed to Create Environment.
-        goto end
-    )
-)
-
-if not exist "%INSTALL_ENV_DIR%\python.exe" (
-    echo.
-    echo Conda Env does not exist.
-    goto end
-)
-
-set PYTHONNOUSERSITE=1
-set PYTHONPATH=
-set PYTHONHOME=
-set "CUDA_PATH=%INSTALL_ENV_DIR%"
-set "CUDA_HOME=%CUDA_PATH%"
-
-call "%CONDA_ROOT_PREFIX%\condabin\conda.bat" activate "%INSTALL_ENV_DIR%"
-
-if errorlevel 1 (
-    echo.
-    echo Failed to activate Env.
-    goto end
-) else (
-    echo.
-    echo Successfully created env.
-)
-
-set "HF_ENDPOINT=https://huggingface.co"
-set "no_proxy="
-if "%USE_MIRROR%"=="true" (
-    set "HF_ENDPOINT=https://hf-mirror.com"
-    set "no_proxy=localhost,127.0.0.1,0.0.0.0"
-)
-
-echo "HF_ENDPOINT: !HF_ENDPOINT!"
-echo "NO_PROXY: !no_proxy!"
-
-if "!USE_MIRROR!" == "true" (
-    %PIP_CMD% install torch torchvision torchaudio -U --extra-index-url https://mirrors.bfsu.edu.cn/pypi/web/simple
-) else (
-    %PIP_CMD% install torch torchvision torchaudio -U --index-url https://download.pytorch.org/whl/cu121
-)
-
-%PIP_CMD% install -e . --upgrade-strategy only-if-needed
-
-call :download_and_install "triton_windows-0.1.0-py3-none-any.whl" ^
-        "%HF_ENDPOINT%/datasets/SpicyqSama007/windows_compile/resolve/main/triton_windows-0.1.0-py3-none-any.whl?download=true" ^
-        "2cc998638180f37cf5025ab65e48c7f629aa5a369176cfa32177d2bd9aa26a0a"
-
-
-endlocal
-echo "Environment Check: Success."
-:end
-pause
-
-goto :EOF
-
-
-:download_and_install
-setlocal
-
-set "WHEEL_FILE=%1"
-set "URL=%2"
-set "CHKSUM=%3"
-
-:DOWNLOAD
-if not exist "%WHEEL_FILE%" (
-    call curl -Lk "%URL%" --output "%WHEEL_FILE%"
-)
-
-for /f "delims=" %%I in ("certutil -hashfile %WHEEL_FILE% SHA256 ^| find /i %CHKSUM%") do (
-    set "FILE_VALID=true"
-)
-
-if not defined FILE_VALID (
-    echo File checksum does not match, re-downloading...
-    del "%WHEEL_FILE%"
-    goto DOWNLOAD
-)
-
-echo "OK for %WHEEL_FILE%"
-%PIP_CMD% install "%WHEEL_FILE%" --no-warn-script-location
-del "%WHEEL_FILE%"
-
-endlocal
-goto :EOF
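
The script above amounted to: download Miniconda, verify its SHA-256, create a Python 3.10 environment, install PyTorch and the package, and fetch a prebuilt Triton wheel. A minimal Python sketch of the download-and-verify step, assuming only the URL and checksum taken from the deleted script:

```python
import hashlib
import urllib.request

# URL and checksum copied from the deleted script; the rest is illustrative.
URL = "https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Windows-x86_64.exe"
SHA256 = "307194e1f12bbeb52b083634e89cc67db4f7980bd542254b43d3309eaf7cb358"

def download_and_verify(url: str, sha256: str, dest: str) -> None:
    urllib.request.urlretrieve(url, dest)
    with open(dest, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != sha256:
        raise RuntimeError(f"Checksum mismatch: {digest}")
```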

+ 0 - 50
run_cmd.bat

@@ -1,50 +0,0 @@
-@echo off
-chcp 65001
-
-set no_proxy="127.0.0.1, 0.0.0.0, localhost"
-setlocal enabledelayedexpansion
-
-cd /D "%~dp0"
-
-set PATH="%PATH%";%SystemRoot%\system32
-
-
-echo "%CD%"| findstr /R /C:"[!#\$%&()\*+,;<=>?@\[\]\^`{|}~\u4E00-\u9FFF ] " >nul && (
-    echo.
-    echo The current path contains special characters. Please move fish-speech to a path without special characters before running. && (
-        goto end
-    )
-)
-
-
-set TMP=%CD%\fishenv
-set TEMP=%CD%\fishenv
-
-
-(call conda deactivate && call conda deactivate && call conda deactivate) 2>nul
-
-
-set CONDA_ROOT_PREFIX=%cd%\fishenv\conda
-set INSTALL_ENV_DIR=%cd%\fishenv\env
-
-
-set PYTHONNOUSERSITE=1
-set PYTHONPATH=%~dp0
-set PYTHONHOME=
-
-
-call "%CONDA_ROOT_PREFIX%\condabin\conda.bat" activate "%INSTALL_ENV_DIR%"
-
-if errorlevel 1 (
-    echo.
-    echo Environment activation failed.
-    goto end
-) else (
-    echo.
-    echo Environment activation succeeded.
-)
-
-cmd /k "%*"
-
-:end
-pause

+ 0 - 97
start.bat

@@ -1,97 +0,0 @@
-@echo off
-chcp 65001
-
-set USE_MIRROR=true
-set PYTHONPATH=%~dp0
-set PYTHON_CMD=python
-if exist "fishenv" (
-    set PYTHON_CMD=%cd%\fishenv\env\python
-)
-
-set API_FLAG_PATH=%~dp0API_FLAGS.txt
-set KMP_DUPLICATE_LIB_OK=TRUE
-
-setlocal enabledelayedexpansion
-
-set "HF_ENDPOINT=https://huggingface.co"
-set "no_proxy="
-if "%USE_MIRROR%" == "true" (
-    set "HF_ENDPOINT=https://hf-mirror.com"
-    set "no_proxy=localhost, 127.0.0.1, 0.0.0.0"
-)
-echo "HF_ENDPOINT: !HF_ENDPOINT!"
-echo "NO_PROXY: !no_proxy!"
-
-echo "%CD%"| findstr /R /C:"[!#\$%&()\*+,;<=>?@\[\]\^`{|}~\u4E00-\u9FFF ] " >nul && (
-    echo.
-    echo The current path contains special characters. Please move fish-speech to a path without special characters before running. && (
-        goto end
-    )
-)
-
-%PYTHON_CMD% .\tools\download_models.py
-
-set "API_FLAGS="
-set "flags="
-
-if exist "%API_FLAG_PATH%" (
-    for /f "usebackq tokens=*" %%a in ("%API_FLAG_PATH%") do (
-        set "line=%%a"
-        if not "!line:~0,1!"=="#" (
-            set "line=!line: =<SPACE>!"
-            set "line=!line:\=!"
-            set "line=!line:<SPACE>= !"
-            if not "!line!"=="" (
-                set "API_FLAGS=!API_FLAGS!!line! "
-            )
-        )
-    )
-)
-
-
-if not "!API_FLAGS!"=="" set "API_FLAGS=!API_FLAGS:~0,-1!"
-
-set "flags="
-
-echo !API_FLAGS! | findstr /C:"--api" >nul 2>&1
-if !errorlevel! equ 0 (
-    echo.
-    echo Start HTTP API...
-    set "mode=api"
-    goto process_flags
-)
-
-echo !API_FLAGS! | findstr /C:"--infer" >nul 2>&1
-if !errorlevel! equ 0 (
-    echo.
-    echo Start WebUI Inference...
-    set "mode=infer"
-    goto process_flags
-)
-
-
-:process_flags
-for %%p in (!API_FLAGS!) do (
-    if not "%%p"=="--!mode!" (
-        set "flags=!flags! %%p"
-    )
-)
-
-if not "!flags!"=="" set "flags=!flags:~1!"
-
-echo Debug: flags = !flags!
-
-if "!mode!"=="api" (
-    %PYTHON_CMD% -m tools.api_server !flags!
-) else if "!mode!"=="infer" (
-    %PYTHON_CMD% -m tools.webui !flags!
-)
-
-echo.
-echo Next, launching the management page...
-%PYTHON_CMD% fish_speech\webui\manage.py
-
-
-:end
-endlocal
-pause
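
For anyone who scripted around `API_FLAGS.txt`, the parsing above did three things: skip `#` comment lines, strip line-continuation backslashes, and join what remains into one flag string before dispatching to `tools.api_server` or `tools.webui`. A hedged Python equivalent (not shipped with the repo):

```python
from pathlib import Path

def read_api_flags(path: str = "API_FLAGS.txt") -> list[str]:
    # Skip comments, drop continuation backslashes, split into argv-style flags.
    flags: list[str] = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.replace("\\", "").strip()
        if line and not line.startswith("#"):
            flags.extend(line.split())
    return flags
```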

+ 0 - 2
tools/api_client.py

@@ -65,7 +65,6 @@ def parse_args():
         default=True,
         help="Whether to play audio after receiving data",
     )
-    parser.add_argument("--normalize", type=bool, default=True)
     parser.add_argument(
         "--format", type=str, choices=["wav", "mp3", "flac"], default="wav"
     )
@@ -160,7 +159,6 @@ if __name__ == "__main__":
             for ref_text, ref_audio in zip(ref_texts, byte_audios)
         ],
         "reference_id": idstr,
-        "normalize": args.normalize,
         "format": args.format,
         "max_new_tokens": args.max_new_tokens,
         "chunk_length": args.chunk_length,

+ 0 - 0
tools/export-onnx.py → tools/export_onnx.py


+ 0 - 17
tools/llama/generate.py

@@ -1,17 +0,0 @@
-#!/usr/bin/env python
-import os
-import subprocess
-import sys
-
-
-
-def main():
-    # Make path relative to this file
-    script_path = os.path.join(
-        os.path.dirname(__file__), "../../fish_speech/models/text2semantic/inference.py"
-    )
-    subprocess.run(["python", script_path] + sys.argv[1:])
-
-
-if __name__ == "__main__":
-    main()
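
The wrapper only forwarded `argv`, so the equivalent direct call is (script path copied from the deleted shim):

```python
import subprocess
import sys

subprocess.run(
    [sys.executable, "fish_speech/models/text2semantic/inference.py", *sys.argv[1:]],
    check=True,
)
```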

+ 0 - 57
tools/llama/rebuild_tokenizer.py

@@ -1,57 +0,0 @@
-from tokenizers import Tokenizer, decoders, models, pre_tokenizers, processors, trainers
-from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast
-
-# Initialize a tokenizer
-tokenizer = Tokenizer(models.BPE())
-
-# Customize pre-tokenization and decoding
-tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
-tokenizer.decoder = decoders.ByteLevel()
-tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
-
-# Don't train the tokenizer
-trainer = trainers.BpeTrainer(
-    vocab_size=0,
-    min_frequency=2,
-    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
-    special_tokens=[
-        "<|begin_of_sequence|>",
-        "<|end_of_sequence|>",
-        "<|im_start|>",
-        "<|im_sep|>",  # system, user, assistant, etc.
-        "<|im_end|>",
-        "<|semantic|>",  # audio features
-        "<|pad|>",
-    ],
-)
-
-# <|im_start|>user<|im_sep|>...<|im_end|>
-# <|im_start|>assistant<|im_sep|><|semantic|><|semantic|><|semantic|><|semantic|><|semantic|><|im_end|>
-tokenizer.train_from_iterator([], trainer=trainer)
-
-print(len(tokenizer.get_vocab()))
-x = tokenizer.encode(
-    "Hello, how are you? dfgnviadfjoiviouajeiodfjv 你好世界 🈶<|semantic|>"
-).ids
-print(x, len(x))
-print(tokenizer.decode(x, skip_special_tokens=True))
-
-
-tokenizer = PreTrainedTokenizerFast(
-    tokenizer_object=tokenizer,
-    pad_token="<|pad|>",
-    bos_token="<|begin_of_sequence|>",
-    eos_token="<|end_of_sequence|>",
-)
-
-# Try tokenizing a new sequence
-sequence = "All around, too, lay vast quantities of the costliest merchandise, and treasures were heaped in every cranny of the rocks, but all these things only added to the desolation of the scene. 测试中文, 你好世界 🈶<|semantic|>"
-encoded = tokenizer(sequence).input_ids
-
-print("Test encoding....")
-print(f"\tSentence: {sequence}")
-print(f"\tEncoded: {encoded}")
-print(f"\tDecoded: {tokenizer.batch_decode(encoded)}")
-print(f"\tDecoded: {tokenizer.decode(encoded)}")
-
-tokenizer.push_to_hub("fishaudio/fish-speech-1", private=True)
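
With the builder removed, the published tokenizer can be loaded instead of rebuilt. A sketch, assuming the Hub repo id from the `push_to_hub` call above is accessible to you:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fishaudio/fish-speech-1")
ids = tokenizer("你好世界 <|semantic|>").input_ids
print(ids, tokenizer.decode(ids))
```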

+ 0 - 59
tools/sensevoice/README.md

@@ -1,59 +0,0 @@
-# FunASR Command Line Interface
-
-This tool provides a command-line interface for separating vocals from instrumental tracks, converting videos to audio, and performing speech-to-text transcription on the resulting audio files.
-
-## Requirements
-
-- Python >= 3.10
-- PyTorch <= 2.3.1
-- ffmpeg, pydub, audio-separator[gpu].
-
-## Installation
-
-Install the required packages:
-
-```bash
-pip install -e .[stable]
-```
-
-Make sure you have `ffmpeg` installed and available in your `PATH`.
-
-## Usage
-
-### Basic Usage
-
-To run the tool with default settings:
-
-```bash
-python tools/sensevoice/fun_asr.py --audio-dir <audio_directory> --save-dir <output_directory>
-```
-
-## Options
-
-|          Option           |                                  Description                                  |
-| :-----------------------: | :---------------------------------------------------------------------------: |
-|        --audio-dir        |                  Directory containing audio or video files.                   |
-|        --save-dir         |                   Directory to save processed audio files.                    |
-|         --device          |         Device to use for processing. Options: cuda (default) or cpu.         |
-|        --language         |                Language of the transcription. Default is auto.                |
-| --max_single_segment_time | Maximum duration of a single audio segment in milliseconds. Default is 20000. |
-|          --punc           |                        Enable punctuation prediction.                         |
-|         --denoise         |                  Enable noise reduction (vocal separation).                   |
-
-## Example
-
-To process audio files in the directory `path/to/audio` and save the output to `path/to/output`, with punctuation and noise reduction enabled:
-
-```bash
-python tools/sensevoice/fun_asr.py --audio-dir path/to/audio --save-dir path/to/output --punc --denoise
-```
-
-## Additional Notes
-
-- The tool supports both audio and video files. Videos will be converted to audio automatically.
-- If the `--denoise` option is used, the tool will perform vocal separation to isolate the vocals from the instrumental tracks.
-- The script will automatically create necessary directories in the `--save-dir`.
-
-## Troubleshooting
-
-If you encounter any issues, make sure all dependencies are correctly installed and configured. For more detailed troubleshooting, refer to the documentation of each dependency.

+ 0 - 0
tools/sensevoice/__init__.py


+ 0 - 573
tools/sensevoice/auto_model.py

@@ -1,573 +0,0 @@
-#!/usr/bin/env python3
-# -*- encoding: utf-8 -*-
-# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
-#  MIT License  (https://opensource.org/licenses/MIT)
-
-import copy
-import json
-import logging
-import os.path
-import random
-import re
-import string
-import time
-
-import numpy as np
-import torch
-from funasr.download.download_model_from_hub import download_model
-from funasr.download.file import download_from_url
-from funasr.register import tables
-from funasr.train_utils.load_pretrained_model import load_pretrained_model
-from funasr.train_utils.set_all_random_seed import set_all_random_seed
-from funasr.utils import export_utils, misc
-from funasr.utils.load_utils import load_audio_text_image_video, load_bytes
-from funasr.utils.misc import deep_update
-from funasr.utils.timestamp_tools import timestamp_sentence, timestamp_sentence_en
-from tqdm import tqdm
-
-from .vad_utils import merge_vad, slice_padding_audio_samples
-
-try:
-    from funasr.models.campplus.cluster_backend import ClusterBackend
-    from funasr.models.campplus.utils import distribute_spk, postprocess, sv_chunk
-except:
-    pass
-
-
-def prepare_data_iterator(data_in, input_len=None, data_type=None, key=None):
-    """ """
-    data_list = []
-    key_list = []
-    filelist = [".scp", ".txt", ".json", ".jsonl", ".text"]
-
-    chars = string.ascii_letters + string.digits
-    if isinstance(data_in, str):
-        if data_in.startswith("http://") or data_in.startswith("https://"):  # url
-            data_in = download_from_url(data_in)
-
-    if isinstance(data_in, str) and os.path.exists(
-        data_in
-    ):  # wav_path; filelist: wav.scp, file.jsonl;text.txt;
-        _, file_extension = os.path.splitext(data_in)
-        file_extension = file_extension.lower()
-        if file_extension in filelist:  # filelist: wav.scp, file.jsonl;text.txt;
-            with open(data_in, encoding="utf-8") as fin:
-                for line in fin:
-                    key = "rand_key_" + "".join(random.choice(chars) for _ in range(13))
-                    if data_in.endswith(
-                        ".jsonl"
-                    ):  # file.jsonl: json.dumps({"source": data})
-                        lines = json.loads(line.strip())
-                        data = lines["source"]
-                        key = data["key"] if "key" in data else key
-                    else:  # filelist, wav.scp, text.txt: id \t data or data
-                        lines = line.strip().split(maxsplit=1)
-                        data = lines[1] if len(lines) > 1 else lines[0]
-                        key = lines[0] if len(lines) > 1 else key
-
-                    data_list.append(data)
-                    key_list.append(key)
-        else:
-            if key is None:
-                # key = "rand_key_" + "".join(random.choice(chars) for _ in range(13))
-                key = misc.extract_filename_without_extension(data_in)
-            data_list = [data_in]
-            key_list = [key]
-    elif isinstance(data_in, (list, tuple)):
-        if data_type is not None and isinstance(
-            data_type, (list, tuple)
-        ):  # multiple inputs
-            data_list_tmp = []
-            for data_in_i, data_type_i in zip(data_in, data_type):
-                key_list, data_list_i = prepare_data_iterator(
-                    data_in=data_in_i, data_type=data_type_i
-                )
-                data_list_tmp.append(data_list_i)
-            data_list = []
-            for item in zip(*data_list_tmp):
-                data_list.append(item)
-        else:
-            # [audio sample point, fbank, text]
-            data_list = data_in
-            key_list = []
-            for data_i in data_in:
-                if isinstance(data_i, str) and os.path.exists(data_i):
-                    key = misc.extract_filename_without_extension(data_i)
-                else:
-                    if key is None:
-                        key = "rand_key_" + "".join(
-                            random.choice(chars) for _ in range(13)
-                        )
-                key_list.append(key)
-
-    else:  # raw text; audio sample point, fbank; bytes
-        if isinstance(data_in, bytes):  # audio bytes
-            data_in = load_bytes(data_in)
-        if key is None:
-            key = "rand_key_" + "".join(random.choice(chars) for _ in range(13))
-        data_list = [data_in]
-        key_list = [key]
-
-    return key_list, data_list
-
-
-class AutoModel:
-
-    def __init__(self, **kwargs):
-
-        try:
-            from funasr.utils.version_checker import check_for_update
-
-            print(
-                "Check update of funasr, and it would cost few times. You may disable it by set `disable_update=True` in AutoModel"
-            )
-            check_for_update(disable=kwargs.get("disable_update", False))
-        except:
-            pass
-
-        log_level = getattr(logging, kwargs.get("log_level", "INFO").upper())
-        logging.basicConfig(level=log_level)
-
-        model, kwargs = self.build_model(**kwargs)
-
-        # if vad_model is not None, build vad model else None
-        vad_model = kwargs.get("vad_model", None)
-        vad_kwargs = (
-            {} if kwargs.get("vad_kwargs", {}) is None else kwargs.get("vad_kwargs", {})
-        )
-        if vad_model is not None:
-            logging.info("Building VAD model.")
-            vad_kwargs["model"] = vad_model
-            vad_kwargs["model_revision"] = kwargs.get("vad_model_revision", "master")
-            vad_kwargs["device"] = kwargs["device"]
-            vad_model, vad_kwargs = self.build_model(**vad_kwargs)
-
-        # if punc_model is not None, build punc model else None
-        punc_model = kwargs.get("punc_model", None)
-        punc_kwargs = (
-            {}
-            if kwargs.get("punc_kwargs", {}) is None
-            else kwargs.get("punc_kwargs", {})
-        )
-        if punc_model is not None:
-            logging.info("Building punc model.")
-            punc_kwargs["model"] = punc_model
-            punc_kwargs["model_revision"] = kwargs.get("punc_model_revision", "master")
-            punc_kwargs["device"] = kwargs["device"]
-            punc_model, punc_kwargs = self.build_model(**punc_kwargs)
-
-        # if spk_model is not None, build spk model else None
-        spk_model = kwargs.get("spk_model", None)
-        spk_kwargs = (
-            {} if kwargs.get("spk_kwargs", {}) is None else kwargs.get("spk_kwargs", {})
-        )
-        if spk_model is not None:
-            logging.info("Building SPK model.")
-            spk_kwargs["model"] = spk_model
-            spk_kwargs["model_revision"] = kwargs.get("spk_model_revision", "master")
-            spk_kwargs["device"] = kwargs["device"]
-            spk_model, spk_kwargs = self.build_model(**spk_kwargs)
-            self.cb_model = ClusterBackend().to(kwargs["device"])
-            spk_mode = kwargs.get("spk_mode", "punc_segment")
-            if spk_mode not in ["default", "vad_segment", "punc_segment"]:
-                logging.error(
-                    "spk_mode should be one of default, vad_segment and punc_segment."
-                )
-            self.spk_mode = spk_mode
-
-        self.kwargs = kwargs
-        self.model = model
-        self.vad_model = vad_model
-        self.vad_kwargs = vad_kwargs
-        self.punc_model = punc_model
-        self.punc_kwargs = punc_kwargs
-        self.spk_model = spk_model
-        self.spk_kwargs = spk_kwargs
-        self.model_path = kwargs.get("model_path")
-
-    @staticmethod
-    def build_model(**kwargs):
-        assert "model" in kwargs
-        if "model_conf" not in kwargs:
-            logging.info(
-                "download models from model hub: {}".format(kwargs.get("hub", "ms"))
-            )
-            kwargs = download_model(**kwargs)
-
-        set_all_random_seed(kwargs.get("seed", 0))
-
-        device = kwargs.get("device", "cuda")
-        if not torch.cuda.is_available() or kwargs.get("ngpu", 1) == 0:
-            device = "cpu"
-            kwargs["batch_size"] = 1
-        kwargs["device"] = device
-
-        torch.set_num_threads(kwargs.get("ncpu", 4))
-
-        # build tokenizer
-        tokenizer = kwargs.get("tokenizer", None)
-        if tokenizer is not None:
-            tokenizer_class = tables.tokenizer_classes.get(tokenizer)
-            tokenizer = tokenizer_class(**kwargs.get("tokenizer_conf", {}))
-            kwargs["token_list"] = (
-                tokenizer.token_list if hasattr(tokenizer, "token_list") else None
-            )
-            kwargs["token_list"] = (
-                tokenizer.get_vocab()
-                if hasattr(tokenizer, "get_vocab")
-                else kwargs["token_list"]
-            )
-            vocab_size = (
-                len(kwargs["token_list"]) if kwargs["token_list"] is not None else -1
-            )
-            if vocab_size == -1 and hasattr(tokenizer, "get_vocab_size"):
-                vocab_size = tokenizer.get_vocab_size()
-        else:
-            vocab_size = -1
-        kwargs["tokenizer"] = tokenizer
-
-        # build frontend
-        frontend = kwargs.get("frontend", None)
-        kwargs["input_size"] = None
-        if frontend is not None:
-            frontend_class = tables.frontend_classes.get(frontend)
-            frontend = frontend_class(**kwargs.get("frontend_conf", {}))
-            kwargs["input_size"] = (
-                frontend.output_size() if hasattr(frontend, "output_size") else None
-            )
-        kwargs["frontend"] = frontend
-        # build model
-        model_class = tables.model_classes.get(kwargs["model"])
-        assert model_class is not None, f'{kwargs["model"]} is not registered'
-        model_conf = {}
-        deep_update(model_conf, kwargs.get("model_conf", {}))
-        deep_update(model_conf, kwargs)
-        model = model_class(**model_conf, vocab_size=vocab_size)
-
-        # init_param
-        init_param = kwargs.get("init_param", None)
-        if init_param is not None:
-            if os.path.exists(init_param):
-                logging.info(f"Loading pretrained params from {init_param}")
-                load_pretrained_model(
-                    model=model,
-                    path=init_param,
-                    ignore_init_mismatch=kwargs.get("ignore_init_mismatch", True),
-                    oss_bucket=kwargs.get("oss_bucket", None),
-                    scope_map=kwargs.get("scope_map", []),
-                    excludes=kwargs.get("excludes", None),
-                )
-            else:
-                print(f"error, init_param does not exist!: {init_param}")
-
-        # fp16
-        if kwargs.get("fp16", False):
-            model.to(torch.float16)
-        elif kwargs.get("bf16", False):
-            model.to(torch.bfloat16)
-        model.to(device)
-
-        if not kwargs.get("disable_log", True):
-            tables.print()
-
-        return model, kwargs
-
-    def __call__(self, *args, **cfg):
-        kwargs = self.kwargs
-        deep_update(kwargs, cfg)
-        res = self.model(*args, kwargs)
-        return res
-
-    def generate(self, input, input_len=None, **cfg):
-        if self.vad_model is None:
-            return self.inference(input, input_len=input_len, **cfg)
-
-        else:
-            return self.inference_with_vad(input, input_len=input_len, **cfg)
-
-    def inference(
-        self, input, input_len=None, model=None, kwargs=None, key=None, **cfg
-    ):
-        kwargs = self.kwargs if kwargs is None else kwargs
-        if "cache" in kwargs:
-            kwargs.pop("cache")
-        deep_update(kwargs, cfg)
-        model = self.model if model is None else model
-        model.eval()
-
-        batch_size = kwargs.get("batch_size", 1)
-        # if kwargs.get("device", "cpu") == "cpu":
-        #     batch_size = 1
-
-        key_list, data_list = prepare_data_iterator(
-            input, input_len=input_len, data_type=kwargs.get("data_type", None), key=key
-        )
-
-        speed_stats = {}
-        asr_result_list = []
-        num_samples = len(data_list)
-        disable_pbar = self.kwargs.get("disable_pbar", False)
-        pbar = (
-            tqdm(colour="blue", total=num_samples, dynamic_ncols=True)
-            if not disable_pbar
-            else None
-        )
-        time_speech_total = 0.0
-        time_escape_total = 0.0
-        for beg_idx in range(0, num_samples, batch_size):
-            end_idx = min(num_samples, beg_idx + batch_size)
-            data_batch = data_list[beg_idx:end_idx]
-            key_batch = key_list[beg_idx:end_idx]
-            batch = {"data_in": data_batch, "key": key_batch}
-
-            if (end_idx - beg_idx) == 1 and kwargs.get(
-                "data_type", None
-            ) == "fbank":  # fbank
-                batch["data_in"] = data_batch[0]
-                batch["data_lengths"] = input_len
-
-            time1 = time.perf_counter()
-            with torch.no_grad():
-                res = model.inference(**batch, **kwargs)
-                if isinstance(res, (list, tuple)):
-                    results = res[0] if len(res) > 0 else [{"text": ""}]
-                    meta_data = res[1] if len(res) > 1 else {}
-            time2 = time.perf_counter()
-
-            asr_result_list.extend(results)
-
-            # batch_data_time = time_per_frame_s * data_batch_i["speech_lengths"].sum().item()
-            batch_data_time = meta_data.get("batch_data_time", -1)
-            time_escape = time2 - time1
-            speed_stats["load_data"] = meta_data.get("load_data", 0.0)
-            speed_stats["extract_feat"] = meta_data.get("extract_feat", 0.0)
-            speed_stats["forward"] = f"{time_escape:0.3f}"
-            speed_stats["batch_size"] = f"{len(results)}"
-            speed_stats["rtf"] = f"{(time_escape) / batch_data_time:0.3f}"
-            description = f"{speed_stats}, "
-            if pbar:
-                pbar.update(end_idx - beg_idx)
-                pbar.set_description(description)
-            time_speech_total += batch_data_time
-            time_escape_total += time_escape
-
-        if pbar:
-            # pbar.update(1)
-            pbar.set_description(f"rtf_avg: {time_escape_total/time_speech_total:0.3f}")
-        torch.cuda.empty_cache()
-        return asr_result_list
-
-    def vad(self, input, input_len=None, **cfg):
-        kwargs = self.kwargs
-        # step.1: compute the vad model
-        deep_update(self.vad_kwargs, cfg)
-        beg_vad = time.time()
-        res = self.inference(
-            input,
-            input_len=input_len,
-            model=self.vad_model,
-            kwargs=self.vad_kwargs,
-            **cfg,
-        )
-        end_vad = time.time()
-        #  FIX(gcf): concat the VAD clips for the SenseVoice model for better AED
-        if cfg.get("merge_vad", False):
-            for i in range(len(res)):
-                res[i]["value"] = merge_vad(
-                    res[i]["value"], kwargs.get("merge_length_s", 15) * 1000
-                )
-        elapsed = end_vad - beg_vad
-        return elapsed, res
-
-    def inference_with_vadres(self, input, vad_res, input_len=None, **cfg):
-
-        kwargs = self.kwargs
-
-        # step.2 compute asr model
-        model = self.model
-        deep_update(kwargs, cfg)
-        batch_size = max(int(kwargs.get("batch_size_s", 300)) * 1000, 1)
-        batch_size_threshold_ms = int(kwargs.get("batch_size_threshold_s", 60)) * 1000
-        kwargs["batch_size"] = batch_size
-
-        key_list, data_list = prepare_data_iterator(
-            input, input_len=input_len, data_type=kwargs.get("data_type", None)
-        )
-        results_ret_list = []
-        time_speech_total_all_samples = 1e-6
-
-        beg_total = time.time()
-        pbar_total = (
-            tqdm(colour="red", total=len(vad_res), dynamic_ncols=True)
-            if not kwargs.get("disable_pbar", False)
-            else None
-        )
-
-        for i in range(len(vad_res)):
-            key = vad_res[i]["key"]
-            vadsegments = vad_res[i]["value"]
-            input_i = data_list[i]
-            fs = kwargs["frontend"].fs if hasattr(kwargs["frontend"], "fs") else 16000
-            speech = load_audio_text_image_video(
-                input_i, fs=fs, audio_fs=kwargs.get("fs", 16000)
-            )
-            speech_lengths = len(speech)
-            n = len(vadsegments)
-            data_with_index = [(vadsegments[i], i) for i in range(n)]
-            sorted_data = sorted(data_with_index, key=lambda x: x[0][1] - x[0][0])
-            results_sorted = []
-
-            if not len(sorted_data):
-                results_ret_list.append({"key": key, "text": "", "timestamp": []})
-                logging.info("decoding, utt: {}, empty speech".format(key))
-                continue
-
-            if len(sorted_data) > 0 and len(sorted_data[0]) > 0:
-                batch_size = max(
-                    batch_size, sorted_data[0][0][1] - sorted_data[0][0][0]
-                )
-
-            if kwargs["device"] == "cpu":
-                batch_size = 0
-
-            beg_idx = 0
-            beg_asr_total = time.time()
-            time_speech_total_per_sample = speech_lengths / 16000
-            time_speech_total_all_samples += time_speech_total_per_sample
-
-            # pbar_sample = tqdm(colour="blue", total=n, dynamic_ncols=True)
-
-            all_segments = []
-            max_len_in_batch = 0
-            end_idx = 1
-
-            for j, _ in enumerate(range(0, n)):
-                # pbar_sample.update(1)
-                sample_length = sorted_data[j][0][1] - sorted_data[j][0][0]
-                potential_batch_length = max(max_len_in_batch, sample_length) * (
-                    j + 1 - beg_idx
-                )
-                # batch_size_ms_cum += sorted_data[j][0][1] - sorted_data[j][0][0]
-                if (
-                    j < n - 1
-                    and sample_length < batch_size_threshold_ms
-                    and potential_batch_length < batch_size
-                ):
-                    max_len_in_batch = max(max_len_in_batch, sample_length)
-                    end_idx += 1
-                    continue
-
-                speech_j, speech_lengths_j, intervals = slice_padding_audio_samples(
-                    speech, speech_lengths, sorted_data[beg_idx:end_idx]
-                )
-                results = self.inference(
-                    speech_j, input_len=None, model=model, kwargs=kwargs, **cfg
-                )
-
-                for _b in range(len(speech_j)):
-                    results[_b]["interval"] = intervals[_b]
-
-                if self.spk_model is not None:
-                    # compose vad segments: [[start_time_sec, end_time_sec, speech], [...]]
-                    for _b in range(len(speech_j)):
-                        vad_segments = [
-                            [
-                                sorted_data[beg_idx:end_idx][_b][0][0] / 1000.0,
-                                sorted_data[beg_idx:end_idx][_b][0][1] / 1000.0,
-                                np.array(speech_j[_b]),
-                            ]
-                        ]
-                        segments = sv_chunk(vad_segments)
-                        all_segments.extend(segments)
-                        speech_b = [i[2] for i in segments]
-                        spk_res = self.inference(
-                            speech_b,
-                            input_len=None,
-                            model=self.spk_model,
-                            kwargs=kwargs,
-                            **cfg,
-                        )
-                        results[_b]["spk_embedding"] = spk_res[0]["spk_embedding"]
-
-                beg_idx = end_idx
-                end_idx += 1
-                max_len_in_batch = sample_length
-                if len(results) < 1:
-                    continue
-                results_sorted.extend(results)
-
-            # end_asr_total = time.time()
-            # time_escape_total_per_sample = end_asr_total - beg_asr_total
-            # pbar_sample.update(1)
-            # pbar_sample.set_description(f"rtf_avg_per_sample: {time_escape_total_per_sample / time_speech_total_per_sample:0.3f}, "
-            #                      f"time_speech_total_per_sample: {time_speech_total_per_sample: 0.3f}, "
-            #                      f"time_escape_total_per_sample: {time_escape_total_per_sample:0.3f}")
-
-            restored_data = [0] * n
-            for j in range(n):
-                index = sorted_data[j][1]
-                cur = results_sorted[j]
-                pattern = r"<\|([^|]+)\|>"
-                emotion_string = re.findall(pattern, cur["text"])
-                cur["text"] = re.sub(pattern, "", cur["text"])
-                cur["emo"] = "".join([f"<|{t}|>" for t in emotion_string])
-                if self.punc_model is not None and len(cur["text"].strip()) > 0:
-                    deep_update(self.punc_kwargs, cfg)
-                    punc_res = self.inference(
-                        cur["text"],
-                        model=self.punc_model,
-                        kwargs=self.punc_kwargs,
-                        **cfg,
-                    )
-                    cur["text"] = punc_res[0]["text"]
-
-                restored_data[index] = cur
-
-            end_asr_total = time.time()
-            time_escape_total_per_sample = end_asr_total - beg_asr_total
-            if pbar_total:
-                pbar_total.update(1)
-                pbar_total.set_description(
-                    f"rtf_avg: {time_escape_total_per_sample / time_speech_total_per_sample:0.3f}, "
-                    f"time_speech: {time_speech_total_per_sample: 0.3f}, "
-                    f"time_escape: {time_escape_total_per_sample:0.3f}"
-                )
-
-        # end_total = time.time()
-        # time_escape_total_all_samples = end_total - beg_total
-        # print(f"rtf_avg_all: {time_escape_total_all_samples / time_speech_total_all_samples:0.3f}, "
-        #                      f"time_speech_all: {time_speech_total_all_samples: 0.3f}, "
-        #                      f"time_escape_all: {time_escape_total_all_samples:0.3f}")
-        return restored_data
-
-    def export(self, input=None, **cfg):
-        """
-
-        :param input:
-        :param type:
-        :param quantize:
-        :param fallback_num:
-        :param calib_num:
-        :param opset_version:
-        :param cfg:
-        :return:
-        """
-
-        device = cfg.get("device", "cpu")
-        model = self.model.to(device=device)
-        kwargs = self.kwargs
-        deep_update(kwargs, cfg)
-        kwargs["device"] = device
-        del kwargs["model"]
-        model.eval()
-
-        type = kwargs.get("type", "onnx")
-
-        key_list, data_list = prepare_data_iterator(
-            input, input_len=None, data_type=kwargs.get("data_type", None), key=None
-        )
-
-        with torch.no_grad():
-            export_dir = export_utils.export(model=model, data_in=data_list, **kwargs)
-
-        return export_dir
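
The core idea in `inference_with_vadres` above is duration-sorted batching: segments are processed shortest-first so that the padded size of each batch (longest member × batch count) stays under a budget. A standalone sketch of that scheme, with an illustrative budget:

```python
def make_batches(segments: list[list[int]], budget_ms: int = 300_000) -> list[list[int]]:
    # Group segment indices so padded batch size (longest * count) stays under budget.
    order = sorted(range(len(segments)), key=lambda i: segments[i][1] - segments[i][0])
    batches: list[list[int]] = []
    cur: list[int] = []
    longest = 0
    for i in order:
        length = segments[i][1] - segments[i][0]
        if cur and max(longest, length) * (len(cur) + 1) > budget_ms:
            batches.append(cur)
            cur, longest = [], 0
        cur.append(i)
        longest = max(longest, length)
    if cur:
        batches.append(cur)
    return batches
```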

+ 0 - 332
tools/sensevoice/fun_asr.py

@@ -1,332 +0,0 @@
-import gc
-import os
-import re
-
-from audio_separator.separator import Separator
-
-os.environ["MODELSCOPE_CACHE"] = "./.cache/funasr"
-os.environ["UVR5_CACHE"] = "./.cache/uvr5-models"
-import json
-import subprocess
-from pathlib import Path
-
-import click
-import torch
-from loguru import logger
-from pydub import AudioSegment
-from silero_vad import get_speech_timestamps, load_silero_vad, read_audio
-from tqdm import tqdm
-
-from fish_speech.utils.file import AUDIO_EXTENSIONS, VIDEO_EXTENSIONS, list_files
-from tools.sensevoice.auto_model import AutoModel
-
-
-def uvr5_cli(
-    audio_dir: Path,
-    output_folder: Path,
-    audio_files: list[Path] | None = None,
-    output_format: str = "flac",
-    model: str = "BS-Roformer-Viperx-1297.ckpt",
-):
-    # ["BS-Roformer-Viperx-1297.ckpt", "BS-Roformer-Viperx-1296.ckpt", "BS-Roformer-Viperx-1053.ckpt", "Mel-Roformer-Viperx-1143.ckpt"]
-    sepr = Separator(
-        model_file_dir=os.environ["UVR5_CACHE"],
-        output_dir=output_folder,
-        output_format=output_format,
-    )
-    dictmodel = {
-        "BS-Roformer-Viperx-1297.ckpt": "model_bs_roformer_ep_317_sdr_12.9755.ckpt",
-        "BS-Roformer-Viperx-1296.ckpt": "model_bs_roformer_ep_368_sdr_12.9628.ckpt",
-        "BS-Roformer-Viperx-1053.ckpt": "model_bs_roformer_ep_937_sdr_10.5309.ckpt",
-        "Mel-Roformer-Viperx-1143.ckpt": "model_mel_band_roformer_ep_3005_sdr_11.4360.ckpt",
-    }
-    roformer_model = dictmodel[model]
-    sepr.load_model(roformer_model)
-    if audio_files is None:
-        audio_files = list_files(
-            path=audio_dir, extensions=AUDIO_EXTENSIONS, recursive=True
-        )
-    total_files = len(audio_files)
-
-    print(f"{total_files} audio files found")
-
-    res = []
-    for audio in tqdm(audio_files, desc="Denoising: "):
-        file_path = str(audio_dir / audio)
-        sep_out = sepr.separate(file_path)
-        if isinstance(sep_out, str):
-            res.append(sep_out)
-        elif isinstance(sep_out, list):
-            res.extend(sep_out)
-    del sepr
-    gc.collect()
-    if torch.cuda.is_available():
-        torch.cuda.empty_cache()
-
-    return res, roformer_model
-
-
-def get_sample_rate(media_path: Path):
-    result = subprocess.run(
-        [
-            "ffprobe",
-            "-v",
-            "quiet",
-            "-print_format",
-            "json",
-            "-show_streams",
-            str(media_path),
-        ],
-        capture_output=True,
-        text=True,
-        check=True,
-    )
-    media_info = json.loads(result.stdout)
-    for stream in media_info.get("streams", []):
-        if stream.get("codec_type") == "audio":
-            return stream.get("sample_rate")
-    return "44100"  # Default sample rate if not found
-
-
-def convert_to_mono(src_path: Path, out_path: Path, out_fmt: str = "wav"):
-    sr = get_sample_rate(src_path)
-    out_path.parent.mkdir(parents=True, exist_ok=True)
-    if src_path.resolve() == out_path.resolve():
-        output = str(out_path.with_stem(out_path.stem + f"_{sr}"))
-    else:
-        output = str(out_path)
-    subprocess.run(
-        [
-            "ffmpeg",
-            "-loglevel",
-            "error",
-            "-i",
-            str(src_path),
-            "-acodec",
-            "pcm_s16le" if out_fmt == "wav" else "flac",
-            "-ar",
-            sr,
-            "-ac",
-            "1",
-            "-y",
-            output,
-        ],
-        check=True,
-    )
-    return out_path
-
-
-def convert_video_to_audio(video_path: Path, audio_dir: Path):
-    cur_dir = audio_dir / video_path.relative_to(audio_dir).parent
-    vocals = [
-        p
-        for p in cur_dir.glob(f"{video_path.stem}_(Vocals)*.*")
-        if p.suffix in AUDIO_EXTENSIONS
-    ]
-    if len(vocals) > 0:
-        return vocals[0]
-    audio_path = cur_dir / f"{video_path.stem}.wav"
-    convert_to_mono(video_path, audio_path)
-    return audio_path
-
-
-@click.command()
-@click.option("--audio-dir", required=True, help="Directory containing audio files")
-@click.option(
-    "--save-dir", required=True, help="Directory to save processed audio files"
-)
-@click.option("--device", default="cuda", help="Device to use [cuda / cpu]")
-@click.option("--language", default="auto", help="Language of the transcription")
-@click.option(
-    "--max_single_segment_time",
-    default=20000,
-    type=int,
-    help="Maximum of Output single audio duration(ms)",
-)
-@click.option("--fsmn-vad/--silero-vad", default=False)
-@click.option("--punc/--no-punc", default=False)
-@click.option("--denoise/--no-denoise", default=False)
-@click.option("--save_emo/--no_save_emo", default=False)
-def main(
-    audio_dir: str,
-    save_dir: str,
-    device: str,
-    language: str,
-    max_single_segment_time: int,
-    fsmn_vad: bool,
-    punc: bool,
-    denoise: bool,
-    save_emo: bool,
-):
-
-    audios_path = Path(audio_dir)
-    save_path = Path(save_dir)
-    save_path.mkdir(parents=True, exist_ok=True)
-
-    video_files = list_files(
-        path=audio_dir, extensions=VIDEO_EXTENSIONS, recursive=True
-    )
-    v2a_files = [convert_video_to_audio(p, audio_dir) for p in video_files]
-
-    if denoise:
-        VOCAL = "_(Vocals)"
-        original_files = [
-            p
-            for p in audios_path.glob("**/*")
-            if p.suffix in AUDIO_EXTENSIONS and VOCAL not in p.stem
-        ]
-
-        _, cur_model = uvr5_cli(
-            audio_dir=audio_dir, output_folder=audio_dir, audio_files=original_files
-        )
-        need_remove = [p for p in audios_path.glob("**/*(Instrumental)*")]
-        need_remove.extend(original_files)
-        for _ in need_remove:
-            _.unlink()
-        vocal_files = [
-            p
-            for p in audios_path.glob("**/*")
-            if p.suffix in AUDIO_EXTENSIONS and VOCAL in p.stem
-        ]
-        for f in vocal_files:
-            fn, ext = f.stem, f.suffix
-
-            v_pos = fn.find(VOCAL + "_" + cur_model.split(".")[0])
-            if v_pos != -1:
-                new_fn = fn[: v_pos + len(VOCAL)]
-                new_f = f.with_name(new_fn + ext)
-                f = f.rename(new_f)
-                convert_to_mono(f, f, "flac")
-                f.unlink()
-
-    audio_files = list_files(
-        path=audio_dir, extensions=AUDIO_EXTENSIONS, recursive=True
-    )
-
-    logger.info("Loading / Downloading Funasr model...")
-
-    model_dir = "iic/SenseVoiceSmall"
-
-    vad_model = "fsmn-vad" if fsmn_vad else None
-    vad_kwargs = {"max_single_segment_time": max_single_segment_time}
-    punc_model = "ct-punc" if punc else None
-
-    manager = AutoModel(
-        model=model_dir,
-        trust_remote_code=False,
-        vad_model=vad_model,
-        vad_kwargs=vad_kwargs,
-        punc_model=punc_model,
-        device=device,
-    )
-
-    if not fsmn_vad and vad_model is None:
-        vad_model = load_silero_vad()
-
-    logger.info("Model loaded.")
-
-    pattern = re.compile(r"_\d{3}\.")
-
-    for file_path in tqdm(audio_files, desc="Processing audio file"):
-
-        if pattern.search(file_path.name):
-            # logger.info(f"Skipping {file_path} as it has already been processed.")
-            continue
-
-        file_stem = file_path.stem
-        file_suffix = file_path.suffix
-
-        rel_path = Path(file_path).relative_to(audio_dir)
-        (save_path / rel_path.parent).mkdir(parents=True, exist_ok=True)
-
-        audio = AudioSegment.from_file(file_path)
-
-        cfg = dict(
-            cache={},
-            language=language,  # "zh", "en", "yue", "ja", "ko", "nospeech"
-            use_itn=False,
-            batch_size_s=60,
-        )
-
-        if fsmn_vad:
-            elapsed, vad_res = manager.vad(input=str(file_path), **cfg)
-        else:
-            wav = read_audio(
-                str(file_path)
-            )  # backend (sox, soundfile, or ffmpeg) required!
-            audio_key = file_path.stem
-            audio_val = []
-            speech_timestamps = get_speech_timestamps(
-                wav,
-                vad_model,
-                max_speech_duration_s=max_single_segment_time // 1000,
-                return_seconds=True,
-            )
-
-            audio_val = [
-                [int(timestamp["start"] * 1000), int(timestamp["end"] * 1000)]
-                for timestamp in speech_timestamps
-            ]
-            vad_res = []
-            vad_res.append(dict(key=audio_key, value=audio_val))
-
-        res = manager.inference_with_vadres(
-            input=str(file_path), vad_res=vad_res, **cfg
-        )
-
-        for i, info in enumerate(res):
-            [start_ms, end_ms] = info["interval"]
-            text = info["text"]
-            emo = info["emo"]
-            sliced_audio = audio[start_ms:end_ms]
-            audio_save_path = (
-                save_path / rel_path.parent / f"{file_stem}_{i:03d}{file_suffix}"
-            )
-            sliced_audio.export(audio_save_path, format=file_suffix[1:])
-            print(f"Exported {audio_save_path}: {text}")
-
-            transcript_save_path = (
-                save_path / rel_path.parent / f"{file_stem}_{i:03d}.lab"
-            )
-            with open(
-                transcript_save_path,
-                "w",
-                encoding="utf-8",
-            ) as f:
-                f.write(text)
-
-            if save_emo:
-                emo_save_path = save_path / rel_path.parent / f"{file_stem}_{i:03d}.emo"
-                with open(
-                    emo_save_path,
-                    "w",
-                    encoding="utf-8",
-                ) as f:
-                    f.write(emo)
-
-        if audios_path.resolve() == save_path.resolve():
-            file_path.unlink()
-
-
-if __name__ == "__main__":
-    main()
-    exit(0)
-    # NOTE: the block below is unreachable scratch code; `SenseVoiceSmall`
-    # is never imported in this file.
-    from funasr.utils.postprocess_utils import rich_transcription_postprocess
-
-    # Load the audio file
-    audio_path = Path(r"D:\PythonProject\ok\1_output_(Vocals).wav")
-    model_dir = "iic/SenseVoiceSmall"
-    m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir, device="cuda:0")
-    m.eval()
-
-    res = m.inference(
-        data_in=f"{kwargs['model_path']}/example/zh.mp3",
-        language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
-        use_itn=False,
-        ban_emo_unk=False,
-        **kwargs,
-    )
-
-    print(res)
-    text = rich_transcription_postprocess(res[0][0]["text"])
-    print(text)

+ 0 - 61
tools/sensevoice/vad_utils.py

@@ -1,61 +0,0 @@
-import torch
-from torch.nn.utils.rnn import pad_sequence
-
-
-def slice_padding_fbank(speech, speech_lengths, vad_segments):
-    speech_list = []
-    speech_lengths_list = []
-    for i, segment in enumerate(vad_segments):
-
-        bed_idx = int(segment[0][0] * 16)
-        end_idx = min(int(segment[0][1] * 16), speech_lengths[0])
-        speech_i = speech[0, bed_idx:end_idx]
-        speech_lengths_i = end_idx - bed_idx
-        speech_list.append(speech_i)
-        speech_lengths_list.append(speech_lengths_i)
-    feats_pad = pad_sequence(speech_list, batch_first=True, padding_value=0.0)
-    speech_lengths_pad = torch.Tensor(speech_lengths_list).int()
-    return feats_pad, speech_lengths_pad
-
-
-def slice_padding_audio_samples(speech, speech_lengths, vad_segments):
-    speech_list = []
-    speech_lengths_list = []
-    intervals = []
-    for i, segment in enumerate(vad_segments):
-        bed_idx = int(segment[0][0] * 16)
-        end_idx = min(int(segment[0][1] * 16), speech_lengths)
-        speech_i = speech[bed_idx:end_idx]
-        speech_lengths_i = end_idx - bed_idx
-        speech_list.append(speech_i)
-        speech_lengths_list.append(speech_lengths_i)
-        intervals.append([bed_idx // 16, end_idx // 16])
-
-    return speech_list, speech_lengths_list, intervals
-
-
-def merge_vad(vad_result, max_length=15000, min_length=0):
-    new_result = []
-    if len(vad_result) <= 1:
-        return vad_result
-    time_step = [t[0] for t in vad_result] + [t[1] for t in vad_result]
-    time_step = sorted(list(set(time_step)))
-    if len(time_step) == 0:
-        return []
-    bg = 0
-    for i in range(len(time_step) - 1):
-        time = time_step[i]
-        if time_step[i + 1] - bg < max_length:
-            continue
-        if time - bg > min_length:
-            new_result.append([bg, time])
-        # if time - bg < max_length * 1.5:
-        #     new_result.append([bg, time])
-        # else:
-        #     split_num = int(time - bg) // max_length + 1
-        #     spl_l = int(time - bg) // split_num
-        #     for j in range(split_num):
-        #         new_result.append([bg + j * spl_l, bg + (j + 1) * spl_l])
-        bg = time
-    new_result.append([bg, time_step[-1]])
-    return new_result
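
A worked trace of `merge_vad` above (times in ms): boundaries are coalesced until the span since the last cut would reach `max_length`, then a cut is emitted.

```python
segments = [[0, 3000], [4000, 9000], [10000, 20000]]
print(merge_vad(segments, max_length=15000))
# -> [[0, 10000], [10000, 20000]]: the cut lands at 10000, because extending
#    the first span to 20000 would exceed max_length.
```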

+ 0 - 17
tools/vqgan/inference.py

@@ -1,17 +0,0 @@
-#!/usr/bin/env python
-import os
-import subprocess
-import sys
-
-
-
-def main():
-    # Make path relative to this file
-    script_path = os.path.join(
-        os.path.dirname(__file__), "../../fish_speech/models/vqgan/inference.py"
-    )
-    subprocess.run(["python", script_path] + sys.argv[1:])
-
-
-if __name__ == "__main__":
-    main()

+ 1 - 19
tools/webui/__init__.py

@@ -3,7 +3,6 @@ from typing import Callable
 import gradio as gr
 
 from fish_speech.i18n import i18n
-from fish_speech.inference_engine.utils import normalize_text
 from tools.webui.variables import HEADER_MD, TEXTBOX_PLACEHOLDER
 
 
@@ -25,20 +24,6 @@ def build_app(inference_fct: Callable, theme: str = "light") -> gr.Blocks:
                 text = gr.Textbox(
                     label=i18n("Input Text"), placeholder=TEXTBOX_PLACEHOLDER, lines=10
                 )
-                refined_text = gr.Textbox(
-                    label=i18n("Realtime Transform Text"),
-                    placeholder=i18n(
-                        "Normalization Result Preview (Currently Only Chinese)"
-                    ),
-                    lines=5,
-                    interactive=False,
-                )
-
-                with gr.Row():
-                    normalize = gr.Checkbox(
-                        label=i18n("Text Normalization"),
-                        value=False,
-                    )
 
                 with gr.Row():
                     with gr.Column():
@@ -147,14 +132,11 @@ def build_app(inference_fct: Callable, theme: str = "light") -> gr.Blocks:
                             variant="primary",
                         )
 
-        text.input(fn=normalize_text, inputs=[text, normalize], outputs=[refined_text])
-
         # Submit
         generate.click(
             inference_fct,
             [
-                refined_text,
-                normalize,
+                text,
                 reference_id,
                 reference_audio,
                 reference_text,

+ 0 - 2
tools/webui/inference.py

@@ -8,7 +8,6 @@ from fish_speech.utils.schema import ServeReferenceAudio, ServeTTSRequest
 
 def inference_wrapper(
     text,
-    normalize,
     reference_id,
     reference_audio,
     reference_text,
@@ -33,7 +32,6 @@ def inference_wrapper(
 
     req = ServeTTSRequest(
         text=text,
-        normalize=normalize,
         reference_id=reference_id if reference_id else None,
         references=references,
         max_new_tokens=max_new_tokens,