This codebase is released under the Apache License, and all model weights are released under the CC-BY-NC-SA-4.0 License. Please refer to [LICENSE](LICENSE) for more details.
We are excited to announce that we have changed our name to OpenAudio, launching a brand-new series of Text-to-Speech models.
Demo available at [Fish Audio Playground](https://fish.audio).
Visit the [OpenAudio website](https://openaudio.com) for blog & tech report.
## Features
### OpenAudio-S1 (Fish-Speech's new version)
1. This model retains **ALL FEATURES** of fish-speech.
2. OpenAudio S1 supports a variety of emotion, tone, and special markers to enhance speech synthesis:
(angry) (sad) (disdainful) (excited) (surprised) (satisfied) (unhappy) (anxious) (hysterical) (delighted) (scared) (worried) (indifferent) (upset) (impatient) (nervous) (guilty) (scornful) (frustrated) (depressed) (panicked) (furious) (empathetic) (embarrassed) (reluctant) (disgusted) (keen) (moved) (proud) (relaxed) (grateful) (confident) (interested) (curious) (confused) (joyful) (disapproving) (negative) (denying) (astonished) (serious) (sarcastic) (conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding) (painful) (awkward) (amused)
It also supports tone markers:
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
A few special markers are also supported:
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting) (groaning) (crowd laughing) (background laughter) (audience laughing)
You can also use **Ha,ha,ha** for control; many other cases are waiting to be explored.
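As an illustration, a synthesis prompt mixing the emotion, tone, and special markers listed above might look like the following sketch (the sentence content is invented; see [Inference](docs/en/inference.md) for how text is actually passed to the model):

```python
# Hypothetical example: composing input text with OpenAudio S1 markers.
# Marker names are taken from the lists above; markers are written inline
# in parentheses directly inside the text to be synthesized.
text = (
    "(excited) We finally shipped the release! "
    "(whispering) But keep it quiet for now. "
    "(laughing) Ha,ha,ha!"
)
print(text)
```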
3. OpenAudio S1 is available in the following sizes:
- **S1 (4B, proprietary):** The full-sized model.
- **S1-mini (0.5B, open-sourced):** A distilled version of S1.
Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
4. Evaluations
**Seed TTS Eval Metrics (English, auto eval, based on OpenAI gpt-4o-transcribe, speaker distance using Revai/pyannote-wespeaker-voxceleb-resnet34-LM):**
- **S1:**
- WER (Word Error Rate): **0.008**
- CER (Character Error Rate): **0.004**
- Distance: **0.332**
- **S1-mini:**
- WER (Word Error Rate): **0.011**
- CER (Character Error Rate): **0.005**
- Distance: **0.380**
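For reference, WER and CER are both edit-distance ratios: the Levenshtein distance between the transcribed hypothesis and the reference text, divided by the reference length, at the word and character level respectively. A minimal sketch (not the evaluation harness above, which uses gpt-4o-transcribe for transcription) could look like:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming over a single row.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                            # deletion
                dp[j - 1] + 1,                        # insertion
                prev + (ref[i - 1] != hyp[j - 1]),    # substitution / match
            )
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    # Word Error Rate: word-level edit distance / number of reference words.
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character Error Rate: character-level edit distance / reference length.
    return edit_distance(reference, hypothesis) / len(reference)
```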
## Disclaimer
We are not responsible for any illegal use of this codebase. Please refer to your local laws regarding the DMCA and other applicable regulations.
## Videos
#### To be continued.
## Documents
- [Build Environment](docs/en/install.md)
- [Inference](docs/en/inference.md)
Note that the current model **DOES NOT SUPPORT FINE-TUNING**.
## Credits
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
- [Qwen3](https://github.com/QwenLM/Qwen3)
## Tech Report (V1.4)
```bibtex
@misc{fish-speech-v1.4,
  title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
  author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
  year={2024},
  eprint={2411.01156},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2411.01156},
}
```