Fish Speech

**English** | [简体中文](docs/README.zh.md) | [Portuguese](docs/README.pt-BR.md) | [日本語](docs/README.ja.md) | [한국어](docs/README.ko.md) | [العربية](docs/README.ar.md)




[!IMPORTANT] License Notice
This codebase and its associated model weights are released under the FISH AUDIO RESEARCH LICENSE. Please refer to LICENSE for more details.

[!WARNING] Legal Disclaimer
We assume no responsibility for any illegal use of the codebase. Please consult your local laws regarding the DMCA and other applicable legislation.

Start Here

See the official Fish Speech documentation and follow its instructions to get started.

Fish Audio S2

The best text-to-speech system among both open-source and closed-source models

Fish Audio S2 is the latest model developed by Fish Audio, designed to generate speech that sounds natural, realistic, and emotionally rich — not robotic, not flat, and not constrained to studio-style narration.

Fish Audio S2 focuses on everyday conversation and dialogue, with native multi-speaker and multi-turn generation. It also supports instruction control.

The S2 series contains several models; the open-source model is S2-Pro, which is the best model in the collection.

Visit the Fish Audio website for a live playground.

Model Variants

| Model | Size | Availability | Description |
|-------|------|--------------|-------------|
| S2-Pro | 4B parameters | huggingface | Full-featured flagship model with maximum quality and stability |
| S2-Flash | - - - - | fish.audio | Our closed source model with faster speed and lower latency |

More details of the model can be found in the technical report.

Highlights

Fine-Grained Inline Control via Natural Language

Fish Audio S2 enables localized control over speech generation by embedding natural-language instructions directly at specific word or phrase positions within the text. Rather than relying on a fixed set of predefined tags, S2 accepts free-form textual descriptions — such as [whisper in small voice], [professional broadcast tone], or [pitch up] — allowing open-ended expression control at the word level.
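As a rough illustration of what "inline" means here, the sketch below separates bracketed instructions from the spoken text and records the word position at which each one applies. It is not part of the Fish Speech codebase: the helper name `split_inline_instructions` is hypothetical, and it assumes instructions are written in square brackets exactly as in the examples above. The model itself consumes the annotated text directly, so no such preprocessing is required for synthesis.

```python
import re
from typing import List, Tuple

# Assumption: inline instructions are free-form text wrapped in square brackets,
# as in the examples above. This pattern is illustrative, not the model's tokenizer.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_inline_instructions(text: str) -> Tuple[str, List[Tuple[int, str]]]:
    """Return the plain spoken text and a list of (word_index, instruction) pairs.

    word_index is the number of spoken words that precede the instruction.
    """
    plain_parts: List[str] = []
    instructions: List[Tuple[int, str]] = []
    words_so_far = 0
    cursor = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[cursor:match.start()]
        plain_parts.append(chunk)
        words_so_far += len(chunk.split())
        instructions.append((words_so_far, match.group(1).strip()))
        cursor = match.end()
    plain_parts.append(text[cursor:])
    # Collapse the whitespace left behind where instructions were removed.
    plain = " ".join("".join(plain_parts).split())
    return plain, instructions


prompt = "I have a secret. [whisper in small voice] Nobody else knows it. [pitch up] Really!"
plain, found = split_inline_instructions(prompt)
print(plain)  # I have a secret. Nobody else knows it. Really!
print(found)  # [(4, 'whisper in small voice'), (8, 'pitch up')]
```

A helper like this can be handy for linting prompts, for example to check that every instruction sits where you intended before sending the text for synthesis.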

Multilingual Support

Fish Audio S2 supports high-quality multilingual text-to-speech without requiring phonemes or language-specific preprocessing. Supported languages include:

English, Chinese, Japanese, Korean, Arabic, German, French...

And more!

The list is constantly expanding; check Fish Audio for the latest releases.

Native Multi-Speaker Generation

Fish Audio S2 lets users upload a single reference audio clip that contains multiple speakers; the model handles each speaker's characteristics via a <|speaker:i|> token. You can then steer the output with the speaker ID token, so a single generation can include multiple speakers. You no longer need to upload reference audio separately for each speaker.
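To make the token format concrete, here is a minimal sketch that assembles a multi-speaker script as a single prompt string. Only the `<|speaker:i|>` token shape comes from the description above. The helper names, zero-based indexing, placing the token at the start of each turn, and separating turns with newlines are all assumptions made for illustration; check the official documentation for the exact prompt layout.

```python
from typing import Iterable, Tuple


def speaker_token(index: int) -> str:
    """Format the <|speaker:i|> token for a given speaker index."""
    if index < 0:
        raise ValueError("speaker index must be non-negative")
    return f"<|speaker:{index}|>"


def build_dialogue(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_index, text) turns into one prompt string.

    Assumption: each turn starts with its speaker token and turns are
    separated by newlines. The real prompt layout may differ.
    """
    return "\n".join(f"{speaker_token(i)}{text.strip()}" for i, text in turns)


script = build_dialogue(
    [
        (0, "Did you finish the report?"),
        (1, "Almost. I need ten more minutes."),
        (0, "Great, send it over when you are done."),
    ]
)
print(script)
```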

Multi-Turn Generation

Thanks to an expanded context window, the model can now use earlier turns to improve the expressiveness of subsequently generated content, which makes the output sound more natural.

Rapid Voice Cloning

Fish Audio S2 supports accurate voice cloning using a short reference sample (typically 10–30 seconds). The model captures timbre, speaking style, and emotional tendencies, producing realistic and consistent cloned voices without additional fine-tuning.
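Before using a clip as a reference, it can help to confirm that its length falls in the typical range mentioned above. The sketch below does this for uncompressed WAV files using only Python's standard `wave` module. The function names are hypothetical, and the 10 to 30 second bounds are taken from the text as guidance rather than a hard limit enforced by the model.

```python
import wave

# Bounds taken from the "typically 10–30 seconds" guidance above (not a hard limit).
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_typical_reference_length(seconds: float) -> bool:
    """Check whether a duration falls inside the typical reference range."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS
```

For example, `is_typical_reference_length(wav_duration_seconds("reference.wav"))` would flag clips that are much shorter or longer than recommended, where `reference.wav` is a placeholder path.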


Credits

Tech Report

@misc{fish-speech-v1.4,
      title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
      author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
      year={2024},
      eprint={2411.01156},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2411.01156},
}