!!! info "License Notice"
    This codebase and its associated model weights are released under the **FISH AUDIO RESEARCH LICENSE**. Please refer to [LICENSE](https://github.com/fishaudio/fish-speech/blob/main/LICENSE) for more details.
!!! warning "Legal Disclaimer"
    We assume no responsibility for any illegal use of this codebase. Please refer to your local laws regarding the DMCA and other relevant regulations.
This is the official documentation for Fish Speech. Please follow the instructions to get started easily.
The best text-to-speech system among both open-source and closed-source offerings
Fish Audio S2 is the latest model developed by Fish Audio, designed to generate speech that sounds natural, authentic, and emotionally rich, rather than mechanical, flat, or confined to studio-style reading.
Fish Audio S2 focuses on everyday conversations, supporting native multi-speaker and multi-round generation. It also supports instruction control.
The S2 series includes multiple models. The open-source model is S2-Pro, which is the most powerful model in the series.
Please visit the Fish Audio website for a real-time experience.
| Model | Size | Availability | Description |
|---|---|---|---|
| S2-Pro | 4B Parameters | Hugging Face | Full-featured flagship model with the highest quality and stability |
| S2-Flash | - | fish.audio | Our closed-source model with faster speed and lower latency |
For more details on the models, please see the technical report.
Fish Audio S2 lets users control the delivery, paralinguistic cues, emotion, and other voice characteristics of each sentence with natural language, rather than relying on short tags that only vaguely steer the model's performance. This greatly improves the overall quality of the generated content.
Fish Audio S2 supports high-quality multilingual text-to-speech without the need for phonemes or language-specific preprocessing, including:
English, Chinese, Japanese, Korean, Arabic, German, French...
And more!
The list is constantly expanding; please check Fish Audio for the latest releases.
Fish Audio S2 allows users to upload reference audio containing multiple speakers, and the model captures each speaker's characteristics through the `<|speaker:i|>` token. You can then control the model's performance via speaker ID tokens, producing multiple speakers in a single generation, with no need to upload reference audio and generate speech for each speaker individually.
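The speaker-token mechanism above can be sketched as follows. This is a minimal illustration, assuming a line-based prompt in which each turn is tagged with a `<|speaker:i|>` token whose index refers to a speaker in the uploaded reference audio; `build_script` is a hypothetical helper, not part of the Fish Speech API, and the exact prompt format the model expects may differ.

```python
# Hypothetical sketch: tag each dialogue turn with a <|speaker:i|> token,
# where i indexes a speaker in the uploaded reference audio.
def build_script(turns):
    """Render (speaker_id, text) pairs as speaker-token-tagged lines."""
    return "\n".join(f"<|speaker:{i}|> {text}" for i, text in turns)

dialogue = [
    (0, "Did you catch the launch announcement?"),
    (1, "I did! The demo sounded remarkably natural."),
    (0, "Multi-speaker generation in a single pass is a big step."),
]

print(build_script(dialogue))
```

The whole tagged script is then submitted as one generation request, so both speakers appear in a single output instead of requiring one run per voice.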
Thanks to the model's expanded context, it can now use information from the preceding context to improve the expressiveness of subsequently generated content, thereby enhancing its naturalness.
Fish Audio S2 supports accurate voice cloning using short reference samples (typically 10-30 seconds). The model can capture timbre, speaking style, and emotional tendency, generating realistic and consistent cloned voices without additional fine-tuning.
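Since reference samples of roughly 10-30 seconds are recommended, a caller might validate a clip's duration before submitting it for cloning. This is a small sketch using only the Python standard library; `reference_duration_ok` is a hypothetical helper, not a Fish Speech function.

```python
import wave

# Hypothetical pre-flight check: the recommended reference length for
# voice cloning is roughly 10-30 seconds, so flag clips outside that
# window before uploading them.
def reference_duration_ok(path, lo=10.0, hi=30.0):
    """Return True if the WAV file at `path` lasts between lo and hi seconds."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
    return lo <= duration <= hi
```

A clip that fails the check can be trimmed or re-recorded, which is usually cheaper than discovering a poor clone after generation.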
```bibtex
@misc{fish-speech-v1.4,
      title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
      author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
      year={2024},
      eprint={2411.01156},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2411.01156},
}
```