!!! info "License Notice"
    This codebase and its associated model weights are released under the **FISH AUDIO RESEARCH LICENSE**. Please refer to [LICENSE](https://github.com/fishaudio/fish-speech/blob/main/LICENSE) for more details.
!!! warning "Legal Disclaimer"
    We assume no responsibility for any illegal use of this codebase. Please refer to your local laws regarding the DMCA and other relevant regulations.
This is the official documentation for Fish Speech. Please follow the instructions to get started easily.
The best text-to-speech system among both open-source and closed-source offerings
Fish Audio S2 is the latest model developed by Fish Audio, designed to generate speech that sounds natural, authentic, and emotionally rich, rather than mechanical, flat, or confined to studio-style reading.
Fish Audio S2 focuses on everyday conversations, supporting native multi-speaker and multi-round generation. It also supports instruction control.
The S2 series includes multiple models. The open-source model is S2-Pro, which is the most powerful model in the series.
Please visit the Fish Audio website for a real-time experience.
| Model | Size | Availability | Description |
|---|---|---|---|
| S2-Pro | 4B Parameters | Hugging Face | Full-featured flagship model with the highest quality and stability |
| S2-Flash | - | fish.audio | Our closed-source model with faster speed and lower latency |
For more details on the models, please see the technical report.
Fish Audio S2 lets users control the delivery, paralinguistic cues, emotion, and other voice characteristics of each sentence with natural language, rather than relying on short tags that only vaguely steer the model's performance. This greatly improves the overall quality of the generated content.
Fish Audio S2 supports high-quality multilingual text-to-speech without the need for phonemes or language-specific preprocessing, including:
English, Chinese, Japanese, Korean, Arabic, German, French...
And more!
The list is constantly expanding; please check Fish Audio for the latest releases.
Fish Audio S2 allows users to upload reference audio containing multiple speakers, and the model captures each speaker's characteristics through the `<|speaker:i|>` token. You can then control the model's performance via speaker ID tokens, producing multiple speakers in a single generation, with no need to upload reference audio and generate speech for each speaker individually.
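The speaker-token mechanism above can be sketched as follows. This is a minimal illustration, assuming a line-based prompt in which each turn is tagged with a `<|speaker:i|>` token whose index refers to a speaker in the uploaded reference audio; `build_script` is a hypothetical helper, not part of the Fish Speech API, and the exact prompt format the model expects may differ.

```python
# Hypothetical sketch: tag each dialogue turn with a <|speaker:i|> token,
# where i indexes a speaker in the uploaded reference audio.
def build_script(turns):
    """Render (speaker_id, text) pairs as speaker-token-tagged lines."""
    return "\n".join(f"<|speaker:{i}|> {text}" for i, text in turns)

dialogue = [
    (0, "Did you catch the launch announcement?"),
    (1, "I did! The demo sounded remarkably natural."),
    (0, "Multi-speaker generation in a single pass is a big step."),
]

print(build_script(dialogue))
```

The whole tagged script is then submitted as one generation request, so both speakers appear in a single output instead of requiring one run per voice.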
Thanks to the model's expanded context, it can now use information from the preceding context to improve the expressiveness of subsequently generated content, thereby enhancing its naturalness.
Fish Audio S2 supports accurate voice cloning using short reference samples (typically 10-30 seconds). The model can capture timbre, speaking style, and emotional tendency, generating realistic and consistent cloned voices without additional fine-tuning.
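Since reference samples of roughly 10-30 seconds are recommended, a caller might validate a clip's duration before submitting it for cloning. This is a small sketch using only the Python standard library; `reference_duration_ok` is a hypothetical helper, not a Fish Speech function.

```python
import wave

# Hypothetical pre-flight check: the recommended reference length for
# voice cloning is roughly 10-30 seconds, so flag clips outside that
# window before uploading them.
def reference_duration_ok(path, lo=10.0, hi=30.0):
    """Return True if the WAV file at `path` lasts between lo and hi seconds."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
    return lo <= duration <= hi
```

A clip that fails the check can be trimmed or re-recorded, which is usually cheaper than discovering a poor clone after generation.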
```bibtex
@misc{fish-speech-v1.4,
      title={Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis},
      author={Shijia Liao and Yuxuan Wang and Tianyu Li and Yifan Cheng and Ruoyi Zhang and Rongzhi Zhou and Yijin Xing},
      year={2024},
      eprint={2411.01156},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2411.01156},
}
```