# Fish Speech **Documentation is under construction** [中文文档](README.zh.md) This codebase is released under BSD-3-Clause License, and all models are released under CC-BY-NC-SA-4.0 License. Please refer to [LICENSE](LICENSE) for more details. ## Disclaimer We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws. ## Requirements - GPU memory: 4GB (for inference), 24GB (for finetuning) - System: Linux (full functionality), Windows (inference only, flash-attn is not supported, torch.compile is not supported) Therefore, we strongly recommend to use WSL2 or docker to run the codebase for Windows users. ## Setup ```bash # Basic environment setup conda create -n fish-speech python=3.10 conda activate fish-speech conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia # Install flash-attn (for linux) pip3 install ninja && MAX_JOBS=4 pip3 install flash-attn --no-build-isolation # Install fish-speech pip3 install -e . ``` ## Inference (CLI) Download required `vqgan` and `text2semantic` model from our huggingface repo. ```bash TODO ``` Generate semantic tokens from text: ```bash python tools/llama/generate.py ``` You may want to use `--compile` to fuse cuda kernels faster inference (~25 tokens/sec -> ~300 tokens/sec). Generate vocals from semantic tokens: ```bash python tools/vqgan/inference.py -i codes_0.npy ``` ## Rust Data Server Since loading and shuffle the dataset is very slow and memory consuming, we use a rust server to load and shuffle the dataset. The server is based on GRPC and can be installed by ```bash cd data_server cargo build --release ``` ## Credits - [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2) - [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2) - [GPT VITS](https://github.com/innnky/gpt-vits) - [MQTTS](https://github.com/b04901014/MQTTS) - [GPT Fast](https://github.com/pytorch-labs/gpt-fast)