README.md

Fish Speech

Documentation is under construction; English is not fully supported yet.

Chinese documentation

This codebase is released under the BSD-3-Clause License, and all models are released under the CC-BY-NC-SA-4.0 License. Please refer to LICENSE for more details.

Disclaimer

We accept no responsibility for any illegal use of this codebase. Please consult your local laws regarding the DMCA and other applicable legislation.

Requirements

  • GPU memory: 2GB (for inference), 24GB (for finetuning)
  • System: Linux (full functionality), Windows (inference only, flash-attn is not supported, torch.compile is not supported)

We therefore strongly recommend that Windows users run the codebase under WSL2 or Docker.
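
The memory figures above can be checked before installing anything. The following is a minimal sketch and not part of the codebase: it assumes an NVIDIA GPU with `nvidia-smi` on the PATH, and the helper names, the GB-to-MiB conversion, and the 5% margin (nominal 24GB cards often report slightly less than 24576 MiB) are illustrative assumptions.

```python
import subprocess

# Thresholds from the requirements list, converted to MiB.
# Assumption: 1GB = 1024 MiB, with a 5% margin for cards that report
# slightly less than their nominal size.
MARGIN = 0.95
INFERENCE_MIB = int(2 * 1024 * MARGIN)
FINETUNE_MIB = int(24 * 1024 * MARGIN)


def parse_memory_totals(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output."""
    return [
        int(line.strip())
        for line in nvidia_smi_output.splitlines()
        if line.strip()
    ]


def supported_workloads(total_mib: int) -> list[str]:
    """Return which workloads the stated memory figures allow for one GPU."""
    workloads = []
    if total_mib >= INFERENCE_MIB:
        workloads.append("inference")
    if total_mib >= FINETUNE_MIB:
        workloads.append("finetuning")
    return workloads


def query_gpus() -> list[int]:
    """Ask nvidia-smi for the total memory of every visible GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_totals(result.stdout)
```

Calling `supported_workloads(total)` for each value returned by `query_gpus()` gives a rough answer per GPU; an empty list means the card is below the stated inference minimum.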

Setup

# Basic environment setup
conda create -n fish-speech python=3.10
conda activate fish-speech
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

# Install flash-attn (Linux only)
pip3 install ninja && MAX_JOBS=4 pip3 install flash-attn --no-build-isolation

# Install fish-speech
pip3 install -e .

Inference (CLI)

Download the required vqgan and text2semantic models from our Hugging Face repo.

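# Create the target directory, then fetch the files via the "resolve" endpoint,
# which returns the file contents rather than an HTML page or a Git LFS pointer
mkdir -p checkpoints
wget https://huggingface.co/fishaudio/speech-lm-v1/resolve/main/vqgan-v1.pth -O checkpoints/vqgan-v1.pth
wget https://huggingface.co/fishaudio/speech-lm-v1/resolve/main/text2semantic-400m-v0.1-4k.pth -O checkpoints/text2semantic-400m-v0.1-4k.pth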
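
Where `wget` is unavailable, the same download can be sketched with the Python standard library. Hugging Face serves file contents from the `resolve` path of a repository. The helper names below are hypothetical, the repository and file names are taken from the commands above, and the sketch assumes the repository is publicly accessible without authentication.

```python
import urllib.request
from pathlib import Path

# Repository and file names as they appear in the download commands.
REPO = "fishaudio/speech-lm-v1"
FILES = ["vqgan-v1.pth", "text2semantic-400m-v0.1-4k.pth"]


def resolve_url(repo: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for one file in a Hugging Face repo."""
    return f"https://huggingface.co/{repo}/resolve/{revision}/{filename}"


def download_checkpoints(target_dir: str = "checkpoints") -> list[Path]:
    """Download every file in FILES into target_dir, skipping existing ones."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in FILES:
        destination = target / name
        if not destination.exists():
            urllib.request.urlretrieve(resolve_url(REPO, name), destination)
        paths.append(destination)
    return paths
```

Running `download_checkpoints()` from the repository root places both files under `checkpoints/`.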

Generate semantic tokens from text:

python tools/llama/generate.py \
    --text "Hello" \
    --num-samples 2 \
    --compile

You may want to use --compile to fuse CUDA kernels for faster inference (~25 tokens/sec -> ~300 tokens/sec).
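
Because torch.compile is listed as unsupported on Windows, a wrapper script may want to add --compile only where it is known to work. The following is a minimal sketch: `build_generate_command` is a hypothetical helper, not part of the codebase, and it uses only the flags shown in the command above. It conservatively enables --compile on Linux only, which is the one system described as having full functionality.

```python
import platform
import sys
from typing import Optional


def build_generate_command(
    text: str, num_samples: int = 2, system: Optional[str] = None
) -> list[str]:
    """Assemble the tools/llama/generate.py command line for the given OS."""
    # Fall back to the OS this interpreter is running on.
    system = system or platform.system()
    command = [
        sys.executable,
        "tools/llama/generate.py",
        "--text", text,
        "--num-samples", str(num_samples),
    ]
    # Only request kernel fusion on Linux, the system with full functionality.
    if system == "Linux":
        command.append("--compile")
    return command
```

From the repository root, `subprocess.run(build_generate_command("Hello"), check=True)` would then launch generation. Under WSL2, `platform.system()` reports `Linux`, so --compile is included there.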

Generate vocals from semantic tokens:

python tools/vqgan/inference.py -i codes_0.npy

Rust Data Server

Because loading and shuffling the dataset is slow and memory-intensive, we use a Rust server to load and shuffle it. The server is based on gRPC and can be built with:

cd data_server
cargo build --release

Credits