- use SNAC as the speech audio codec
- train GPT-2 to predict audio tokens autoregressively
- clone repo
- update submodules
git submodule update --init --recursive
- download SNAC model using git lfs
cd model/codec/snac/snac_24khz
git lfs pull
- install dependencies
pip install -r requirements.txt
- data
- add normalizer & tokenizer
- add interface audio -> codes
- data loader to prepare data (audio, text, snac codes)
- get & transform test dataset (500h)
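
A rough sketch of what the audio -> codes interface could hand to the sequence model. SNAC is a multi-scale codec, so the coarse codebook level has fewer tokens per second than the finer ones. The sketch assumes three levels with a 1:2:4 length ratio (believed to match snac_24khz, but not verified against this repo) and interleaves them frame by frame into one flat token stream. The function names `flatten_codes` and `unflatten_codes` are hypothetical, and the ordering is just one simple choice, not necessarily the one used here.

```python
from typing import List, Sequence


def flatten_codes(levels: Sequence[Sequence[int]]) -> List[int]:
    """Interleave multi-rate codebook levels into one token stream.

    Assumption: levels[i] holds len(levels[0]) * 2**i codes, e.g. a
    1:2:4 ratio for three levels. Each coarse frame then contributes
    1 + 2 + 4 = 7 tokens, ordered coarse to fine.
    """
    n_frames = len(levels[0])
    for i, level in enumerate(levels):
        if len(level) != n_frames * 2**i:
            raise ValueError(f"level {i} has unexpected length {len(level)}")

    flat: List[int] = []
    for t in range(n_frames):
        for i, level in enumerate(levels):
            width = 2**i  # number of codes of this level per coarse frame
            flat.extend(level[t * width:(t + 1) * width])
    return flat


def unflatten_codes(flat: Sequence[int], n_levels: int = 3) -> List[List[int]]:
    """Inverse of flatten_codes: split the stream back into levels."""
    frame_size = 2**n_levels - 1  # 7 tokens per frame for three levels
    if len(flat) % frame_size != 0:
        raise ValueError("stream length is not a multiple of the frame size")

    levels: List[List[int]] = [[] for _ in range(n_levels)]
    for start in range(0, len(flat), frame_size):
        pos = start
        for i in range(n_levels):
            width = 2**i
            levels[i].extend(flat[pos:pos + width])
            pos += width
    return levels
```

Frame-major ordering keeps all tokens of one coarse frame next to each other, which suits auto-regressive prediction; the inverse is needed at synthesis time to feed predicted tokens back to the codec decoder.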
- model
- add SNAC model
- add GPT-2 model
- interface for combining text & audio code as input to sequence model
- add prediction layer (classification heads after sequence model)
- interface for dealing with output from sequence model (loss computation)
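
A minimal sketch of how the text/audio input interface and the loss side could fit together, assuming a single shared vocabulary rather than separate per-codebook heads. Everything here is hypothetical: the byte-level text vocabulary of 256, the codebook size of 4096, the special tokens, and the function names. Text ids come first, then a separator, then audio codes shifted by an offset so they do not collide with text ids. The mask marks the positions that should count toward the loss.

```python
from typing import List, Sequence, Tuple

TEXT_VOCAB = 256               # hypothetical: byte-level text tokens
CODEBOOK_SIZE = 4096           # assumed codec codebook size
BOS_AUDIO = TEXT_VOCAB         # hypothetical separator between text and audio
EOS = TEXT_VOCAB + 1           # hypothetical end-of-sequence token
AUDIO_OFFSET = TEXT_VOCAB + 2  # first id reserved for audio codes


def build_sequence(text_ids: Sequence[int],
                   audio_codes: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Concatenate text and audio into one id sequence plus a loss mask.

    The mask is 1 for audio codes and EOS (tokens the model should learn
    to predict) and 0 for the text prompt and the separator.
    """
    if any(not 0 <= c < CODEBOOK_SIZE for c in audio_codes):
        raise ValueError("audio code outside codebook range")
    audio_ids = [AUDIO_OFFSET + c for c in audio_codes]
    input_ids = list(text_ids) + [BOS_AUDIO] + audio_ids + [EOS]
    loss_mask = [0] * (len(text_ids) + 1) + [1] * (len(audio_ids) + 1)
    return input_ids, loss_mask


def shift_for_next_token(input_ids: Sequence[int], loss_mask: Sequence[int]
                         ) -> Tuple[List[int], List[int], List[int]]:
    """Build next-token inputs, targets, and a target-aligned mask."""
    inputs = list(input_ids[:-1])
    targets = list(input_ids[1:])
    mask = list(loss_mask[1:])  # mask follows the targets, not the inputs
    return inputs, targets, mask
```

With per-level classification heads, the offset step would be dropped and each head would predict raw codes in `[0, CODEBOOK_SIZE)` for its own level; the masking idea stays the same.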
- train loop
- api & app to synthesize speech
- use RWKV instead of GPT-2: https://github.com/BlinkDL/RWKV-LM
- add speaker style conditioning