Persian Tacotron2 is a customized implementation of Tacotron2, adapted for Persian text-to-speech (TTS) synthesis. Tacotron2 is a model that converts text into mel-spectrograms, which can then be synthesized into audio. This implementation builds upon NVIDIA's Tacotron2 with adjustments for Persian phoneme-based data.
To adapt Tacotron2 for Persian, the following changes were made:
- Data Preparation: Persian data is organized into audio files and corresponding phoneme sequences. Using phonemes sidesteps the ambiguity of Persian script, which usually leaves short vowels unwritten.
- Cleaner Modification: Edited cleaner.py in tacotron2/text/ to handle Persian phonemes.
- Hyperparameter Adjustment: Customized hparams.py in tacotron2/ for Persian language data.
- Data File Creation: Created a script to format data into text files for training.
- Testing Script: Added a script for testing the model on specific phoneme sequences.
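A cleaner for phoneme input can be much simpler than one for raw text, since there are no numbers or abbreviations to expand. The sketch below only illustrates that idea and is not the code in this repository: the function name `persian_phoneme_cleaner` is hypothetical, and it assumes phonemes are written as space-separated symbols whose case may be significant, so nothing is lowercased.

```python
import re

# Matches any run of whitespace (spaces, tabs, newlines).
_whitespace_re = re.compile(r"\s+")


def persian_phoneme_cleaner(text: str) -> str:
    """Hypothetical cleaner: collapse whitespace runs and trim the ends.

    Assumes the input is already a phoneme sequence, so no lowercasing,
    number expansion, or transliteration is applied.
    """
    return _whitespace_re.sub(" ", text).strip()


# Example call on a made-up phoneme string with stray whitespace.
cleaned = persian_phoneme_cleaner("  s a l A m \n")
```

In NVIDIA's Tacotron2, cleaners are selected by name through the text_cleaners hyperparameter, so a function like this would also need to be listed there.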
- Clone the Repository
git clone https://github.com/your_username/persian_tacotron.git
cd persian_tacotron
- Install Requirements
pip install -r tacotron2/requirements.txt
- Prepare Your Data
- Place audio files in files/wavs
- Add phoneme transcriptions in files/phoneme_transcriptions.txt
- Create Data Files
Run the data preparation script:
python create_data_file.py
This will generate text files in files/text_files/. Move these files to tacotron2/filelists/ for training.
- Configure Hyperparameters
Modify hparams.py in tacotron2/ to set parameters such as epochs, iters_per_checkpoint, and the training_files and validation_files paths.
- Start Training
Begin training with:
python tacotron2/train.py --output_directory=outdir --log_directory=logdir
Checkpoints are saved in tacotron2/outdir/. For instance, with 1000 audio files and a batch size of 16, each epoch runs roughly 62 iterations (1000/16 = 62.5). If you encounter memory issues, reduce batch_size in hparams.py.
- Test the Model
Update get_results.py with the phoneme sequence you’d like to test (text = "YOUR_TEST_PHONEME"), then run inference with the latest checkpoint. For example:
python get_results.py 32000
Outputs (mel-spectrograms and audio files) will be saved in results/.
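The data-file step can be pictured as follows. NVIDIA's Tacotron2 reads filelists with one `wav_path|text` entry per line; everything else in this sketch is an assumption made for illustration: the function name `build_filelists`, the in-memory `transcripts` mapping (the actual layout of files/phoneme_transcriptions.txt is not shown here), and the 10% validation split. It is not the repository's create_data_file.py.

```python
import random
from pathlib import PurePosixPath


def build_filelists(transcripts, wav_dir="files/wavs", val_fraction=0.1, seed=0):
    """Return (train_lines, val_lines) in 'wav_path|phonemes' format.

    transcripts: dict mapping a wav file stem to its phoneme string
    (an assumed layout, chosen only for this example).
    """
    lines = [
        f"{PurePosixPath(wav_dir) / (stem + '.wav')}|{phonemes}"
        for stem, phonemes in sorted(transcripts.items())
    ]
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(lines)
    # Hold out at least one line for validation.
    n_val = max(1, int(len(lines) * val_fraction))
    return lines[n_val:], lines[:n_val]


# Ten made-up utterances: stems utt00..utt09 with placeholder phonemes.
example = {f"utt{i:02d}": "s a l A m" for i in range(10)}
train_lines, val_lines = build_filelists(example)
```

Writing each list to its own text file under tacotron2/filelists/, one entry per line, and pointing training_files and validation_files at those files would complete the step.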
Training the model on 2500 audio files for 400 epochs produced the following results:
Click [here](https://github.com/majidAdibian77/persian_tacotron/tree/master/result/wavs/) for sample audio results.
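To relate epoch counts to checkpoint numbers such as the 32000 passed to get_results.py, the iteration arithmetic from the training step can be written out. This is a rough sketch under stated assumptions: it assumes the data loader drops the final incomplete batch (pass drop_last=False if yours keeps it), and the batch size of 16 is borrowed from the 1000-file example, since the batch size used for the 2500-file run is not stated.

```python
def iterations_per_epoch(num_files: int, batch_size: int, drop_last: bool = True) -> int:
    """Number of training steps in one pass over the data."""
    if drop_last:
        return num_files // batch_size   # incomplete last batch is skipped
    return -(-num_files // batch_size)   # ceiling division keeps it


def total_iterations(num_files: int, batch_size: int, epochs: int) -> int:
    """Steps after `epochs` full passes, assuming drop_last=True."""
    return iterations_per_epoch(num_files, batch_size) * epochs


small_run = iterations_per_epoch(1000, 16)   # 62
long_run = total_iterations(2500, 16, 400)   # 156 * 400 = 62400
```

Under these assumptions, checkpoint 32000 would fall shortly after epoch 205 of such a run (32000 / 156 ≈ 205.1).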