Persian Tacotron2

Persian Tacotron2 is a customized implementation of Tacotron2, adapted for Persian text-to-speech (TTS) synthesis. Tacotron2 is a sequence-to-sequence model that converts text into mel-spectrograms, which a vocoder then turns into audio. This implementation builds on NVIDIA's Tacotron2, with adjustments for Persian phoneme-based data.

Modifications for Persian Language

To adapt Tacotron2 for Persian, the following changes were made:

Data Preparation: Training data consists of audio files paired with phoneme sequences. Phonemes are used instead of raw text because standard Persian script omits short vowels, which makes pronunciation ambiguous.
Cleaner Modification: Edited cleaners.py in tacotron2/text/ to handle Persian phonemes.
Hyperparameter Adjustment: Customized hparams.py in tacotron2/ for Persian language data.
Data File Creation: Added create_data_file.py, which formats the data into the text files used for training.
Testing Script: Added get_results.py for testing the model on specific phoneme sequences.
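
As an illustration of the cleaner change, a cleaner for phoneme input can be very small, because phoneme sequences need no transliteration or abbreviation expansion. The sketch below is an assumption and not this repository's code: the function name persian_phoneme_cleaners is hypothetical, and it only normalizes whitespace so that phoneme symbols pass through unchanged.

```python
import re

# Hypothetical cleaner for phoneme input. The name is illustrative and is
# not taken from this repository.
_whitespace_re = re.compile(r"\s+")


def persian_phoneme_cleaners(text):
    """Collapse runs of whitespace and trim both ends, leaving symbols untouched."""
    return _whitespace_re.sub(" ", text).strip()


print(persian_phoneme_cleaners("  s a   l A m \n"))
```

In NVIDIA's Tacotron2 the cleaner is selected by name through text_cleaners in hparams.py, and every symbol the model should see also has to be present in the symbol set (text/symbols.py there), so a change like this usually goes together with an updated symbol list.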

How to Use

Setup

Clone the Repository

git clone https://github.com/your_username/persian_tacotron.git
cd persian_tacotron

Install Requirements

pip install -r tacotron2/requirements.txt

Prepare Your Data

- Place audio files in files/wavs
- Add phoneme transcriptions in files/phoneme_transcriptions.txt

Create Data Files

Run the data preparation script:
```
python create_data_file.py
```
This will generate text files in files/text_files/. Move these files to tacotron2/filelists/ for training.
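
For orientation, NVIDIA's Tacotron2 reads filelists in which each line is an audio path and its transcript separated by a pipe character. The sketch below shows one way such lists could be built and split into training and validation sets. It is not the contents of create_data_file.py: the function name, the dictionary input, the validation fraction, and the example phoneme strings are all assumptions made for illustration.

```python
import random


def build_filelists(transcriptions, wav_dir="files/wavs", val_fraction=0.1, seed=0):
    """Build 'path|phonemes' lines and split them into (train, val) lists.

    transcriptions: dict mapping a wav file name to its phoneme string
    (an assumed input shape, not the format of phoneme_transcriptions.txt).
    """
    lines = [
        f"{wav_dir}/{name}|{phonemes}"
        for name, phonemes in sorted(transcriptions.items())
    ]
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(lines)
    n_val = max(1, int(len(lines) * val_fraction))
    return lines[n_val:], lines[:n_val]


# Made-up entries, for demonstration only.
demo = {
    "0001.wav": "s a l A m",
    "0002.wav": "x o d A h A f e z",
    "0003.wav": "m o t e S a k k e r a m",
    "0004.wav": "b a l e",
}
train_lines, val_lines = build_filelists(demo, val_fraction=0.25)
print(len(train_lines), len(val_lines))
```

Writing each list to its own text file, one line per entry, gives files that can be referenced from training_files and validation_files.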

Configure Hyperparameters

Modify hparams.py in tacotron2/ to set parameters such as epochs, iters_per_checkpoint, and the training_files and validation_files paths.
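
The fields named above can be pictured as a small set of overrides. The sketch below is only an illustration: the file names are hypothetical, the numbers are placeholders to tune for your dataset and GPU memory, and the SimpleNamespace object stands in for the real hyperparameter object so that the snippet runs on its own.

```python
from types import SimpleNamespace

# Illustrative values only. The two file names are hypothetical.
persian_overrides = {
    "epochs": 400,
    "iters_per_checkpoint": 1000,
    "training_files": "filelists/persian_train.txt",
    "validation_files": "filelists/persian_val.txt",
    "batch_size": 16,
}


def apply_overrides(hparams, overrides):
    """Set each override on an hparams-like object, rejecting unknown names."""
    for name, value in overrides.items():
        if not hasattr(hparams, name):
            raise AttributeError(f"unknown hyperparameter: {name}")
        setattr(hparams, name, value)
    return hparams


# Stand-in for the real hyperparameter object; its values are placeholders.
defaults = SimpleNamespace(
    epochs=500,
    iters_per_checkpoint=1000,
    training_files="",
    validation_files="",
    batch_size=64,
)
hparams = apply_overrides(defaults, persian_overrides)
print(hparams.epochs, hparams.batch_size)
```

Rejecting unknown names is a deliberate choice here: a misspelled key such as "epoch" fails immediately instead of being silently ignored during a long training run.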

Training

Start Training

Begin training with:
```
python tacotron2/train.py --output_directory=outdir --log_directory=logdir
```
Checkpoints are saved in tacotron2/outdir/ every iters_per_checkpoint iterations. One iteration processes one batch, so with 1000 audio files and a batch size of 16, each epoch runs about 1000/16 ≈ 62 iterations. If you run out of GPU memory, reduce batch_size in hparams.py.
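
The iteration arithmetic above can be checked with a few lines of Python. This is a sketch, and the function name is hypothetical. Whether an incomplete final batch is dropped depends on how the data loader is configured, so both cases are shown.

```python
def iterations_per_epoch(num_files, batch_size, drop_last=True):
    """Number of batches in one pass over the training set."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_files // batch_size
    # Ceiling division: the incomplete final batch counts as one more step.
    return -(-num_files // batch_size)


print(iterations_per_epoch(1000, 16))                   # final partial batch dropped
print(iterations_per_epoch(1000, 16, drop_last=False))  # final partial batch kept
```

Dividing a checkpoint's iteration number by this value gives a rough estimate of how many epochs that checkpoint has seen.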

Test the Model

In get_results.py, set text to the phoneme sequence you want to synthesize (text = "YOUR_TEST_PHONEME"), then run inference with the latest checkpoint. For example:
```
python get_results.py 32000
```
Outputs (mel-spectrograms and audio files) will be saved in results/.
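
To find which checkpoint is the latest, a small helper can scan the checkpoint directory. The sketch below assumes checkpoints are named checkpoint_<iteration>, which is the naming NVIDIA's train.py uses, and that the number passed to get_results.py is that iteration. Both points are assumptions about this repository, and the function name is hypothetical.

```python
import re


def latest_checkpoint_iteration(file_names):
    """Return the largest N among names of the form 'checkpoint_N', or None."""
    iterations = []
    for name in file_names:
        match = re.fullmatch(r"checkpoint_(\d+)", name)
        if match:
            iterations.append(int(match.group(1)))
    return max(iterations, default=None)


# Made-up directory listing, for demonstration only.
example_names = ["checkpoint_1000", "checkpoint_32000", "notes.txt"]
print(latest_checkpoint_iteration(example_names))
```

In practice the list of names would come from os.listdir on the checkpoint directory.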

Results

Training the model on 2500 audio files for 400 epochs produced the following results: