Question about Fine-Tuning Script for Improving Own Language Quality
Closed this issue · 8 comments
Dear Chatterbox Team,
Thank you very much for releasing such an impressive and advanced open-source TTS model. It is exciting to see the wide range of supported languages.
I wanted to ask if there is a possibility or plan to provide a fine-tuning script that would allow users to improve the quality of the voice model for their own language. Currently, the output sometimes has accent issues and could benefit greatly from additional fine-tuning on targeted datasets.
Is there any script available for fine-tuning or release plans for such a feature in the near future? It would be very helpful for the community to customize and enhance voice quality.
Thank you for your time and consideration. Looking forward to your response.
You can check this repo -> https://github.com/JarodMica/chatterbox
I'm familiar with this repository, but I'm unsure how to load the pre-trained models for my language. Specifically, I want to use multilingual.safetensors instead of just cfg 3. If it loads the original models instead of the multilingual ones, there would be no point. I simply want to fine-tune my existing model for my language...
@Juranik hope it helps https://github.com/99eren99/chatterbox-multilingual-finetuning
Is this multilingual training? It looks like single-language training. Am I missing something?
```python
from pathlib import Path

with open(metadata_path, "r", encoding="utf-8") as f:
    for line_idx, line in enumerate(f):
        # Accept either "audio|text" or "audio<TAB>text" rows
        parts = line.strip().split("|")
        if len(parts) != 2:
            parts = line.strip().split("\t")
        if len(parts) == 2:
            audio_file, text = parts
            audio_path = (
                Path(audio_file)
                if Path(audio_file).is_absolute()
                else dataset_root / audio_file
            )
            if audio_path.exists():
                all_files.append({"audio": str(audio_path), "text": text})
```
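For reference, the loader above expects one `audio|text` (or tab-separated) pair per line, skipping rows whose audio file is missing. A minimal self-contained check of that behavior (the function wrapper and sample file names are illustrative, not from the repo):

```python
from pathlib import Path
import tempfile

def load_metadata(metadata_path, dataset_root):
    """Parse "audio|text" or tab-separated rows, keeping only existing files."""
    all_files = []
    with open(metadata_path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("|")
            if len(parts) != 2:
                parts = line.strip().split("\t")
            if len(parts) == 2:
                audio_file, text = parts
                audio_path = (
                    Path(audio_file)
                    if Path(audio_file).is_absolute()
                    else dataset_root / audio_file
                )
                if audio_path.exists():
                    all_files.append({"audio": str(audio_path), "text": text})
    return all_files

# Throwaway dataset: one existing clip, one missing clip
root = Path(tempfile.mkdtemp())
(root / "clip1.wav").touch()
meta = root / "metadata.txt"
meta.write_text("clip1.wav|hello world\nmissing.wav|dropped line\n", encoding="utf-8")

files = load_metadata(meta, root)
print(files)  # only the clip1.wav row survives
```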
it's monolingual finetuning for the multilingual model
The change is here:

```python
raw_text_tokens = self.text_tokenizer.text_to_tokens(
    normalized_text, language_id=self.data_args.language
).squeeze(0)
```
In addition, I replaced the Chatterbox source files with the ones from the multilingual Chatterbox Hugging Face Space and added a custom `from_checkpoint` method for loading the model from a checkpoint directory.
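For anyone wondering what such a loader could look like: the sketch below is a hedged stand-in, not the actual Chatterbox API. The class name, checkpoint file names, and constructor are all assumptions; adapt them to the repo's real model class.

```python
from pathlib import Path
import tempfile

class TTSModelStub:
    """Stand-in for the real multilingual TTS model class (assumed API)."""

    def __init__(self, weights_path, tokenizer_path):
        self.weights_path = weights_path
        self.tokenizer_path = tokenizer_path

    @classmethod
    def from_checkpoint(cls, ckpt_dir):
        """Load model weights and tokenizer from a checkpoint directory."""
        ckpt_dir = Path(ckpt_dir)
        weights = ckpt_dir / "model.safetensors"  # assumed file name
        tokenizer = ckpt_dir / "tokenizer.json"   # assumed file name
        for p in (weights, tokenizer):
            if not p.exists():
                raise FileNotFoundError(f"missing checkpoint file: {p}")
        return cls(str(weights), str(tokenizer))

# Demo with a throwaway checkpoint directory
ckpt = Path(tempfile.mkdtemp())
(ckpt / "model.safetensors").touch()
(ckpt / "tokenizer.json").touch()
model = TTSModelStub.from_checkpoint(ckpt)
```

The point of a classmethod constructor like this is that a training script can save everything into one directory and the inference side only needs that one path.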
VERY BIG THANK YOU sincerely for your help!!!
Could you please clarify the correct way to use a custom tokenizer during training and generation?
Specifically:
If I have my own custom or slightly modified tokenizer, is it necessary to integrate and use it during the actual training process, or could I swap it in for the default tokenizer only after training has finished and still get correct speech generation? In other words, do I need to fine-tune the model together with my tokenizer to ensure stable outputs, or will swapping in a new tokenizer after fine-tuning produce valid results anyway?
Thank you for your insights and explanation!
@Juranik Unless you have thousands of hours of audio data or you're working with a different alphabet that isn't included in the default grapheme-style tokenizer, I don't think changing the tokenizer will end well. Modifying the tokenizer puts you somewhere between training from scratch and fine-tuning.
To change the tokenizer for Chatterbox Multilingual:
- Your tokenizer must support voice token IDs -> s3 tokenizer.
- You'll need to modify and verify all tokenizer-related parts in the repository -> just search for the term "tokenizer".
If you do have thousands of hours of voice data, a better option might be training an Orpheus-type model, since it operates at a lower level of abstraction. Lower abstraction makes it easier to manipulate the tokenizer.
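Before modifying the tokenizer at all, a quick sanity check is to see which characters of your corpus a grapheme-style vocabulary fails to cover; if the answer is "none", swapping tokenizers likely isn't worth the risk. The helper and vocabulary below are illustrative stand-ins, not the Chatterbox tokenizer API:

```python
def uncovered_chars(corpus_lines, vocab):
    """Return the sorted non-whitespace characters missing from the vocab."""
    seen = set()
    for line in corpus_lines:
        seen.update(line.lower())
    return sorted(c for c in seen if c not in vocab and not c.isspace())

# Toy grapheme vocabulary (ASCII letters and basic punctuation)
vocab = set("abcdefghijklmnopqrstuvwxyz.,!?'")
corpus = ["Merhaba dünya!", "Çok teşekkürler."]
print(uncovered_chars(corpus, vocab))  # the Turkish-specific letters fall out
```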
Thank you very, very much, brother. GOD BLESS YOU!