"character limit of 273 for language 'fr'" error
Vodou4460 opened this issue · 14 comments
Dear aedocw,
I've been using your epub2tts script with CoquiTTS_XTTS for French language processing and encountered a couple of issues. Specifically, I frequently ran into the "character limit of 273 for language 'fr'" error and faced problems with empty data. These seemed to stem from processing text segments that were too lengthy for the XTTS system.
To address these, I experimented with two main modifications:
1. Modification of the combine_sentences function in epub2tts.py:
Rather than combining sentences into longer segments, I tweaked this function to yield each sentence individually. This approach helps in managing the character limit more effectively. Here is the adjusted function:
def combine_sentences(self, sentences, length=1000):
    for sentence in sentences:
        yield sentence
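For context, the stock function batches sentences into chunks of up to length characters, roughly like this sketch (my reconstruction, not the exact upstream code):

def combine_sentences_batched(sentences, length=1000):
    # Accumulate sentences until adding the next one would exceed `length`
    combined = ""
    for sentence in sentences:
        if combined and len(combined) + len(sentence) + 1 > length:
            yield combined
            combined = sentence
        else:
            combined = f"{combined} {sentence}".strip()
    if combined:
        yield combined

Note that a single sentence longer than the limit still passes through whole, which is why the preprocessing below also shortens long sentences.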
2. Preprocessing the Text with a Custom Function:
Additionally, I crafted a set of separate functions for text preparation. They use regular expressions to split the text into sentences and then shorten each sentence to fit within the character limit, and they also handle replacing certain characters for text cleanup.
import re
import datetime

def split_sentences(text):
    # Use a regular expression to find split points
    parts = re.split(r'(?<![A-Z])([\.|\?|\!])\s', text)
    # Reconstruct sentences with their punctuation characters
    sentences = []
    for i in range(0, len(parts)-1, 2):
        sentences.append(parts[i] + parts[i+1])
    print(f"Number of detected sentences: {len(sentences)}")
    return sentences

def shorten_sentences(sentences, max_length):
    new_sentences = []
    for sentence in sentences:
        while len(sentence) > max_length:
            # Find the last punctuation mark before the limit
            cut_point = max(sentence.rfind(',', 0, max_length), sentence.rfind(';', 0, max_length))
            # If no punctuation mark is found, look for a space
            if cut_point <= 0:
                cut_point = sentence.rfind(' ', 0, max_length)
            if cut_point > 0:
                new_sentences.append(sentence[:cut_point+1].strip() + '.')
                sentence = sentence[cut_point+1:].strip()
            else:
                new_sentences.append(sentence[:max_length].strip() + '.')
                sentence = sentence[max_length:].strip()
        new_sentences.append(sentence.strip() + '.' if not sentence.endswith('.') else sentence.strip())
    print(f"Total number of sentences after shortening: {len(new_sentences)}")
    return new_sentences

def replace_characters(text, replace_chars, new_chars):
    for old, new in zip(replace_chars, new_chars):
        text = text.replace(old, new)
    return text

def save_to_file(lines, original_filename, max_length):
    now = datetime.datetime.now()
    timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    basename = original_filename.split('.')[0]
    ext = original_filename.split('.')[1]
    new_filename = f"{timestamp}_{basename}_split_{max_length}.{ext}"
    with open(new_filename, 'w', encoding='utf-8') as new_file:
        for line in lines:
            new_file.write(line + '\n')
            print(f"Writing sentence (length {len(line)}): {line[:50]}...")
    return new_filename

def split_and_save_text_v9(original_filename, max_length=300):
    # Read as UTF-8 so accented French characters survive on any platform
    with open(original_filename, 'r', encoding='utf-8') as file:
        text = file.read()
    sentences = split_sentences(text)
    short_sentences = shorten_sentences(sentences, max_length)
    modified_text = [replace_characters(sentence, ["- ", "\n\n"], ["", ".\n\n"]) for sentence in short_sentences]
    return save_to_file(modified_text, original_filename, max_length)
Functionality Explanation:
- split_sentences: splits a given text into sentences using a regular expression. It looks for punctuation marks like ".", "?", or "!" followed by whitespace, and requires that they not be preceded by a capital letter (to avoid splitting at abbreviations).
- shorten_sentences: shortens sentences to a specified maximum length. If a sentence is longer than the maximum, it looks for a suitable split point, preferably at a comma or semicolon, or else at a space. Each new sentence is ended with a period.
- replace_characters: replaces specified characters in the text. It's useful for cleaning up the text or ensuring consistent formatting.
- save_to_file: saves the modified sentences to a new file whose name includes a timestamp for easy identification. It prints part of each sentence as it is saved, as a progress update.
- split_and_save_text_v9: the main function that orchestrates the process. It reads the text from a file, splits it into sentences, shortens them if necessary, replaces certain characters, and saves the result to a new file. The maximum sentence length can be specified, with a default of 300 characters.
# Use this function to process your file
new_filename = split_and_save_text_v9("psychotherapie-de-la-dissociation-et-du-trauma.txt")
print(f"New file created: {new_filename}")
This preprocessing ensures that each sentence passed to the combine_sentences function stays within the character limits imposed by CoquiTTS_XTTS, greatly reducing errors and improving the text-to-speech process for French.
While my solution is not perfect and is admittedly a bit of a "bricolage," I wanted to share it with you. I believe you might find a much better solution, and I am eager to see how this can be improved further.
Best regards,
At a quick glance, this looks really great! I will take a more careful look through all of this, but it looks like it really could help when the source is a text file. Detecting sentences reliably is not easy; that's why I ended up using NLTK (the Natural Language Toolkit). I like your approach though, and will play around with it some. I would also really like to figure out how to make a good guess at separating text files into chapters so they get useful "part" splits. Other than trying to match on CHAPTER ##, I am not really sure how else to approach it (and obviously that won't work if the text file doesn't have the explicit word "chapter" at the start of each chapter).
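For reference, the NLTK route is only a few lines; a minimal sketch, assuming the punkt models are downloaded and a file named book.txt (both just illustrative):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt', quiet=True)  # one-time download of the sentence tokenizer models
with open("book.txt", encoding="utf-8") as f:
    text = f.read()
sentences = sent_tokenize(text, language="french")  # punkt also ships models for French, Spanish, etc.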
Thank you again for using this and helping to make it better, I really appreciate it!
Thank you for your enthusiastic response. I understand the appeal of using NLTK for natural language processing, but I've encountered some challenges with this method in my files. Therefore, I propose an alternative that I believe could be more adaptable and less complex.
My idea is to convert EPUB files into text, CSV, or Markdown files. These formats are easily editable and allow chapter markers to be inserted manually or semi-automatically. The method would involve searching for specific titles or formats and replacing them with predefined tags inspired by the Markdown format. For example, "Chapter 1" could be replaced with "# Chapter 1" to clearly indicate the start of a new chapter, as sketched below.
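As a rough sketch of that tagging step (the heading patterns below are just examples and would need adapting to each book):

import re

def tag_chapters(text):
    # Prefix lines that look like chapter headings with a Markdown '#'
    # so that downstream tools can use them as part/chapter markers.
    heading = re.compile(r'^(chapter|chapitre|cap[ií]tulo)\s+\S+', re.IGNORECASE)
    tagged = []
    for line in text.splitlines():
        if heading.match(line.strip()):
            tagged.append('# ' + line.strip())
        else:
            tagged.append(line)
    return '\n'.join(tagged)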
Following our conversation, I've also been thinking about integrating automated preprocessing into the main script. The idea would be to allow maximum flexibility: those looking for a quick and direct solution could opt for the integrated automation, while those who wish to further customize the file could use a separate script for manual preprocessing.
I believe this semi-automatic approach, combined with the option of automated or manual preprocessing, offers significant flexibility, particularly in adapting to different authors' styles and languages. It also allows for manual intervention for those who wish to further customize the structure of their files.
I hope these proposals will be useful for the project, and I am open to any collaboration to further develop these ideas.
Hello @Vodou4460
I've been playing with your script and it works really great. I've made one change to have a more "human" reading experience.
I added a silence after each sentence, because there was no pause between each one, and it sounded strange compared to the "natural" pauses in the sentences.
Because I'm using XTTS, I changed two lines in epub2tts.py, in the read_chunk_xtts function:
- Line 281: changed if i < len(sentence_list)-1: to if i < len(sentence_list):, so the silence is also applied after the last sentence.
- Line 282: changed the multiplier for the silence duration from 1.0 to 0.6 (1 second sounds too long to me).
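Conceptually, the change amounts to something like this sketch (using pydub to illustrate the idea; this is not the actual read_chunk_xtts code, and epub2tts may assemble audio differently):

from pydub import AudioSegment

def append_with_pause(audio_so_far, sentence_audio, pause_seconds=0.6):
    # Append the sentence audio followed by a short silence,
    # including after the final sentence of the chunk.
    pause = AudioSegment.silent(duration=int(pause_seconds * 1000))  # milliseconds
    return audio_so_far + sentence_audio + pause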
Excuse me if this can be done in a better place or in a more elegant way. I'm not a Python programmer.
If you keep improving your function, please tell us!
@aedocw please consider integrating the "split sentences" approach into your app. This combination works really great!
I pushed up a branch that incorporates this suggestion, and it works well on a very small sample I tried. I'm going to try it with a full book before merging, but I think this is a nice improvement and I'm glad you suggested it!
Thank you very much. I've been busy playing with this, and learning some Python to try to understand how things work and how to make it work better. I'm now testing this code (please excuse my coding). Tomorrow I'll try some more things, like quotes, double quotes, and dialogue punctuation marks (-, .-) ... for now I have this (using @Vodou4460's code):
import re
import datetime
import fire

def reformat_line(line):
    line = line.strip()
    if not line.endswith("."):
        line += "."
    return line

def split_sentence(line):
    # Use a regular expression to find split points
    parts = re.split(r'(?<![A-Z])([\.|\?|\!])\s', line)
    # Reconstruct sentences with their punctuation characters
    sentences = []
    # This is a mess, but it is the only way I've found to make it work
    if len(parts) > 1:
        for i in range(0, len(parts)-1, 2):
            sentences.append((parts[i] + parts[i+1]).strip())
        sentences.append((parts[len(parts)-1]).strip())
    else:
        sentences.append(parts[0].strip())
    return sentences

def shorten_sentence(sentence, max_length):
    sentences = []
    while len(sentence) > max_length:
        # Find "secondary" punctuation marks; if none, a space; if none, just cut at max_length
        if (cut_point := max(sentence.rfind(',', 0, max_length),
                             sentence.rfind(';', 0, max_length),
                             sentence.rfind(':', 0, max_length))) <= 0:
            if (cut_point := sentence.rfind(' ', 0, max_length)) <= 0:
                cut_point = max_length
        sentences.append(sentence[:cut_point+1].strip())
        # Rest of the sentence
        sentence = sentence[cut_point+1:].strip()
    sentences.append(sentence)
    return sentences

def save_to_file(lines, original_filename, max_length):
    now = datetime.datetime.now()
    timestamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    basename = original_filename.split('.')[0]
    ext = original_filename.split('.')[1]
    new_filename = f"{timestamp}_{basename}_split_{max_length}.{ext}"
    with open(new_filename, 'w', encoding='utf-8') as new_file:
        for line in lines:
            new_file.write(line + '\n')
    return new_filename

def split_and_save_text(original_filename, max_length=239):
    with open(original_filename, 'r', encoding='utf-8') as file:
        text = file.readlines()
    # "Normalize" the text: delete empty lines, end all lines with "."
    # because only lines ending with '.' generate a pause after them.
    # I did this because things like:
    #
    #   "Don't explain your philosophy. Embody it."
    #   ― Epictetus
    #
    # were "joined" with the next line in the text. The opposite goes for
    # lines processed in shorten_sentence.
    text3 = [reformat_line(line) for line in text if line.strip()]
    # Split sentences at "primary" punctuation marks
    text2 = []
    for line in text3:
        if line.startswith('#'):
            text2.append(line)
        else:
            lines = split_sentence(line)
            text2 += lines
    # Split sentences longer than max_length at "secondary" punctuation
    # marks, or at a space
    text3 = []
    for line in text2:
        if len(line) <= max_length:
            text3.append(line)
        else:
            for line2 in shorten_sentence(line, max_length):
                text3.append(line2)
    print(save_to_file(text3, original_filename, max_length))

if __name__ == "__main__":
    fire.Fire(split_and_save_text)
You can run it with: python3 thisprogram.py textfile.txt max_length
(I set the max_length default to 239 because that is the XTTS maximum for Spanish.)
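(For reference, those per-language caps appear to come from the char_limits table in Coqui's XTTS tokenizer, which is where the 273-character French limit in the original error message comes from.)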
So far it's working very well: it can even split sentences that have no punctuation marks, the result sounds natural when played back, and together with the modifications to your code that I described above, the reading sounds really great. I'm using it only with XTTS.
Once again, thank you very much!
Hi. Does this branch do the split into sentences? Or should I keep using the code above to split, and then process with this branch?
This branch splits into sentences, but it would be worth trying it with --debug and looking at the file debug.txt to see if it is splitting where you expect it to. The output will have one line for each sentence it sends to TTS.
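For example (file name just illustrative): run epub2tts yourbook.txt --debug, then look at debug.txt in the working directory; each line corresponds to one chunk sent to TTS.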
Ok. I will give it a go tomorrow and compare with the current results. Thanks!
Hello. I've tested the sentences-pause branch with two test texts I have.
The beginning is:
Capítulo 1º
Es imposible que un hombre aprenda lo que cree que ya sabe. La dificultad muestra lo que son los hombres
Epicteto
Tal vez te hayas topado con una cita inteligente de un antiguo filósofo estoico o hayas leído un artículo que compartía algunas ideas estoicas inspiradoras. Tal vez un amigo te haya hablado de esa antigua filosofía útil y próspera o hayas estudiado un libro o dos sobre el estoicismo. O, tal vez, aunque hay muy pocas probabilidades, nunca hayas oído hablar de ella.
The only thing I've found is that it joins the 2nd, 3rd, and 4th lines, reading:
Capítulo 1º
Es imposible que un hombre aprenda lo que cree que ya sabe. La dificultad muestra lo que son los hombres Epicteto Tal vez te hayas topado con una cita inteligente de un antiguo filósofo estoico o hayas leído un artículo blah blah blah.....
The rest worked ok. The output:
Computing speaker latents...
Reading from 1 to 1
0%| | 0/24 [00:00<?, ?it/s]Capítulo 1º,
------------------------------------------------------
Free memory : 4.304443 (GigaBytes)
Total memory: 7.921936 (GigaBytes)
Requested memory: 0.335938 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
WorkSpace: 0x7f4bd6000000
------------------------------------------------------
Time to first chunck: 1.2127406597137451
Received chunk 0 of audio length 24576
Skipping whisper transcript comparison
4%|██████▍ | 1/24 [00:01<00:29, 1.28s/it
Es imposible que un hombre aprenda lo que cree que ya sabe,
Time to first chunck: 0.9801130294799805
Received chunk 0 of audio length 51200
Skipping whisper transcript comparison
8%|████████████▊ | 2/24
[00:02<00:25, 1.15s/it] La dificultad muestra lo que son los hombres
Epicteto
Tal vez te hayas topado con una cita inteligente de un antiguo filósofo estoico o hayas leído un artículo que compartía algunas ideas estoicas inspiradoras,
Time to first chunck: 1.231780767440796
Received chunk 0 of audio length 65792
Time to first chunck: 2.5522217750549316
Received chunk 0 of audio length 66816
Time to first chunck: 4.093189477920532
Received chunk 0 of audio length 66816
Time to first chunck: 5.565227270126343
Received chunk 0 of audio length 66816
Time to first chunck: 6.066920280456543
Received chunk 0 of audio length 10240
Skipping whisper transcript comparison
12%|███████████████████▏ | 3/24
[00:08<01:14, 3.57s/it]
This is something I've run into as well: the lines must end with a punctuation mark for the split to work. That is why I do:
def reformat_line(line):
    line = line.strip()
    if not line.endswith("."):
        line += "."
    return line
This wrongly appends a period when the line already ends with ",", "?", or another punctuation mark, so maybe something like:

line = line.strip()
if line[-1] not in (".", "!", "?", ","):
    line += ","

could work.
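A slightly more idiomatic variant of the same idea, which also avoids an IndexError on empty lines:

def reformat_line(line):
    line = line.strip()
    if line and not line.endswith((".", "!", "?", ",")):
        line += ","
    return line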
Hello,
If I update the installation of epub2tts on my machine, will these enhancements be automatically made available?
Please share a sample if you are able to reproduce this error with the current release. Since we now send only one sentence at a time to TTS, I believe this issue is resolved.