Text Normalization Process
qwertyflagstop opened this issue · 1 comment
qwertyflagstop commented
First of all, thanks for putting this up! Maybe I missed it somewhere, but can you explain what the text normalization process should be for making new datasets? I know there are word segmentation and text normalization steps. For languages like English, I'm assuming word segmentation just relies on spaces, but I'm a bit lost on the text normalization. Here is my best guess; does it look correct?
import re

from num2words import num2words
from nltk.tokenize import word_tokenize


def normalize_text(text):
    # Improvised: I have no idea what the real "normalization" is,
    # but this seems to match what's in the sample dataset.
    # Convert to lowercase
    text = text.lower()
    # Convert standalone numbers to words
    text = re.sub(r'\b\d+\b', lambda x: num2words(int(x.group())), text)
    # Replace opening and closing double quotes with Treebank-style `` and ''
    text = re.sub(r"(\s)\"(\w)", lambda m: m.group(1) + "``" + m.group(2), text)
    text = re.sub(r"(\w)\"(\s)", lambda m: m.group(1) + "''" + m.group(2), text)
    # Tokenize text
    tokens = word_tokenize(text)
    # Join tokens back together with spaces in between
    normalized_text = ' '.join(tokens)
    return normalized_text
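For reference, this is roughly what the function above produces on a simple sentence. The exact tokens depend on your NLTK version and on having the punkt tokenizer data downloaded, so treat this as a sketch of the expected behavior rather than a guaranteed output:

# One-off setup: word_tokenize needs the punkt models.
# import nltk; nltk.download('punkt')

print(normalize_text('The price was 42 dollars.'))
# -> 'the price was forty-two dollars .'
# "42" is expanded by num2words, and word_tokenize splits the final
# period into its own token, so it ends up space-separated.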
thelinhbkhn2014 commented
Hi @qwertyflagstop. Thanks for your interest!
Text normalization is an important preprocessing step before TTS. You can read more here: link.