VinAIResearch/XPhoneBERT

Text Normalization Process

qwertyflagstop opened this issue · 1 comment

First of all, thanks for putting this up! Maybe I missed it somewhere, but can you explain what the text normalization process should be when making new datasets? I know that there are two steps, word segmentation and text normalization. For languages like English I'm assuming that word segmentation just relies on spaces, but I'm a bit lost on the text normalization. Here is my best guess; does it look correct?

import re
from num2words import num2words
from nltk.tokenize import word_tokenize

def normalize_text(text):
    # Improvised: I have no idea what the real "normalization" is,
    # but this seems to match what's in the sample dataset.
    # Convert to lowercase
    text = text.lower()

    # Convert numbers to words
    text = re.sub(r'\b\d+\b', lambda x: num2words(int(x.group())), text)

    # Replace opening and closing quotes
    text = re.sub(r"(\s)\"(\w)", lambda m: m.group(1) + "``" + m.group(2), text)
    text = re.sub(r"(\w)\"(\s)", lambda m: m.group(1) + "''" + m.group(2), text)

    # Tokenize text
    tokens = word_tokenize(text)

    # Join tokens back together with spaces in between
    normalized_text = ' '.join(tokens)

    return normalized_text

Hi @qwertyflagstop. Thanks for your interest!
Text normalization is an important preprocessing step before TTS. You can read more here: link.
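
For readers who want to try the steps from the question without extra dependencies, here is a minimal standard-library sketch. It is only an illustration and not the XPhoneBERT authors' pipeline: the names `spell_out` and `normalize_text_sketch` are hypothetical, integers are only spelled out in the range 0 to 999, and decimals, dates, currency and abbreviations are not handled.

```python
import re

# Hypothetical helpers for illustration; not part of XPhoneBERT.
_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]


def spell_out(n: int) -> str:
    """Spell out an integer in the range 0..999 in English words."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("-" + _ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = _ONES[hundreds] + " hundred"
    return words + (" and " + spell_out(rest) if rest else "")


def normalize_text_sketch(text: str) -> str:
    """Lowercase, spell out small integers, and separate punctuation."""
    text = text.lower()

    # Spell out standalone integers of up to three digits. The lookarounds
    # skip digits that belong to a longer number or a decimal such as 3.14.
    text = re.sub(
        r"(?<![\d.,])\b\d{1,3}\b(?![.,]?\d)",
        lambda m: spell_out(int(m.group())),
        text,
    )

    # Surround common punctuation marks with spaces so each becomes a token.
    text = re.sub(r"([.,!?;:\"()])", r" \1 ", text)

    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_text_sketch('I have 2 cats, and 42 "dogs".'))
```

Whether the output of a sketch like this matches what the model was trained on still has to be checked against the sample dataset, since the tokenizer and number-expansion rules used there are not stated in this thread.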