punctuation missing

Question

thanhlong1997 opened this issue a year ago · 1 comments

thanhlong1997 commented a year ago

Hi, I am trying XPhoneBERT for a Vietnamese TTS system, and I find that punctuation characters are simply skipped when the input sequence is converted to a phoneme sequence with the Text2PhonemeSequence library.
For example:

import torch
from transformers import AutoModel, AutoTokenizer
from text2phonemesequence import Text2PhonemeSequence

# Load XPhoneBERT model and its tokenizer
xphonebert = AutoModel.from_pretrained("vinai/xphonebert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")

# Load Text2PhonemeSequence
text2phone_model = Text2PhonemeSequence(language="vie-n", is_cuda=False)

# Input sequence that is already word-segmented (and text-normalized if applicable)
sentence1 = "dù sao tiền cũng đã trả rồi, chờ xem phản ứng từ thị trường thế nào đã rồi nói tiếp."
sentence2 = "dù sao tiền cũng đã trả rồi. chờ xem phản ứng từ thị trường thế nào đã rồi nói tiếp."
input_phonemes1 = text2phone_model.infer_sentence(sentence1)
input_phonemes2 = text2phone_model.infer_sentence(sentence2)

The phoneme sequences for the two inputs are identical:

z u ˧˨ ▁ s a w ˧˧ ▁ t i ə n ˧˨ ▁ k u ŋ͡m ˧ˀ˥ ▁ d a ˧ˀ˥ ▁ c a ˧˩˨ ▁ z o j ˧˨ ▁ c ɤ ˧˨ ▁ s ɛ m ˧˧ ▁ f a n ˧˩˨ ▁ ɯ ŋ ˨˦ ▁ t ɯ ˧˨ ▁ tʰ i ˨ˀ˩ ʔ ▁ c ɯ ə ŋ ˧˨ ▁ tʰ e ˨˦ ▁ n a w ˧˨ ▁ d a ˧ˀ˥ ▁ z o j ˧˨ ▁ n ɔ j ˨˦ ▁ t i ə p ˦˥

Because the comma and the period both disappear, the model has no way to learn where the breaks between sentence parts fall.
Please check it out!
Thank you!

Answer 1 · 2023-06-20T05:16:41.000Z

As stated in the README file, you have to perform Vietnamese word segmentation on the input first.
The input should be: dù_sao tiền cũng đã trả rồi , chờ xem phản_ứng từ thị_trường thế_nào đã rồi nói tiếp .
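
For the punctuation part of that format, a minimal sketch of a preprocessing step could look like the following. The helper `separate_punctuation` is hypothetical and not part of text2phonemesequence; it only puts spaces around common punctuation marks so that each one becomes its own token. It does not join multi-syllable words with underscores (for example `dù_sao`), which still requires a proper Vietnamese word segmenter, and it would also split decimals such as `3.5`, so text normalization should come first.

```python
import re

# Hypothetical helper, not part of text2phonemesequence.
# Matches one common punctuation mark plus any whitespace around it.
_PUNCT = re.compile(r"\s*([.,!?;:])\s*")


def separate_punctuation(text: str) -> str:
    """Return text with each punctuation mark as a standalone token."""
    spaced = _PUNCT.sub(r" \1 ", text)
    # Collapse any repeated whitespace and trim the ends.
    return " ".join(spaced.split())


raw = "dù sao tiền cũng đã trả rồi, chờ xem phản ứng từ thị trường thế nào đã rồi nói tiếp."
print(separate_punctuation(raw))
```

The result can then be passed to `text2phone_model.infer_sentence(...)` as in the question. Whether the punctuation is kept in the phoneme output should be verified by rerunning the comparison on the two sentences.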