Cannot translate whole paragraphs: some sentences are left untranslated
🐛 Bug
When translating from eng_Latn to zho_Hant, parts of the input are always left out of the translation. This does not happen with zho_Hans. Even yue_Hant does better than zho_Hant.
To Reproduce
Steps to reproduce the behavior (always include the command you ran):
import ctranslate2
import transformers
src_lang = "eng_Latn"
tgt_lang = "zho_Hant"
translator = ctranslate2.Translator("nllb-200-distilled-600M")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=src_lang)
content = """
A database of Chinese surnames and Chinese given names (1930-2008). This database contains nationwide frequency statistics of 1,806 Chinese surnames and 2,614 Chinese characters used in given names, covering about 1.2 billion Han Chinese population (96.8% of the Han Chinese household-registered population born from 1930 to 2008 and still alive in 2008). This package also contains a function for computing multiple features of Chinese surnames and Chinese given names for scientific research (e.g., name uniqueness, name gender, name valence, and name warmth/competence).
"""
source = tokenizer.convert_ids_to_tokens(tokenizer.encode(content))
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix])
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
In this case, the first sentence, "A database of Chinese surnames and Chinese given names (1930-2008).", is not translated. The same issue occurs when using transformers alone.
It also happens with nllb-200-distilled-1.3B.
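A possible workaround, not a fix for the underlying behavior, is to translate the paragraph one sentence at a time so that no sentence can be dropped from a longer input. The sketch below assumes a naive regex splitter is good enough for text like the example above. The helper names `split_sentences`, `translate_by_sentence`, and `make_ct2_backend` are hypothetical and not part of ctranslate2 or transformers. The backend only reuses the calls already shown in the reproduction script.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_by_sentence(
    text: str,
    translate_batch: Callable[[List[str]], List[str]],
    joiner: str = "",
) -> str:
    """Translate each sentence separately and join the results.

    `translate_batch` is any callable mapping a list of source strings
    to a list of translated strings of the same length.
    """
    sentences = split_sentences(text)
    return joiner.join(translate_batch(sentences))


def make_ct2_backend(translator, tokenizer, tgt_lang: str) -> Callable[[List[str]], List[str]]:
    """Hypothetical adapter around the objects created in the reproduction script."""

    def backend(sentences: List[str]) -> List[str]:
        sources = [
            tokenizer.convert_ids_to_tokens(tokenizer.encode(s)) for s in sentences
        ]
        # One target prefix (the language token) per source sentence.
        prefixes = [[tgt_lang] for _ in sources]
        results = translator.translate_batch(sources, target_prefix=prefixes)
        outputs = []
        for result in results:
            target = result.hypotheses[0][1:]  # drop the language token
            outputs.append(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
        return outputs

    return backend
```

With the objects from the script above, this would be called as `translate_by_sentence(content, make_ct2_backend(translator, tokenizer, tgt_lang))`. Whether this recovers the missing sentence for zho_Hant has not been verified here. It only guarantees that every sentence is submitted to the model on its own.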