facebookresearch/fairseq

Cannot translate the whole paragraph/sentences

Opened this issue · 1 comment

๐Ÿ› Bug

When translating from eng_Latn to zho_Hant, parts of the input are consistently left untranslated. This does not happen with zho_Hans; even yue_Hant produces better results than zho_Hant.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

import ctranslate2
import transformers

src_lang = "eng_Latn"
tgt_lang = "zho_Hant"

# CTranslate2-converted NLLB model directory and the matching HF tokenizer.
translator = ctranslate2.Translator("nllb-200-distilled-600M")
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=src_lang)

content = """
A database of Chinese surnames and Chinese given names (1930-2008). This database contains nationwide frequency statistics of 1,806 Chinese surnames and 2,614 Chinese characters used in given names, covering about 1.2 billion Han Chinese population (96.8% of the Han Chinese household-registered population born from 1930 to 2008 and still alive in 2008). This package also contains a function for computing multiple features of Chinese surnames and Chinese given names for scientific research (e.g., name uniqueness, name gender, name valence, and name warmth/competence).
"""

# Tokenize the source text and translate, passing the target language code
# as the decoder prefix.
source = tokenizer.convert_ids_to_tokens(tokenizer.encode(content))
target_prefix = [tgt_lang]
results = translator.translate_batch([source], target_prefix=[target_prefix])

# Drop the leading language token before decoding.
target = results[0].hypotheses[0][1:]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))

In this case, the first sentence, "A database of Chinese surnames and Chinese given names (1930-2008).", is left untranslated. The same issue occurs when using transformers alone (see the sketch below).
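
For reference, the transformers-only reproduction would look roughly like the following. This is a sketch based on the standard NLLB usage from the transformers documentation; `content` is the same paragraph as above, and `max_length=512` is an assumed generation limit, not a value from the original run:

import transformers

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = transformers.AutoModelForSeq2SeqLM.from_pretrained(model_name)

content = "A database of Chinese surnames and Chinese given names (1930-2008). ..."  # same paragraph as above

# Force decoding to start with the zho_Hant language token.
inputs = tokenizer(content, return_tensors="pt")
translated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zho_Hant"),
    max_length=512,  # assumed limit, long enough for this paragraph
)
print(tokenizer.decode(translated[0], skip_special_tokens=True))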

The same happens with nllb-200-distilled-1.3B.
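
Not part of the original report, but since NLLB-200 is trained largely on sentence-level bitext, a common mitigation is to split the paragraph into sentences and translate them one at a time. A minimal sketch, reusing `translator`, `tokenizer`, `tgt_lang`, and `content` from the reproduction above (the naive split on ". " is only for illustration; a proper sentence segmenter should replace it):

# Workaround sketch: translate sentence by sentence instead of the whole paragraph.
sentences = [s.strip() for s in content.split(". ") if s.strip()]
sources = [tokenizer.convert_ids_to_tokens(tokenizer.encode(s)) for s in sentences]
results = translator.translate_batch(sources, target_prefix=[[tgt_lang]] * len(sources))
translations = [
    tokenizer.decode(tokenizer.convert_tokens_to_ids(r.hypotheses[0][1:]))
    for r in results
]
print(" ".join(translations))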