Inconsistent behaviour of `PreTrainedTokenizerFast`s on diacritic-marked texts
sven-nm opened this issue · 1 comment
System Info
- `transformers` version: 4.45.2
- Platform: Linux-5.4.0-193-generic-x86_64-with-glibc2.31
- Python version: 3.12.7
- Huggingface_hub version: 0.25.2
- Safetensors version: 0.4.5
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): not installed (NA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
`BatchEncoding.encodings[0].word_ids` has alignment errors when working with diacritics (i.e. combining accent marks). Here is a minimal working example:
from typing import List
import transformers
import unicodedata
# Instantiate the PreTrainedTokenizerFast
model_name = 'FacebookAI/xlm-roberta-base'
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, add_prefix_space=False)
# Example sentences
sentences: List[str] = [
"""And Odys- seus in 1. 1367, without caring to resent the sneer, simply reaffirms his right to take a line of his own, and pleads the reasonableness of his trying to win those in authority over to his side. On which Agamemnon (1. 1358) throws the entire responsibility on Odysseus, and Odysseus says (1. 1369), ‘ That makes no differ ence. Your consent, in whatever terms it is granted, will be equally kind.” If this is rejected, 1. 1366 must refer not to Odysseus’ words, but merely to his attitude of dissent. 1. 1367 is thus less pointed. For the meaning given to ἐνθάδ᾽ ἵξομαι, l. 136%, cp. Eur. Androm. 342, ἀλλ’ εἶσιν of xpf,—and for ὡς dv, 1. 1369, cp. O. C. 1361, and note. 1371. σοὶ μέν, ker A] For this un- gracious expression, cp. O. T. 671, 2, τό γὰρ σόν, οὐ τὸ τοῦδ᾽, ἐποικτείρω στόμα | ἐλεινόν, οὗτος δ᾽, ἔνθ᾽ ἂν 7, στυγήσεται. 1372. κἀκεῖ κἀνθάδ᾽ | E.on 1,. 841.}.γ8. 1373. σοὶ δὲ. ἃ Ἐχρή.] ‘You may do what you must:’ an ill-humoured way of saying, ‘Do as you please.” χρή, although rejected by Dindgrf and others in favour of χρῇς, i.e χρήζεις, is not inexpressive,and is possibly right. Cp. El. 606.—Exit Agamemnon. 1375. τοιοῦτον ὄντα] ‘While you act in this way. Cp. Phil. 1049, οὗ γὰρ τοιούτων δεῖ, τοιοῦτός εἰμ᾽ ἐγώ,""",
# """Hello, this is a long sentence in ASCII with a lot of words, and it should be tokenized correctly 1234 how do you feel ? Hello, my name Foo, I'm a friend of Bar.""",
]
# Convert to NFC to make sure there is no floating combining character
sentences = [unicodedata.normalize('NFC', s) for s in sentences]
# Let's start with a working scenario. Here, I pre-tokenize the inputs myself with a blunt
# whitespace split. After that, we run the tokenizer and look at the maximum index in
# `BatchEncoding.encodings[0].word_ids`: it should equal the number of input words minus 1.
sentences_pretokenized: List[List[str]] = [s.split() for s in sentences]
batch_encoding = tokenizer(sentences_pretokenized,  # ⚠️ Using the pretokenized sentences (List[List[str]])
                           padding=True,
                           truncation=True,
                           max_length=tokenizer.model_max_length,
                           pad_to_multiple_of=tokenizer.model_max_length,
                           add_special_tokens=True,
                           is_split_into_words=True)  # ⚠️ Setting this to True
max_word_id = max([word_id for word_id in batch_encoding.encodings[0].word_ids if word_id is not None])
number_of_words = len(sentences_pretokenized[0])
print(f"Max word_id: {max_word_id}") # 225 ✅
print(f"Real number of words: {number_of_words}") # 226
# Good, this is what we were hoping to see. Alignment is correct. However, let's look at what
# happens if I pass the raw sentences directly, which the tokenizer should also accept:
batch_encoding = tokenizer(sentences,  # ⚠️ Using the raw sentences (List[str])
                           padding=True,
                           truncation=True,
                           max_length=tokenizer.model_max_length,
                           pad_to_multiple_of=tokenizer.model_max_length,
                           add_special_tokens=True,
                           is_split_into_words=False)  # ⚠️ Setting this to False (default, but explicit for clarity)
max_word_id = max([word_id for word_id in batch_encoding.encodings[0].word_ids if word_id is not None])
number_of_words = len(sentences_pretokenized[0])
print(f"Max word_id: {max_word_id}") # 231 ❌ WRONG!
print(f"Real number of words: {number_of_words}") # 226
# Now let us see where the alignment starts to mismatch:
for word_id, token in zip(batch_encoding.encodings[0].word_ids, batch_encoding.encodings[0].tokens):
    if word_id is None:
        print(f"Token: {token}")
        continue
    try:
        print(f"Word: {sentences_pretokenized[0][word_id]},\t\tToken: {token}")
    except IndexError:  # the word_id points past the end of the whitespace-split words
        print("-------ERROR-------")
        print(f"Token: {token}")
# .....
# Word: ἐνθάδ᾽, Token: ▁ἐ
# Word: ἐνθάδ᾽, Token: ν
# Word: ἐνθάδ᾽, Token: θ
# Word: ἐνθάδ᾽, Token: άδ
# Word: ἵξομαι,, Token: ▁
# Word: ἵξομαι,, Token: ̓ # <--- This combining diacritic seems to be causing the issue
# Word: l., Token: ▁
# Word: l., Token: ἵ
# Word: l., Token: ξ
# Word: l., Token: ομαι
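# Optional extra check (a minimal sketch, continuing from the script above):
# scan the token strings for combining characters to see which mark is
# involved in the broken alignment.
for token in batch_encoding.encodings[0].tokens:
    for char in token:
        if unicodedata.combining(char):
            print(f"Token {token!r} contains combining mark "
                  f"U+{ord(char):04X} ({unicodedata.name(char, 'UNKNOWN')})")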
Expected behavior
NOTE. I am aware that similar problems have been raised before (huggingface/transformers#9637), and that the problem also exists with other models, even with ASCII-only examples (e.g. setting `model_name` to `bert-base-uncased` and using the second example only), but `FacebookAI/xlm-roberta-base` produces seamless alignment with ASCII characters.
I think there should be at least a warning, as misalignments can have dramatic downstream consequences (I am thinking notably of token classification tasks).
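A possible workaround in the meantime is to ignore `word_ids` for raw-string inputs and rebuild the token-to-word alignment from `return_offsets_mapping`, matching each token's character span against the whitespace-split words. Below is a rough sketch continuing from the code above; the helper `word_ids_from_offsets` is mine, not a transformers API, and it assumes the returned offsets refer to the original string.

def word_ids_from_offsets(sentence, offsets):
    """Map each token's (start, end) character span to the index of the
    whitespace-separated word containing its last character."""
    # Character spans of every whitespace-split word in the original string.
    word_spans, pos = [], 0
    for word in sentence.split():
        start = sentence.index(word, pos)
        word_spans.append((start, start + len(word)))
        pos = start + len(word)
    word_ids = []
    for start, end in offsets:
        if start == end:  # special tokens get (0, 0) offsets
            word_ids.append(None)
            continue
        # Match on the token's last character so that leading-space handling
        # (e.g. the "▁" metaspace prefix) does not push us off the word.
        word_ids.append(next((i for i, (ws, we) in enumerate(word_spans)
                              if ws < end <= we), None))
    return word_ids

encoding_with_offsets = tokenizer(sentences[0], return_offsets_mapping=True)
recovered = word_ids_from_offsets(sentences[0], encoding_with_offsets["offset_mapping"])
print(max(w for w in recovered if w is not None))  # should recover 225, i.e. number of words - 1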
Hey! Pretty sure this is related to the `tokenizers` library directly. I don't have time to investigate as of right now, hope someone can help! 🤗
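To help narrow this down, here is a minimal sketch of how one could check whether the misalignment already shows up with the standalone `tokenizers` library (assuming the hub repo ships a `tokenizer.json`), using the Greek example from the reproduction above:

# Minimal sketch: run the same check directly against `tokenizers`, bypassing
# transformers, to see whether the word_ids are already off at that level.
# Note that for raw-string input the "words" are whatever the pre-tokenizer
# produces, so the comparison with a plain whitespace split is only indicative.
from tokenizers import Tokenizer

raw_tokenizer = Tokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
sentence = sentences[0]  # the Greek example from above
encoding = raw_tokenizer.encode(sentence)

max_word_id = max(w for w in encoding.word_ids if w is not None)
print(f"Max word_id from tokenizers directly: {max_word_id}")
print(f"Whitespace word count: {len(sentence.split())}")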