huggingface/transformers

Tokenizers behave differently under NFD/NFKD vs. NFC/NFKC normalization when lowercasing Turkish (and probably some other languages)

Closed this issue · 10 comments

Transformers: 3.0.2
Tokenizers: 0.8.1

Hi. First of all, thanks for this great library. This is my first issue here. I work at Loodos Tech as an NLP R&D Engineer in Turkey. We are pretraining and finetuning Turkish BERT/ALBERT/ELECTRA models and publishing them.

I found a bug in the tokenizers for Turkish (and possibly other languages that use a non-ASCII alphabet).

For example,

TEXT = "ÇOCUK ŞANLIURFA'DAN GELENLERİ ÖĞÜN OLARAK YİYOR"

bt = BertTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=True)
assert bt.tokenize(TEXT) == ['co', '##cuk', 'san', '##li', '##ur', '##fa', "'", 'dan', 'gelenleri', 'o', '##gun', 'olarak', 'yiyor']

But it should be,

assert bt.tokenize(TEXT) == ['çocuk', 'şanlıurfa', "'", 'dan', 'gelenleri', 'öğün', 'olarak', 'yiyor']

Same for ALBERT tokenizer,

TEXT = "ÇOCUK ŞANLIURFA'DAN GELENLERİ ÖĞÜN OLARAK YİYOR"

at = AlbertTokenizer.from_pretrained("loodos/albert-base-turkish-uncased", do_lower_case=True, keep_accents=False)
assert at.tokenize(TEXT) == ['▁c', 'oc', 'uk', '▁san', 'li', 'urfa', "'", 'dan', '▁gelenleri', '▁o', 'gun', '▁olarak', '▁yiyor']

But it should be,

assert at.tokenize(TEXT) == ['▁çocuk', '▁şanlıurfa', "'", 'dan', '▁gelenleri', '▁öğün', '▁olarak', '▁yiyor']

This is caused by two things:
1- The vocabulary and SentencePiece model were created with NFC/NFKC normalization, but the tokenizer uses NFD/NFKD. NFD/NFKD normalization alters text containing the Turkish characters I-ı, İ-i, Ç-ç, Ö-ö, Ş-ş, Ğ-ğ, Ü-ü. This causes wrong tokenization, wrong training and loss of information; some tokens (like "şanlıurfa", "öğün", "çocuk", etc.) are never trained. NFD/NFKD normalization is not suitable for Turkish.

For BERT and ELECTRA, the tokenizer executes this code when do_lower_case=True:

def _run_strip_accents(self, text):
    """Strips accents from a piece of text."""
    text = unicodedata.normalize("NFD", text)
    output = []
    for char in text:
        cat = unicodedata.category(char)
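        # "Mn" = Mark, Nonspacing: the invisible combining accents produced by the NFD step above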
        if cat == "Mn":
            continue
        output.append(char)
    return "".join(output)

For ALBERT, the tokenizer executes this code when keep_accents=False:

if not self.keep_accents:
    outputs = unicodedata.normalize("NFKD", outputs)
    outputs = "".join([c for c in outputs if not unicodedata.combining(c)])

2- 'I' is not the uppercase of 'i' in Turkish. Python's default lower()/casefold() functions do not handle this (see: https://stackoverflow.com/questions/19030948/python-utf-8-lowercase-turkish-specific-letter). Turkish needs something like:

if is_turkish:
    lower = lower.replace('\u0049', '\u0131')  # I -> ı
    lower = lower.replace('\u0130', '\u0069')  # İ -> i
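
For reference, a quick check with plain Python (standard library only) shows why the default lowercasing is not enough here:

print("I".lower())   # 'i'  -- but in Turkish the lowercase of 'I' is 'ı'
print("İ".lower())   # 'i' + U+0307 (combining dot above), length 2, not a plain 'i'
print("ÇOCUK ŞANLIURFA".lower())  # 'çocuk şanliurfa' -- note 'li' instead of 'lı'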

This normalization issue probably affects some other languages too. For ASCII text, NFD and NFC behave the same, but for Turkish they do not.

Could you please add optional parameters for the normalization function and for is_turkish? We need NFKC normalization and a casefold that maps I -> ı.

Thanks...

Did you experiment with the FastTokenizers from https://github.com/huggingface/tokenizers?

cc @n1t0

Yes, it is the same.

from transformers import BertTokenizerFast

TEXT = "ÇOCUK ŞANLIURFA'DAN GELENLERİ ÖĞÜN OLARAK YİYOR"

bt = BertTokenizerFast.from_pretrained("bert-base-turkish-uncased")
print(bt.tokenize(TEXT))

['co', '##cuk', 'san', '##li', '##ur', '##fa', "'", 'dan', 'gelenleri', 'o', '##gun', 'olarak', 'yiyor']

But it should be: ['çocuk', 'şanlıurfa', "'", 'dan', 'gelenleri', 'öğün', 'olarak', 'yiyor']

We developed a custom normalization module here. For now, we use the tokenizer like this:

from transformers import BertTokenizerFast
from text_normalization import TextNormalization

bt = BertTokenizerFast.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False)

norm = TextNormalization()
TEXT = "ÇOCUK ŞANLIURFA'DAN GELENLERİ ÖĞÜN OLARAK YİYOR"
LOWER = norm.normalize(TEXT)

print(bt.tokenize(LOWER))

and it gives: ['çocuk', 'şanlıurfa', "'", 'dan', 'gelenleri', 'öğün', 'olarak', 'yiyor']
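
For anyone who cannot use our module, the lowercasing step can be approximated with the standard library alone. This is only a rough sketch of the idea, not the actual TextNormalization code:

import unicodedata

def turkish_lower(text):
    # Hypothetical helper: handle the Turkish I/İ pair first, then use
    # Python's default lowercasing followed by NFKC normalization.
    text = text.replace("I", "\u0131")   # I -> ı (dotless i)
    text = text.replace("\u0130", "i")   # İ -> i
    return unicodedata.normalize("NFKC", text.lower())

TEXT = "ÇOCUK ŞANLIURFA'DAN GELENLERİ ÖĞÜN OLARAK YİYOR"
print(bt.tokenize(turkish_lower(TEXT)))
# expected to match the output above:
# ['çocuk', 'şanlıurfa', "'", 'dan', 'gelenleri', 'öğün', 'olarak', 'yiyor']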

Could you please add config parameters in tokenizer_config.json for:

  • unicodedata normalization function type (NFD, NFKD, NFC, NFKC)
  • is_turkish (I -> ı, İ -> i)
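
In the meantime, a possible workaround with the fast tokenizer is to swap out its normalizer by hand. This is only a rough, untested sketch, and it assumes a newer tokenizers/transformers release than the 0.8.1/3.0.2 above, where (as far as I know) the Replace normalizer and the backend_tokenizer attribute are available:

from transformers import BertTokenizerFast
from tokenizers import normalizers

bt = BertTokenizerFast.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False)

# Replace the default normalization pipeline with a Turkish-aware one:
# map I -> ı and İ -> i first, then apply NFKC and lowercase, and never strip accents.
bt.backend_tokenizer.normalizer = normalizers.Sequence([
    normalizers.Replace("I", "ı"),
    normalizers.Replace("İ", "i"),
    normalizers.NFKC(),
    normalizers.Lowercase(),
])

TEXT = "ÇOCUK ŞANLIURFA'DAN GELENLERİ ÖĞÜN OLARAK YİYOR"
print(bt.tokenize(TEXT))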

Hello Sir,

Is there any update on this issue?

Hi @abdullaholuk-loodos,

BertTokenizer is based on WordPiece, which is a subword segmentation algorithm. It may split a word into more than one piece; this is how out-of-vocabulary words can be represented. You should not expect to see exact word tokens.

Hi @erncnerky, thanks for the reply.

You misunderstood me. I am not talking about the subword segmentation algorithm; I am talking about the normalization applied before tokenization.

When do_lower_case=True, the tokenizer calls the _run_strip_accents(self, text) function:

def _run_strip_accents(self, text):

This function calls unicodedata.normalize("NFD", text). NFD normalization is not suitable for Turkish because of the characters "ç Ç, ü Ü, ş Ş, ğ Ğ, i İ, ı I": it decomposes them and adds invisible combining characters to the text, while NFC/NFKC does not. If you change NFD to NFC or NFKC, the result changes, because these invisible characters lead to different tokenizations.

The corpus was normalized with NFKC before the subword algorithm was run, so it is correct and contains no invisible characters. But at inference time, NFD normalization changes Turkish text and produces wrong text with invisible characters.

Please try this:

TEXT1 = "ÇOCUK ŞANLIURFA'DAN GELENLERİ ÖĞÜN OLARAK YİYOR"
  
bt = AutoTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=True)
print(bt.tokenize(TEXT1)) 

TEXT2 = "çocuk şanlıurfa'dan gelenleri öğün olarak yiyor"

bt = AutoTokenizer.from_pretrained("loodos/bert-base-turkish-uncased", do_lower_case=False)
print(bt.tokenize(TEXT2))

As you can see, TEXT2 is the correctly lowercased TEXT1, but the results differ because of _run_strip_accents's NFD step before tokenization.

The same happens with the ALBERT tokenizer's keep_accents=False parameter.

FYI, @julien-c
FYI, @n1t0

I had seen the problem. I wrote that comment because you gave exact word tokens, which are mostly not what you should expect, especially for morphologically rich languages such as Turkish.

Thank you for your interest.

Could you mention the admins and upvote the issue to draw attention to it?

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.

Are there any changes?

Any workarounds so far? I came across the same issue.