google/sentencepiece

Treat Hawaiian Glottal stop as consonant, not punctuation

Closed this issue · 4 comments

I'm struggling to get the ʻokina, the Hawaiian glottal stop character (U+02BB), treated as a letter rather than as punctuation in SentencePiece subword tokenization, whether with BPE or Unigram. Can someone tell me how to achieve that, please?

This is what I attempted:

spm_train --input=tgt-train.txt --model_prefix=data/tgt_spm --vocab_size=32000 --model_type=bpe --character_coverage=1.0 --output_format=piece --input_sentence_size=1000000 --user_defined_symbols=ʻa,ʻe,ʻi,ʻo,ʻu,ʻā,ʻē,ʻī,ʻō,ʻū

However, only the specific tokens listed above appear in the resulting vocab file, whereas I want every token derived from a word containing the glottal stop to keep the glottal stop.
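Before changing any training options, it can help to confirm which character the corpus actually contains, since the ʻokina is often typed as a left single quotation mark or an ASCII apostrophe. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` is made up for illustration. Note that it reports the Unicode general category, which is not the same property as the Unicode script that SentencePiece's pre-tokenization relies on.

```python
import unicodedata

OKINA = "\u02bb"  # the ʻokina as given in the question (U+02BB)


def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


# The ʻokina and two characters that are commonly typed in its place.
for ch in (OKINA, "\u2018", "'"):
    print(describe(ch))
```

Here U+02BB is reported as a modifier letter (category `Lm`), while the two look-alikes fall into punctuation categories, so a corpus that mixes them would need normalizing first.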

Could you try --split_by_unicode_script=false when training the model with spm_train? This option disables the pre-tokenization based on Unicode script. Note that all other punctuation is then treated the same way, so punctuation marks might become part of words.
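For anyone training from Python instead of the command line, the same option can be passed as a keyword argument. This is a sketch, assuming the `sentencepiece` Python package is installed and reusing the file names and settings from the command in the question; the helper names `train_options` and `train` are hypothetical, not part of the library.

```python
def train_options(corpus_path, model_prefix):
    """Keyword arguments mirroring the spm_train flags quoted in this thread."""
    return dict(
        input=corpus_path,
        model_prefix=model_prefix,
        vocab_size=32000,
        model_type="bpe",
        character_coverage=1.0,
        # Disable the script-based pre-tokenization, as suggested above.
        split_by_unicode_script=False,
    )


def train(corpus_path, model_prefix):
    # Imported lazily so the options can be inspected without the package.
    import sentencepiece as spm  # assumes `pip install sentencepiece`

    spm.SentencePieceTrainer.train(**train_options(corpus_path, model_prefix))


# Example call (not run here): train("tgt-train.txt", "data/tgt_spm")
```

The user-defined symbols from the original command are left out on purpose, since the point of the option is that they should no longer be needed.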

Thank you for that. Yes, it does work, but how do I handle punctuation after that?

--split_by_unicode_script=true (the default configuration) performs pre-tokenization that prevents punctuation from being attached to non-punctuation characters. (More specifically, each token will consist of characters of the same script type.)
There is no other way to control this behavior at the moment. src/unicode_script_map.h defines the script type of each character, so you might want to manually modify the script type there.
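To illustrate why changing one character's script type keeps the rest of the punctuation handling intact, here is a toy model of script-based splitting. It is a deliberate simplification, not SentencePiece's actual code or data: the function names and the tiny script table are invented for this sketch, and it assumes the ʻokina is assigned a different script type than Latin letters, as the behavior described in this thread suggests.

```python
OKINA = "\u02bb"


def toy_script(ch, overrides=None):
    """Very rough script lookup; a stand-in for a real script map."""
    if overrides and ch in overrides:
        return overrides[ch]
    if ch.isascii() and ch.isalpha():
        return "Latin"
    if ch in "āēīōūĀĒĪŌŪ":
        return "Latin"
    return "Common"  # everything else, including punctuation


def split_by_script(text, overrides=None):
    """Start a new piece whenever the script type changes."""
    pieces = []
    prev = None
    for ch in text:
        script = toy_script(ch, overrides)
        if pieces and script == prev:
            pieces[-1] += ch
        else:
            pieces.append(ch)
        prev = script
    return pieces


word = "Hawai" + OKINA + "i"
print(split_by_script(word))                           # ʻokina isolated
print(split_by_script(word + "!", {OKINA: "Latin"}))   # ʻokina kept, "!" split off
```

With the override in place, the ʻokina stays inside the word while other punctuation is still separated, which is the effect the manual change to the script map is meant to achieve.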

Thank you. Works like a charm.