Treat Hawaiian Glottal stop as consonant, not punctuation
Closed this issue · 4 comments
I'm struggling to get the ʻokina, the Hawaiian glottal stop character (U+02BB), treated as a letter rather than as punctuation in SentencePiece subword tokenization, with either BPE or Unigram. Can someone tell me how to achieve that, please?
This is what I attempted:
spm_train --input=tgt-train.txt --model_prefix=data/tgt_spm --vocab_size=32000 --model_type=bpe --character_coverage=1.0 --output_format=piece --input_sentence_size=1000000 --user_defined_symbols=ʻa,ʻe,ʻi,ʻo,ʻu,ʻā,ʻē,ʻī,ʻō,ʻū
However, only the specific tokens listed above appear in the resulting vocab file, whereas I want every token that comes from a word containing the glottal stop to keep the glottal stop.
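For reference, a quick check with Python's standard `unicodedata` module suggests that the ʻokina is already a letter as far as its Unicode general category goes, unlike the look-alike quotation marks. The helper name `describe` is only for illustration.

```python
import unicodedata

def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, general category, Unicode name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Compare the ʻokina with two characters that are often typed in its place.
for ch in ("\u02bb", "\u2018", "'"):
    print(*describe(ch))
# U+02BB Lm MODIFIER LETTER TURNED COMMA
# U+2018 Pi LEFT SINGLE QUOTATION MARK
# U+0027 Po APOSTROPHE
```

Category `Lm` is a modifier letter, while `Pi` and `Po` are punctuation, so the splitting is probably not caused by a punctuation class as such.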
Could you try --split_by_unicode_script=false when training the model with spm_train? This option disables the pre-tokenization based on Unicode script. Note that all other punctuation is then treated the same way, so punctuation marks might become part of words.
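A minimal sketch of the adjusted training call, assuming the same input file and model prefix as in the original command. The helper names `build_spm_train_cmd` and `run_training` are hypothetical, and dropping `--user_defined_symbols` is an assumption that the hand-listed pieces are no longer needed once script splitting is off.

```python
import shlex
import shutil
import subprocess

def build_spm_train_cmd(input_path: str, model_prefix: str, vocab_size: int = 32000) -> list[str]:
    """Assemble an spm_train command line with Unicode-script splitting disabled."""
    return [
        "spm_train",
        f"--input={input_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        "--model_type=bpe",
        "--character_coverage=1.0",
        "--input_sentence_size=1000000",
        # The suggested change: do not pre-split pieces at script boundaries.
        "--split_by_unicode_script=false",
    ]

def run_training(cmd: list[str]) -> None:
    """Run the command only if the spm_train binary is on PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)

cmd = build_spm_train_cmd("tgt-train.txt", "data/tgt_spm")
print(shlex.join(cmd))  # inspect the command before running it
```

Calling `run_training(cmd)` would start the actual training; printing the command first makes it easy to confirm the flags.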
Thank you for that. Yes, it does work, but how do I handle punctuation after that?
--split_by_unicode_script=true (the default configuration) performs pre-tokenization that prevents punctuation from being attached to non-punctuation characters. (More specifically, each token will consist of characters with the same script type.)
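To illustrate why the ʻokina gets separated under the default, here is a toy sketch of splitting a word into same-script runs. It is not SentencePiece's implementation: `toy_script` is a crude stand-in for the Unicode Script property that calls a character "Latin" when its Unicode name starts with LATIN and "Common" otherwise. As far as I can tell, the real Script data also lists U+02BB as Common rather than Latin, which would explain the behavior.

```python
import unicodedata
from itertools import groupby

def toy_script(ch: str) -> str:
    # Assumption: approximate the script from the character name.
    return "Latin" if unicodedata.name(ch, "").startswith("LATIN") else "Common"

def split_by_script(word: str) -> list[str]:
    """Group consecutive characters that share the same (toy) script."""
    return ["".join(run) for _, run in groupby(word, key=toy_script)]

print(split_by_script("Hawai\u02bbi"))      # ['Hawai', 'ʻ', 'i']
print(split_by_script("\u02bb\u014dlelo"))  # ['ʻ', 'ōlelo']
```

With splitting at script boundaries, no piece can span both the ʻokina and the neighboring Latin letters, so no learned piece carries the glottal stop.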
There is no other way to control this behavior at the moment. src/unicode_script_map.h defines the script type of each character, so you might want to modify the script type manually.
Thank you. Works like a charm.