Tibetan sentence tokenizer designed specifically for data preparation.
pip install git+https://github.com/OpenPecha/bo_sent_tokenizer.git
Important Note: If speed is essential, prefer sentence segmentation (segment) over sentence tokenization (tokenize).
from bo_sent_tokenizer import tokenize
text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"
tokenized_text = tokenize(text)
print(tokenized_text) #Output:> 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n'
The code is adapted from op_mt_tools, with minor changes to produce the desired output shown above. The input in the example contains four sentences:
The text 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།' is clean Tibetan text.
The text 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།' contains an illegal token 'བབབབབབབབནམ'.
The text 'ངའི་མིང་ལ་Thomas་ཟེར།' includes characters from another language.
The text 'ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།' contains the non-Tibetan symbols '(' and ')'.
If the text is clean, it is retained. If a sentence contains an illegal token or characters from another language, that sentence is excluded. If a sentence contains non-Tibetan symbols, these symbols are filtered out, and the sentence is retained.
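The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the library's implementation: the function name `apply_rules`, the `is_legal_syllable` callback, and the use of the Unicode Tibetan block (U+0F00 to U+0FFF) as the definition of "Tibetan character" are all hypothetical choices made for this example. How the library actually detects an illegal token is not described here, so the check is left to a caller-supplied function.

```python
import unicodedata

TSEK = "\u0f0b"  # syllable delimiter
SHAD = "\u0f0d"  # sentence-final mark


def is_tibetan(ch):
    """Assumption: a character is Tibetan if it lies in the Unicode Tibetan block."""
    return "\u0f00" <= ch <= "\u0fff"


def apply_rules(sentence, is_legal_syllable=lambda syllable: True):
    """Return the cleaned sentence, or None if the sentence should be excluded.

    is_legal_syllable is a hypothetical hook standing in for whatever
    illegal-token check the real tokenizer performs.
    """
    # Rule: letters from another script cause the sentence to be excluded.
    for ch in sentence:
        if not is_tibetan(ch) and unicodedata.category(ch).startswith("L"):
            return None
    # Rule: remaining non-Tibetan symbols are filtered out.
    cleaned = "".join(ch for ch in sentence if is_tibetan(ch))
    # Rule: any illegal token causes the sentence to be excluded.
    syllables = [s for s in cleaned.strip(SHAD).split(TSEK) if s]
    if not all(is_legal_syllable(s) for s in syllables):
        return None
    return cleaned
```

With this sketch, a sentence containing only parentheses as foreign material is returned with the parentheses removed, while a sentence containing Latin letters yields `None`.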
from bo_sent_tokenizer import segment
text = "ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\n ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ། ངའི་མིང་ལ་Thomas་ཟེར། ཁྱེད་དེ་རིང་(བདེ་མོ་)ཡིན་ནམ།"
segmented_text = segment(text)
print(segmented_text) #Output:> 'ཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་ནམ།\nཁྱེད་དེ་རིང་བདེ་མོ་ཡིན་བབབབབབབབནམ།\nངའི་མིང་ལ་ ་ཟེར།\nཁྱེད་དེ་རིང(བདེ་མོ་)ཡིན་ནམ།\n'
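Both examples print a single string with one sentence per line. If a list of sentences is more convenient for downstream data preparation, the string can be split on newlines. The helper name `to_sentences` and the placeholder input below are assumptions made for this sketch; they are not part of the package.

```python
def to_sentences(output):
    """Split a newline-delimited result string into a list of non-empty sentences."""
    return [line for line in output.splitlines() if line.strip()]


# Placeholder standing in for the string returned by tokenize() or segment().
sample_output = "first sentence\nsecond sentence\n"
print(to_sentences(sample_output))  # ['first sentence', 'second sentence']
```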
Closing Punctuation: Characters in the Tibetan language that symbolize the end of a sentence, similar to a full stop in English.
Opening Punctuation: Characters in the Tibetan language that symbolize the start of a sentence.
- Preprocessing: All carriage returns and newlines are removed from the string.
- Splitting into Parts: The preprocessed text is then split at closing punctuation using a regular expression.
- Joining the Parts:
- Empty parts are ignored.
- In some cases, closing punctuation appears immediately after opening punctuation, so care is taken not to split these instances.
Example of a valid Tibetan sentence: ༄༅།།བོད་ཀྱི་གསོ་བ་རིག་པའི་གཞུང་ལུགས་དང་དེའི་སྐོར་གྱི་དཔྱད་བརྗོད།
- ༄༅ = opening punctuation
- །། = closing punctuation
- Filtering Text: Only Tibetan characters and a few predefined symbols are retained; all other characters are removed.
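The four steps can be approximated with the following minimal sketch. It is not the package's code: the function name `sketch_segment`, the punctuation sets, and the allowed-symbol set are assumptions chosen for illustration (the real definitions live in vars.py), and "Tibetan character" is taken to mean any code point in the Unicode Tibetan block. The sketch returns a list instead of a newline-joined string and does not try to reproduce the exact output of `segment`.

```python
import re

# Assumed subsets for illustration only; see vars.py for the real definitions.
CLOSING_PUNCT = "\u0f0d\u0f0e"  # shad, nyis shad
OPENING_PUNCT = "\u0f04\u0f05"  # initial marks that open a text
ALLOWED_SYMBOLS = "()"          # hypothetical "predefined symbols"


def sketch_segment(text):
    # 1. Preprocessing: drop carriage returns and newlines.
    text = text.replace("\r", "").replace("\n", "")

    # 2. Splitting: the capture group keeps each run of closing punctuation.
    parts = re.split(f"([{CLOSING_PUNCT}]+)", text)

    # 3. Joining: skip empty parts, and do not end a sentence when the closing
    #    punctuation directly follows opening punctuation.
    sentences, current = [], ""
    for part in parts:
        if not part:
            continue
        current += part
        is_closing = all(ch in CLOSING_PUNCT for ch in part)
        only_marks = all(ch in OPENING_PUNCT + CLOSING_PUNCT for ch in current.strip())
        if is_closing and not only_marks:
            sentences.append(current.strip())
            current = ""
    if current.strip():
        sentences.append(current.strip())

    # 4. Filtering: keep Tibetan-block characters and the allowed symbols.
    def keep(ch):
        return "\u0f00" <= ch <= "\u0fff" or ch in ALLOWED_SYMBOLS

    return ["".join(ch for ch in s if keep(ch)) for s in sentences]
```

Keeping the closing punctuation as its own part (via the capture group) is what makes the joining step simple: each sentence ends as soon as a punctuation part arrives, unless everything collected so far is only opening and closing marks.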
Note:
- Closing punctuation, opening punctuation, and predefined symbols are defined in the file vars.py.
- For a better understanding of the code, refer to the test cases in test_segmenter.py.