Provide options in the tokenize function
abedkhooli opened this issue · 8 comments
This looks promising for Arabic tokenization. Not an issue, but it would be great to provide options in the tokenizer, e.g. remove tashkeel and filter out non-Arabic words in mixed text.
Thanks.
The suggested options can be obtained by combining the available functions. What do you think?
True, but it is more convenient to have options. I plan to tokenize articles from the Arabic Wikipedia, where the text contains English words and some Arabic words carry harakat (diacritics). Punctuation will be removed thanks to Gensim's minimum token length, but it would be good to add removal as an option.
Hard-coding these options in the tokenizer complicates things. Instead, I suggest adding two optional arguments to the function: condition and morph.
condition: to pass a function that returns a boolean value
morph: to pass a function that changes the shape of the word
def tokenize(text=u"", condition=False, morph=False):
    if text == u'':
        return []
    else:
        tokens = TOKEN_PATTERN.split(text)
        tokens = [TOKEN_REPLACE.sub('', x) for x in tokens if TOKEN_REPLACE.sub('', x)]
        if condition:
            tokens = [x for x in tokens if condition(x)]
        if morph:
            tokens = [morph(x) for x in tokens]
        return tokens
To remove tashkeel and filter out non-Arabic words:
text = "ِاسمٌ الكلبِ في اللغةِ الإنجليزية Dog واسمُ الحمارِ Donky"
tokenize(text, condition=is_arabicrange, morph=strip_tashkeel)
>> ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'واسم', 'الحمار']
This structure will enable us to create functions on the fly and pass them:
text = "طلع البدر علينا من ثنيات الوداع"
tokenize(text, condition=lambda x: x.startswith('ال'))
>> ['البدر', 'الوداع']
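For readers who want to try the proposal end to end, here is a self-contained sketch. The regexes and the two helpers are simplified stand-ins of my own (the real pyarabic patterns and its is_arabicrange/strip_tashkeel are more complete), so the exact tokens may differ from the library's output:

```python
import re

# Simplified stand-ins -- the real library patterns are richer (assumption).
TOKEN_PATTERN = re.compile(r"\s+")                  # split on whitespace
TOKEN_REPLACE = re.compile(r"[^\w\u0600-\u06FF]+")  # drop stray punctuation
TASHKEEL = re.compile(r"[\u064B-\u0652]")           # fathatan .. sukun

def strip_tashkeel(word):
    """Remove Arabic diacritics (tashkeel) from a word."""
    return TASHKEEL.sub('', word)

def is_arabicrange(word):
    """True if every character lies in the main Arabic Unicode block."""
    return all('\u0600' <= ch <= '\u06FF' for ch in word)

def tokenize(text=u"", condition=False, morph=False):
    if text == u'':
        return []
    tokens = TOKEN_PATTERN.split(text)
    tokens = [TOKEN_REPLACE.sub('', x) for x in tokens if TOKEN_REPLACE.sub('', x)]
    if condition:
        # keep only tokens satisfying the predicate
        tokens = [x for x in tokens if condition(x)]
    if morph:
        # reshape each surviving token
        tokens = [morph(x) for x in tokens]
    return tokens

text = u"اسمٌ الكلبِ في اللغةِ الإنجليزية Dog"
print(tokenize(text, condition=is_arabicrange, morph=strip_tashkeel))
# ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية']
```

Note that the condition runs before the morph, so is_arabicrange sees the word with its diacritics still attached; that works here because tashkeel characters also fall inside the Arabic block.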
How about making both of them lists, with [] as the default, in case one needs more than one condition (restrict to Arabic, keep or remove numbers, exclude stop words, etc.)?
Great idea! Here is the code:
def tokenize(text="", conditions=[], morphs=[]):
    if text:
        # Tolerate a single condition and/or morph being passed
        # without enclosing it in a list.
        if not isinstance(conditions, list):
            conditions = [conditions]
        if not isinstance(morphs, list):
            morphs = [morphs]
        tokens = TOKEN_PATTERN.split(text)
        tokens = [TOKEN_REPLACE.sub('', tok) for tok in tokens if TOKEN_REPLACE.sub('', tok)]
        if conditions:
            tokens = [tok for tok in tokens if all(cond(tok) for cond in conditions)]
        if morphs:
            def morph(tok):
                # apply the morphs in order, feeding each one's output to the next
                for m in morphs:
                    tok = m(tok)
                return tok
            tokens = [morph(tok) for tok in tokens]
        return tokens
    else:
        return []
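To see the list form in action, here is a self-contained sketch combining two conditions in one call. The patterns are my simplified stand-ins for the library's, and the stop-word set is a tiny hypothetical one for illustration:

```python
import re

# Simplified stand-ins for the library's patterns (assumption).
TOKEN_PATTERN = re.compile(r"\s+")
TOKEN_REPLACE = re.compile(r"[^\w\u0600-\u06FF]+")

def tokenize(text="", conditions=[], morphs=[]):
    if not text:
        return []
    # tolerate a bare callable instead of a list
    if not isinstance(conditions, list):
        conditions = [conditions]
    if not isinstance(morphs, list):
        morphs = [morphs]
    tokens = TOKEN_PATTERN.split(text)
    tokens = [TOKEN_REPLACE.sub('', tok) for tok in tokens if TOKEN_REPLACE.sub('', tok)]
    if conditions:
        # a token survives only if every condition accepts it
        tokens = [tok for tok in tokens if all(cond(tok) for cond in conditions)]
    for m in morphs:
        # chain the morphs in order
        tokens = [m(tok) for tok in tokens]
    return tokens

STOPWORDS = {u'من', u'في'}  # tiny illustrative stop-word list

text = u"طلع البدر علينا من ثنيات الوداع"
print(tokenize(text, conditions=[lambda x: x not in STOPWORDS,
                                 lambda x: len(x) > 3]))
# ['البدر', 'علينا', 'ثنيات', 'الوداع']
```

Here the stop-word filter drops من and the length filter drops طلع; any number of further conditions (or morphs) can be appended to the lists without touching the function.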
Looks good if it passes tests. One good source of text for tests is the Arabic Wikipedia dumps.
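Before running it over Wikipedia dumps, a few pytest-style checks of the argument handling could serve as a starting point. This sketch uses my own simplified stand-in patterns, so it exercises the conditions/morphs plumbing rather than the library's real regexes:

```python
import re

# Simplified stand-in patterns (assumption).
TOKEN_PATTERN = re.compile(r"\s+")
TOKEN_REPLACE = re.compile(r"[^\w\u0600-\u06FF]+")

def tokenize(text="", conditions=[], morphs=[]):
    if not text:
        return []
    if not isinstance(conditions, list):
        conditions = [conditions]
    if not isinstance(morphs, list):
        morphs = [morphs]
    tokens = TOKEN_PATTERN.split(text)
    tokens = [TOKEN_REPLACE.sub('', t) for t in tokens if TOKEN_REPLACE.sub('', t)]
    if conditions:
        tokens = [t for t in tokens if all(c(t) for c in conditions)]
    for m in morphs:
        tokens = [m(t) for t in tokens]
    return tokens

# pytest-style checks of the argument handling
def test_empty_text():
    assert tokenize("") == []

def test_bare_condition_is_wrapped():
    # a single callable should work without being wrapped in a list
    assert tokenize(u"ab abc", conditions=lambda t: len(t) > 2) == [u"abc"]

def test_morphs_apply_in_order():
    # upper-case first, then append a marker
    assert tokenize(u"ab", morphs=[str.upper, lambda t: t + "!"]) == [u"AB!"]
```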
The examples will then look like this, after adapting them to conditions and morphs.
To remove tashkeel and filter out non-Arabic words:
text = "ِاسمٌ الكلبِ في اللغةِ الإنجليزية Dog واسمُ الحمارِ Donky"
tokenize(text, conditions=is_arabicrange, morphs=strip_tashkeel)
>> ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'واسم', 'الحمار']
This structure will enable us to create functions on the fly and pass them:
text = "طلع البدر علينا من ثنيات الوداع"
tokenize(text, conditions=lambda x: x.startswith('ال'))
>> ['البدر', 'الوداع']
I added the new features to the tokenize function in the new version 0.6.3.
Thanks a lot.