linuxscout/pyarabic

Provide options in the tokenize function

abedkhooli opened this issue · 8 comments

This looks promising for Arabic tokenization. Not an issue, but it would be great to provide options in the tokenizer, e.g. remove tashkeel and filter out non-Arabic words in mixed text.

Thanks.
The suggested options can be obtained by combining the available functions. What do you think?
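For example, a minimal sketch that chains the existing helpers (assuming tokenize, is_arabicrange, and strip_tashkeel are all imported from pyarabic.araby):

text = u"اسمٌ الكلبِ في اللغةِ الإنجليزية Dog"
# tokenize first, then keep only Arabic tokens and strip the diacritics
tokens = [strip_tashkeel(tok) for tok in tokenize(text)
          if is_arabicrange(tok)]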

True, but it is more convenient to have options. I plan to tokenize articles from the Arabic Wikipedia, where the text has English words and some Arabic words carry 7arakaat (diacritics). Punctuation will be removed thanks to Gensim's minimum token length, but it would be good to add removal as an option.

Hard-coding these options in the tokenizer complicates things. Instead, I suggest adding two optional arguments to the function: condition and morph.

condition : a function that returns a boolean value; a token is kept only if it returns True
morph : a function that changes the shape of the word

def tokenize(text=u"", condition=False, morph=False):
    # TOKEN_PATTERN and TOKEN_REPLACE are the compiled regexes already
    # defined at module level in pyarabic.araby
    if not text:
        return []
    tokens = TOKEN_PATTERN.split(text)
    tokens = [TOKEN_REPLACE.sub('', tok) for tok in tokens
              if TOKEN_REPLACE.sub('', tok)]
    if condition:
        # keep only the tokens for which the condition holds
        tokens = [tok for tok in tokens if condition(tok)]
    if morph:
        # reshape each remaining token
        tokens = [morph(tok) for tok in tokens]
    return tokens

To remove tashkeel and filter out non-Arabic words:

text = "ِاسمٌ الكلبِ في اللغةِ الإنجليزية Dog واسمُ الحمارِ Donky"
tokenize(text, condition=is_arabicrange, morph=strip_tashkeel)

>> ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'واسم', 'الحمار']

This structure will enable us to create functions on the fly and pass them:

text = "طلع البدر علينا من ثنيات الوداع"
tokenize(text, condition=lambda x: x.startswith('ال'))

>> ['البدر', 'الوداع']

How about making both of them lists, with [] as the default, in case one needs more than one condition (restrict to Arabic, keep or remove numbers, exclude stop words, etc.)?

Great idea! Here is the code:

def tokenize(text="", conditions=[], morphs=[]):
    if not text:
        return []
    # be tolerant: allow a single condition and/or morph to be passed
    # without having to wrap it in a list
    if not isinstance(conditions, list):
        conditions = [conditions]
    if not isinstance(morphs, list):
        morphs = [morphs]

    tokens = TOKEN_PATTERN.split(text)
    tokens = [TOKEN_REPLACE.sub('', tok) for tok in tokens
              if TOKEN_REPLACE.sub('', tok)]

    if conditions:
        # keep a token only if every condition holds for it
        tokens = [tok for tok in tokens
                  if all(cond(tok) for cond in conditions)]
    if morphs:
        def morph(tok):
            # apply each morph to the token in order
            for m in morphs:
                tok = m(tok)
            return tok
        tokens = [morph(tok) for tok in tokens]
    return tokens

Looks good if it passes the tests. One good source of text for tests is the Arabic Wikipedia dumps.
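For instance, a minimal sanity check along these lines (a sketch reusing the example from this thread, not an actual pyarabic test):

def test_tokenize_conditions_and_morphs():
    # expected output taken from the example earlier in this thread
    text = u"ِاسمٌ الكلبِ في اللغةِ الإنجليزية Dog واسمُ الحمارِ Donky"
    expected = [u'اسم', u'الكلب', u'في', u'اللغة',
                u'الإنجليزية', u'واسم', u'الحمار']
    assert tokenize(text, conditions=is_arabicrange,
                    morphs=strip_tashkeel) == expected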

After modifying the function to take conditions and morphs, the examples look like this.
To remove tashkeel and filter out non-Arabic words:

text = "ِاسمٌ الكلبِ في اللغةِ الإنجليزية Dog واسمُ الحمارِ Donky"
tokenize(text, conditions=is_arabicrange, morphs=strip_tashkeel)

>> ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'واسم', 'الحمار']

This structure will enable us to create functions on the fly and pass them:

text = "طلع البدر علينا من ثنيات الوداع"
tokenize(text, conditions=lambda x: x.startswith('ال'))

>> ['البدر', 'الوداع']
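With lists, several conditions compose. For example, to keep Arabic-range tokens while also excluding a stop-word set (the set here is hypothetical, for illustration only):

stopwords = {u'في', u'من', u'على'}
text = u"طلع البدر علينا من ثنيات الوداع"
tokenize(text, conditions=[is_arabicrange,
                           lambda tok: tok not in stopwords])

>> ['طلع', 'البدر', 'علينا', 'ثنيات', 'الوداع']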

I added the new features to the tokenize function in the new version 0.6.3.
Thanks a lot.
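For reference, with the released version the call should look like this (a sketch, assuming the pyarabic.araby import path):

from pyarabic import araby

text = u"ِاسمٌ الكلبِ في اللغةِ الإنجليزية Dog واسمُ الحمارِ Donky"
araby.tokenize(text, conditions=araby.is_arabicrange,
               morphs=araby.strip_tashkeel)

>> ['اسم', 'الكلب', 'في', 'اللغة', 'الإنجليزية', 'واسم', 'الحمار']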