meilisearch/charabia

Arabic script: Implement specialized Segmenter

ManyTheFish opened this issue · 9 comments

Currently, the Arabic script is segmented on whitespace and punctuation.

Drawback

Following the dedicated discussion on Arabic language support and the linked issues, agglutinated words are not segmented. For example, in this comment:

the agglutinated word الشجرة => The Tree is a combination of الـ and شجرة
الـ is equivalent to The, and it is always connected (not space-separated) to the next word.

Enhancement

We should find a specialized segmenter for the Arabic script, or, failing that, a dictionary we can use to implement our own segmenter inspired by the Thai segmenter.
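Something like the rough sketch below could be a starting point. The word list, function name, and greedy longest-match strategy are purely illustrative; this does not use charabia's actual Segmenter trait, and a real implementation would need a proper Arabic word list.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation over a word dictionary, similar in
/// spirit to the dictionary lookup used by the Thai segmenter.
/// `dictionary` would come from an external Arabic word list.
fn segment_with_dictionary<'a>(text: &'a str, dictionary: &HashSet<&str>) -> Vec<&'a str> {
    let mut segments = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Find the longest prefix of `rest` that is in the dictionary.
        let mut matched = None;
        for (end, _) in rest.char_indices().skip(1).chain([(rest.len(), ' ')]) {
            if dictionary.contains(&rest[..end]) {
                matched = Some(end);
            }
        }
        match matched {
            Some(end) => {
                segments.push(&rest[..end]);
                rest = &rest[end..];
            }
            None => {
                // Unknown word start: emit the single character and move on.
                let ch_len = rest.chars().next().unwrap().len_utf8();
                segments.push(&rest[..ch_len]);
                rest = &rest[ch_len..];
            }
        }
    }
    segments
}

fn main() {
    // Tiny illustrative dictionary: the definite article and "tree".
    let dictionary: HashSet<&str> = ["ال", "شجرة"].into_iter().collect();
    // "الشجرة" (the tree) is split into its two components.
    assert_eq!(segment_with_dictionary("الشجرة", &dictionary), vec!["ال", "شجرة"]);
}
```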


Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

I'll preface this by clarifying that I have very little experience in this domain space. I'm just a brown person who wants to help keep his languages alive.

I studied the PR to add Hebrew support and was able to grok that a lot of the heavy lifting was being done by niqqud.

I had some time this weekend, so I wrote a library that does the same thing as niqqud, but for languages using the Arabic alphabet: tashkil.

I was thinking of basically walking through the PR to this repo by @benny-n and adapting it for Arabic, Pashto, Dari, Urdu, etc. From my understanding, it may not be perfect, but it would still represent an improvement, in the same way that using niqqud improved Hebrew search results.
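For context, what niqqud does for Hebrew and what tashkil does for Arabic is essentially stripping the optional vowel marks (harakat) so that vocalized and unvocalized spellings match. Here is a rough, self-contained sketch of the idea; the function name and the exact character range handled are mine, not taken from either library, which cover more marks than this.

```rust
/// Remove the main Arabic vowel diacritics (harakat), keeping base letters.
/// U+064B..=U+0652 covers fathatan through sukun; real libraries handle a
/// wider set of combining marks than this sketch does.
fn strip_harakat(input: &str) -> String {
    input
        .chars()
        .filter(|c| !('\u{064B}'..='\u{0652}').contains(c))
        .collect()
}

fn main() {
    // "شَجَرَة" (tree, fully vocalized) normalizes to "شجرة".
    assert_eq!(strip_harakat("شَجَرَة"), "شجرة");
}
```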

Before I start working on this though, I just wanted to check if this would be accepted, since it seems like there is a lot of prior discussion on how to approach Arabic and I'm not quite sure what the consensus is right now.

@LGUG2Z, this is not the right issue for your proposal; this issue is about word segmentation, basically "how to split an Arabic text into words", not normalization.
I read your Reddit post about your work, but we can't really use your library because it conflicts with another normalizer that removes nonspacing marks for all languages. However, we don't currently remove format characters, including the tatweel/kashida character.
Implementing a normalizer following the model of the one that removes nonspacing marks would enhance Arabic language support.
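A minimal sketch of that idea, assuming a plain string-to-string function rather than charabia's actual Normalizer trait:

```rust
/// Drop the Arabic tatweel/kashida character (U+0640), a purely cosmetic
/// elongation mark, so that "شجـرة" and "شجرة" normalize to the same form.
fn remove_tatweel(input: &str) -> String {
    input.chars().filter(|&c| c != '\u{0640}').collect()
}

fn main() {
    assert_eq!(remove_tatweel("شجـرة"), "شجرة");
}
```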
If you want to start the implementation, I'll ask you to create a dedicated issue on this repository.

See you!

The Thai segmenter referred to by @ManyTheFish depends on nlpO3, an NLP library that includes a Thai tokenizer.

I searched for site:github.com arabic tokenizer rust on neeva.com, and I couldn't find any Arabic tokenizer in Rust. It especially seems like one doesn't exist, given that this user (dedicated to collecting anything related to NLP/ML for the Arabic language) doesn't have any repositories in Rust.

But I'm not sure: do we need an Arabic tokenizer? Perhaps an Arabic dictionary file is all we need?

This seems like an exhaustive list of anything related to Arabic NLP on GitHub. Given that Rust is absent there, it seems like a new Arabic dictionary file (and NLP system) must be created from scratch.

Hello @amab8901 and @aljabr,

Thank you for your suggestions, I will study them!

Thanks.

For a clearer demo:

https://ar-php.org/github/examples/ar_query.php

There's a query demo; I think it just splits off some characters and builds a regular expression around the query.

Regarding the segmenter: in Arabic, the starting ال is, in roughly 95% of words (I assume), equivalent to the.
As you can see here, when I search for الدوري, which means the league, I get results. However, when I search for دوري, which means league, I get no results.
[Screenshots: search results are returned for الدوري but none for دوري]

In some other words, like البانيا, which means Albania, the starting ال is not equivalent to the; it is part of the word. But that will not make any significant difference in the search results.
If I search for بانيا, which corresponds to bania, and get results, it's not a big deal.

But if I search for دوري, which corresponds to league, and get no results, while there are many results for الدوري, which corresponds to the league, it is a big deal.

So I think any word starting with ال should be segmented into two words, the one with ال and the one without ال, and the search should be done on both of them.

So I think any word starting with ال should be segmented into two words, the one with ال and the one without ال, and the search should be done on both of them.

If this is a practical solution, we could start implementing it immediately.
I am not sure whether Meilisearch's internal engine can handle segmenting one word into two words!

To make it clear: a word like الشجرة (TheTree) should be segmented into الشجرة (thetree) and شجرة (tree).
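A rough sketch of what I mean; the helper name and the minimum-length guard are just illustrative, not charabia's actual segmenter API:

```rust
/// If a word starts with the definite article "ال" and has more after it,
/// return both the original word and the form without the prefix, so a
/// search for "دوري" can also match documents containing "الدوري".
fn segment_definite_article(word: &str) -> Vec<&str> {
    const AL: &str = "ال";
    match word.strip_prefix(AL) {
        // Only split when something meaningful remains after the prefix.
        Some(rest) if rest.chars().count() >= 2 => vec![word, rest],
        _ => vec![word],
    }
}

fn main() {
    // "الشجرة" (the tree) yields both "الشجرة" and "شجرة".
    assert_eq!(segment_definite_article("الشجرة"), vec!["الشجرة", "شجرة"]);
    // "ال" alone, or very short remainders, are left untouched.
    assert_eq!(segment_definite_article("ال"), vec!["ال"]);
}
```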