Arabic script: Implement specialized Segmenter
ManyTheFish opened this issue · 9 comments
Currently, the Arabic script is segmented on whitespace and punctuation.
Drawback
Following the dedicated discussion on Arabic Language support and the linked issues, agglutinated words are not segmented. For example, as noted in this comment:
the agglutinated word الشجرة (=> The Tree) is a combination of الـ and شجرة.
الـ is equivalent to The, and it's always connected (not space-separated) to the next word.
Enhancement
We should find a specialized segmenter for the Arabic script, or else find a dictionary and implement our own segmenter inspired by the Thai segmenter.
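As a rough illustration of what a dictionary-backed segmenter could look like, here is a minimal greedy longest-match sketch in Rust. The function name and the tiny two-entry dictionary are assumptions made for illustration; this is not the Thai segmenter's actual algorithm or charabia's API:

```rust
use std::collections::HashSet;

// Hypothetical dictionary-backed segmenter: greedy longest-match from the
// left, in the spirit of dictionary-based segmenters. Illustration only.
fn segment<'a>(text: &'a str, dict: &HashSet<&str>) -> Vec<&'a str> {
    let mut words = Vec::new();
    let mut start = 0;
    while start < text.len() {
        // Try the longest dictionary entry first, shrinking the candidate.
        let mut end = text.len();
        let mut matched = None;
        while end > start {
            if text.is_char_boundary(end) && dict.contains(&text[start..end]) {
                matched = Some(end);
                break;
            }
            end -= 1;
        }
        // Fall back to emitting a single character when nothing matches.
        let next = matched.unwrap_or_else(|| {
            start + text[start..].chars().next().unwrap().len_utf8()
        });
        words.push(&text[start..next]);
        start = next;
    }
    words
}

fn main() {
    // Tiny illustrative dictionary: the article "ال" and the noun "شجرة".
    let dict: HashSet<&str> = ["ال", "شجرة"].into_iter().collect();
    assert_eq!(segment("الشجرة", &dict), vec!["ال", "شجرة"]);
}
```

With such an approach, the quality of the segmentation depends entirely on the coverage of the word list, which is the open question raised in this issue.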
Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝
I'll preface this by clarifying that I have very little experience in this domain space. I'm just a brown person who wants to help keep his languages alive.
I studied the PR to add Hebrew support and was able to grok that a lot of the heavy lifting was being done by niqqud.
I had some time this weekend, so I wrote a library that does the same thing as niqqud, but for languages using the Arabic alphabet: tashkil.
I was thinking of basically walking through the PR to this repo by @benny-n and adapting it for Arabic, Pashto, Dari, Urdu, etc. From my understanding, it may not be perfect, but it will still represent an improvement in the same way that using niqqud represented an improvement in Hebrew search results.
Before I start working on this though, I just wanted to check if this would be accepted, since it seems like there is a lot of prior discussion on how to approach Arabic and I'm not quite sure what the consensus is right now.
@LGUG2Z, this is not the right issue for your proposal; this issue concerns word segmentation, basically "how to split an Arabic text into words", and not normalization.
I read your Reddit post about your work, and we can't really use your library because it conflicts with another Normalizer that removes nonspacing marks for all languages. However, we don't yet remove format characters, including tatweel/kashida characters.
Implementing a normalizer following the model of the one removing the nonspacing marks would enhance the Arabic Language support.
If you want to start the implementation, I'll ask you to create a dedicated issue on this repository.
See you!
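A normalizer along the lines suggested above, modeled on the nonspacing-mark one but targeting the tatweel/kashida format character (U+0640), could be sketched as follows. The function name is hypothetical and this is not charabia's Normalizer trait, just the core idea:

```rust
// Hypothetical normalizer sketch: strip tatweel/kashida (U+0640), a format
// character that only stretches Arabic letters and carries no lexical
// meaning. Nonspacing marks (harakat) are assumed to be handled by the
// existing nonspacing-mark normalizer.
fn remove_tatweel(text: &str) -> String {
    text.chars().filter(|&c| c != '\u{0640}').collect()
}

fn main() {
    // "كـــتاب" is "كتاب" (book) stretched with three tatweel characters.
    assert_eq!(remove_tatweel("كـــتاب"), "كتاب");
}
```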
The Thai segmenter referred to by @ManyTheFish depends on nlpO3, an NLP library that includes a Thai tokenizer.
I searched for site:github.com arabic tokenizer rust on neeva.com, and I couldn't find any Arabic tokenizer written in Rust. It especially seems like one doesn't exist, given that this user (dedicated to collecting anything related to Arabic NLP/ML) doesn't have any repositories in Rust.
But I'm not sure: do we need an Arabic tokenizer? Perhaps an Arabic dictionary file is all we need?
This seems like an exhaustive list of anything related to Arabic NLP on GitHub. Given that Rust is absent there, it seems a new Arabic dictionary file (and NLP system) would have to be created from scratch.
Thanks!
For a clearer demo, see:
https://ar-php.org/github/examples/ar_query.php
There's a query demo; I think it just splits on some characters and builds a regular expression around the query.
Regarding the segmenter: in Arabic, the starting ال is (I assume) equivalent to the in almost 95% of words.
As you can see here, when I search for الدوري (which means the league) I get results. However, when I search for دوري (which means league) I get no results.
Meanwhile, in some other words, like البانيا (which means Albania), the starting ال is not equal to the but is part of the word itself. Still, that won't make any significant difference in the search results: if I search for بانيا (which equals bania) and get results, it's not a big deal.
But if I search for دوري (which equals league) and get no results, while there are many results for الدوري (which equals the league), it is a big deal.
So I think any word starting with ال should be segmented into two words, one with ال and one without ال, and the search should be done on both of them.
If this is a practical solution, we could start implementing it immediately. I am not sure whether Meilisearch's internal engine can segment one word into two words!
To make it clear: a word like الشجرة (TheTree) should be segmented into both الشجرة (thetree) and شجرة (tree).
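The prefix-splitting proposal above could be sketched like this in Rust. The function name is hypothetical and this is not charabia's actual API; it assumes the pipeline could emit two tokens for one word, which is the open question raised above:

```rust
// Hypothetical sketch of the proposal: when a word starts with the definite
// article "ال", produce both the full word and the bare stem, so a query
// for either form can match. Illustration only, not charabia's API.
fn expand_definite_article(word: &str) -> Vec<&str> {
    match word.strip_prefix("ال") {
        // Keep both forms so that searching either one matches.
        Some(stem) if !stem.is_empty() => vec![word, stem],
        _ => vec![word],
    }
}

fn main() {
    assert_eq!(expand_definite_article("الشجرة"), vec!["الشجرة", "شجرة"]);
    assert_eq!(expand_definite_article("دوري"), vec!["دوري"]);
}
```

Note that, as pointed out above, a word like البانيا (Albania) would also be split into a spurious stem بانيا, which the proposal treats as an acceptable trade-off.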