Training on Arabic language

Question

lecidhugo opened this issue 4 years ago · 3 comments

Hello,
Is there any document or guide on how to train on Arabic?
Is this possible? If yes, what are the requirements?

Thanks in advance,

Answer 1 · 2020-12-03T23:58:39.000Z

You can create the resources for a new language with https://github.com/kermitt2/grisp
The readme describes the process. It's a Hadoop process that takes a few hours.

Once that's done, you can start an environment for Arabic with entity-fishing; the knowledge base will be built automatically. Then you need to train a ranker and a selector model as described here -> https://nerd.readthedocs.io/en/latest/train.html#training-with-wikipedia
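
As a rough sketch of that training step: the small Python helper below only builds and prints the Gradle command for a given language code, it does not run anything. The task name `train_wikipedia` and the `-Plang` property are assumptions based on the training page linked above, and `train_command` is a hypothetical helper name, so check the docs before relying on it.

```python
import shlex


def train_command(lang: str) -> list[str]:
    """Build the Gradle invocation for Wikipedia-based training (hypothetical helper)."""
    # Accept only plain 2-3 letter ASCII language codes such as "ar" or "en".
    if not (lang.isascii() and lang.isalpha() and len(lang) in (2, 3)):
        raise ValueError(f"expected a 2-3 letter language code, got {lang!r}")
    # ASSUMPTION: the task `train_wikipedia` and the `-Plang` property are taken
    # from the linked training docs; verify them against your checkout.
    return ["./gradlew", "train_wikipedia", f"-Plang={lang.lower()}"]


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    print(shlex.join(train_command("ar")))
```

Run the printed command from the entity-fishing project root once the Arabic resources from grisp are in place.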

Loading markupFull is the time-consuming part: that DB stores the full text content of all the articles.

You don't need to create embeddings if I remember correctly; it should work without them. However, they improve the disambiguation a bit. Creating them is also quite time-consuming (it should take about half a day for Arabic given the number of articles).

There are 1,080,907 articles in Arabic, which is a pretty large number, so it should be doable and give decent results.

Answer 2 · 2020-12-08T14:45:31.000Z

Thank you very much for your kind reply!
I will try to do it.

Answer 3 · 2022-05-03T21:15:25.000Z

Note that Arabic is now supported by default, with already trained models and KB resources available, see the documentation.