kermitt2/entity-fishing

Training on Arabic language

lecidhugo opened this issue · 3 comments

Hello,
Is there any document or guide on how to train on Arabic?
Is this possible? If yes, what are the requirements?

Thanks in advance,

Hello @lecidhugo !

You can create the resources for a new language with https://github.com/kermitt2/grisp
The readme describes the process. It's a Hadoop process that is going to take a few hours.

Once done, you can start an environment for Arabic with entity-fishing; the knowledge base will be built automatically. Then you need to train a ranker and a selector model, as described here: https://nerd.readthedocs.io/en/latest/train.html#training-with-wikipedia

Loading markupFull is the time-consuming part: it is the DB that stores the full text content of all the articles.

You don't need to create embeddings, if I remember correctly; it should work without them. However, they improve the disambiguation a bit. Creating them is also quite time-consuming (it should take about half a day for Arabic, given the number of articles).

There are 1,080,907 articles in Arabic, which is a pretty large number, so it should be doable and provide decent results.

Thank you very much for your kind reply!
I will try to do it.

Note that Arabic is now supported by default, with pre-trained models and KB resources available; see the documentation.