Training on Arabic language
lecidhugo opened this issue · 3 comments
Hello,
Is there any document or guide on how to train on Arabic?
Is this possible? If yes, what are the requirements?
Thanks in advance,
Hello @lecidhugo !
You can create the resources for a new language with https://github.com/kermitt2/grisp
The readme describes the process. It's a Hadoop process that takes a few hours.
Once done, you can start an environment for Arabic with entity-fishing; the knowledge base will be built automatically. Then you need to train a ranker and a selector model, as described here -> https://nerd.readthedocs.io/en/latest/train.html#training-with-wikipedia
Loading the markupFull DB is the time-consuming part, as it stores the full text content of all the articles.
You don't need to create embeddings if I remember correctly; it should work without them. However, they improve the disambiguation a bit. Creating them is also quite time-consuming (it should take about half a day for Arabic, given the number of articles).
There are 1,080,907 articles in Arabic, which is a fairly large number, so it should be doable and give decent results.
Thank you very much for your kind reply!
I will try to do it
Note that Arabic is now supported by default, with already trained models and KB resources available, see the documentation.
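For anyone who wants to check quickly that Arabic works on their own installation, here is a minimal sketch of a disambiguation request. It is only an illustration under assumptions: the URL (local service on port 8090, path /service/disambiguate), the multipart field name "query" and the query fields "text" and "language" are taken from my reading of the entity-fishing REST documentation, so please verify them there. The helper names build_query and disambiguate are hypothetical, and the requests package is a third-party dependency.

```python
import json

# Assumption: a local entity-fishing service on its default port.
# Adjust the host, port and path to match your own deployment.
SERVICE_URL = "http://localhost:8090/service/disambiguate"


def build_query(text, lang="ar"):
    """Build the JSON query string for a text disambiguation request.

    The field names ("text", "language", "lang") are assumed from the
    entity-fishing REST documentation.
    """
    query = {"text": text, "language": {"lang": lang}}
    # Keep Arabic characters readable instead of \u escapes.
    return json.dumps(query, ensure_ascii=False)


def disambiguate(text, lang="ar", url=SERVICE_URL):
    """Send the query as a multipart form field named "query" (assumed)."""
    import requests  # third-party; imported here so build_query works without it

    response = requests.post(url, files={"query": (None, build_query(text, lang))})
    response.raise_for_status()
    return response.json()
```

With a running service, calling disambiguate with a short Arabic sentence should return the JSON response of the service. build_query can be used on its own to inspect the payload before sending anything.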