kermitt2/entity-fishing

Training entity-fishing for other languages

rautant opened this issue · 5 comments

Can the training instructions (https://nerd.readthedocs.io/en/latest/train.html) be used to train models in languages that are currently not supported? Or are they just for retraining with supported languages?
I would like to have entity-fishing for the Finnish language. How could I do that?
I noticed that there is a Japanese branch (https://github.com/kermitt2/entity-fishing/tree/japanese).

Hi @rautant, in principle adding a new language requires a preprocessing step that is not yet documented (it needs some more review first).

However, regarding Finnish (and in general this is the criterion for supporting a new language or not), the coverage of the Wikipedia edition (number of articles) is too low. As you can see here: https://meta.wikimedia.org/wiki/List_of_Wikipedias#1_000_000+_articles, Finnish has fewer than 500,000 articles, which is likely to result in poor disambiguation performance.

That's too bad. It is mentioned in the 'Train and evaluate' section that only a sample of documents is used for training. How big is this sample?

Hello! Another issue with Finnish is the very rich morphology of the language: we would probably need a dedicated morphological analyzer to get enough mentions and candidates prior to disambiguation.

Usually 5,000 Wikipedia articles are used for training the ranker model, and 500 for the selection model.

It is certainly true that Finnish is very morphologically rich, but there are existing solutions for this already. I have tried FinnPos (https://github.com/mpsilfve/FinnPos) for lemmatization and the results seem great. So you could convert all words to their base form before using NERD, and maybe even handle compound words. If you publish documentation for training in other languages one day, I would still consider trying it.
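A minimal sketch of this idea, assuming a hypothetical `lemmatize` helper (standing in for a FinnPos call or any other Finnish lemmatizer) and a query in the JSON shape used by the entity-fishing `/service/disambiguate` endpoint:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class LemmatizeBeforeNerd {

    // Hypothetical helper: in practice this would call FinnPos (or another
    // Finnish lemmatizer) and return the base form of each token.
    static String lemmatize(String token) {
        return token; // placeholder; a real lemmatizer would map e.g. "taloissa" -> "talo"
    }

    public static void main(String[] args) {
        String text = "Suomen talvet ovat kylmiä.";

        // Lemmatize token by token before building the disambiguation query.
        String lemmatized = Arrays.stream(text.split("\\s+"))
                .map(LemmatizeBeforeNerd::lemmatize)
                .collect(Collectors.joining(" "));

        // Query in the JSON shape used by /service/disambiguate;
        // "fi" is of course not a supported language yet.
        String query = "{ \"text\": \"" + lemmatized + "\","
                + " \"language\": { \"lang\": \"fi\" } }";

        System.out.println(query);
    }
}
```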

There are solutions for sure; my point was rather that supporting Finnish won't be just a matter of importing the fiwiki data and retraining. It has an impact on the current pipeline/architecture, both on the side of parsing and loading all the Wikipedia data (label statistics are kept at the word-form level, not for lemmas, so we would need to lemmatize all the fiwiki articles) and on the side of processing incoming text (as we would need to lemmatize that text too).
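A toy sketch of that word-form vs. lemma mismatch (not the actual label-statistics code; the map content is made up for illustration): statistics collected from anchors store surface forms, so an inflected mention in incoming text only finds its candidates if both sides are lemmatized consistently.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LabelStatisticsMismatch {
    public static void main(String[] args) {
        // Toy label statistics built from anchor texts in a Wikipedia dump:
        // surface form -> candidate pages (word-form level).
        Map<String, List<String>> labelToCandidates = new HashMap<>();
        labelToCandidates.put("talo", List.of("Talo (rakennus)"));

        // An inflected mention from incoming Finnish text.
        String mention = "taloissa"; // inessive plural of "talo"

        // Word-form lookup misses, even though the entity is known.
        System.out.println(labelToCandidates.get(mention)); // null

        // Only after lemmatizing both the anchors and the incoming text
        // (e.g. with FinnPos) does the lookup succeed.
        String lemma = "talo"; // what a lemmatizer would return for "taloissa"
        System.out.println(labelToCandidates.get(lemma));   // [Talo (rakennus)]
    }
}
```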