shilad/wikibrain

Exception PipelineLoader -l en,de,it


Hi,

I am running PipelineLoader on a powerful server to train wikapidia with the languages en, de and it. While org.wikapidia.dao.load.WikiTextLoader is executing, the shell logs the attached exception several times, but the program keeps running and moves on to other tasks. Should I stop the execution, or is this a known warning and the program will still finish normally?
Also, do you have any idea how much RAM I need to allocate to the JVM to be sure the training works properly?

Thanks in advance

exception

These parsing errors look scary, but they are "normal". They indicate malformed wikitext (i.e. human errors). I just added issue #160 to make these messages more palatable.

I've successfully loaded the languages you mention with ~30GB of memory, but less may work. One important note: H2 didn't scale well for me on a dataset this large, so I switched to Postgres. I added details about how to do this to the README: https://github.com/shilad/wikAPIdia/blob/master/README.md#using-external-databases
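
To confirm how much heap the JVM actually received, a small check like the one below may help. It is only a sketch and not part of wikAPIdia: the class name `HeapCheck` and the helper `toMiB` are made up for illustration. It assumes you start the JVM with the standard `-Xmx` flag (for example `-Xmx30g`) and simply prints what `Runtime.maxMemory()` reports.

```java
public class HeapCheck {
    /** Converts a byte count to whole mebibytes, rounding down. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Maximum heap the JVM will attempt to use. Depending on the
        // garbage collector, this may be somewhat below the -Xmx value.
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMiB(maxHeapBytes) + " MiB");
    }
}
```

Running it with the same memory flags you plan to use for the loader, e.g. `java -Xmx30g HeapCheck`, shows whether the setting took effect.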

Let me know how it goes!

Hi shilad, and thanks for your answer.
Sorry, could you tell me where the default reference.conf is located?
I probably just have to create an override.conf and put it in the wikAPIdia root dir, right?
Thanks again

Good question: https://github.com/shilad/wikAPIdia/blob/master/wikAPIdia-core/src/main/resources/reference.conf. I've also linked it in the README now.

Re: override.conf. You're right. I'm not sure about working directories, though. To be safe you should specify the override.conf's absolute (not relative) path to the EnvBuilder.
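
One way to sidestep the working-directory question is to resolve the path yourself before handing it over. The sketch below uses only the standard `java.nio.file` API; the class name `OverrideConfPath` and the helper `toAbsolute` are hypothetical, and the EnvBuilder call is deliberately left out because its exact method is not shown in this thread.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class OverrideConfPath {
    /** Resolves a possibly relative location against the current working directory. */
    static Path toAbsolute(String location) {
        return Paths.get(location).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        // Default to override.conf in the working directory when no argument is given.
        Path conf = toAbsolute(args.length > 0 ? args[0] : "override.conf");
        if (!Files.isRegularFile(conf)) {
            System.err.println("No config file found at " + conf);
            System.exit(1);
        }
        // Pass this absolute path string to the EnvBuilder (call not shown here).
        System.out.println(conf);
    }
}
```

Printing the resolved path first also makes it obvious which directory the JVM treats as its working directory.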

Thanks for being a beta tester for us!