shilad/wikibrain

DumpLoader articles pattern bans use of some parent folder names

vpmalley opened this issue · 0 comments

I am trying to use the wikibrain project from within my own project. So far, so good.
When using the GUI to load a language, I got an exception.
Nothing too bad: it is only due to the name I picked for my project. I am trying to set up the DB in a subfolder of my project, which is located at **/spring-wikibrain/**.

In the GUI loader command output, I got:

INFO: processing file: ../download/simple/20141025/simplewiki-20141025-pages-articles.xml.bz2
Exception in thread "main" java.lang.IllegalArgumentException: unknown langCode: 'spring-'
        at org.wikibrain.core.lang.Language.getByLangCode(Language.java:100)
        at org.wikibrain.core.cmd.FileMatcher.getLanguage(FileMatcher.java:209)
        at org.wikibrain.loader.DumpLoader.load(DumpLoader.java:80)
        at org.wikibrain.loader.DumpLoader.main(DumpLoader.java:255)

Searching the code, the folder name does indeed conflict with the pattern used to determine the language of the downloaded articles (in org.wikibrain.core.cmd.FileMatcher):

ARTICLES ("articles",
    Pattern.compile(".*?([a-zA-Z_-]+)wiki.+-pages-articles(\\d+)\\.xml-.+\\.bz2"),
    Pattern.compile(".*?([a-zA-Z_-]+)wiki.+-pages-articles.xml.bz2")),
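The behaviour can be reproduced outside wikibrain with a minimal sketch like the one below. It assumes the pattern is applied to the full path of the dump file with `matches()`, which I have not verified in FileMatcher; the class name, the method name and the `/home/user/...` prefixes are made up for illustration. Only the second (non-numbered) pattern is exercised.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArticlesPatternDemo {
    // Second ARTICLES pattern, copied as-is from the snippet above.
    static final Pattern ARTICLES =
            Pattern.compile(".*?([a-zA-Z_-]+)wiki.+-pages-articles.xml.bz2");

    // Returns what the capture group picks up, or null if the path does not match.
    // Assumption: the whole path (not just the file name) is matched.
    static String capturedLangCode(String path) {
        Matcher m = ARTICLES.matcher(path);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String file = "download/simple/20141025/simplewiki-20141025-pages-articles.xml.bz2";
        // Hypothetical parent folders, for illustration only.
        System.out.println(capturedLangCode("/home/user/project/" + file));          // simple
        System.out.println(capturedLangCode("/home/user/spring-wikibrain/" + file)); // spring-
    }
}
```

Because `.*?` is lazy, the group latches onto the first run of letters, `_` or `-` in the path that is directly followed by `wiki`, which here is the `spring-` part of the folder name rather than `simple` in the file name.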

Nothing too bad: I renamed my folder so that its path no longer contains anything matching [a-zA-Z_-]+wiki. Still, this could become a bigger issue further down the road.
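One possible way to avoid the conflict, sketched here as a suggestion rather than a patch, would be to match only the file name instead of the whole path. The class and method names below are hypothetical and not part of wikibrain. The pattern is my own variant of the second ARTICLES pattern: the leading `.*?` is dropped and the dots before `xml` and `bz2` are escaped.

```java
import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameOnlyMatch {
    // Variant of the second ARTICLES pattern (an assumption, not the project's code):
    // no leading ".*?" and literal dots before "xml" and "bz2".
    static final Pattern ARTICLES =
            Pattern.compile("([a-zA-Z_-]+)wiki.+-pages-articles\\.xml\\.bz2");

    // Extracts the language code from the last path component only,
    // so parent folder names can no longer be captured.
    static String langCodeFromFileName(String path) {
        String name = new File(path).getName();
        Matcher m = ARTICLES.matcher(name);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String file = "download/simple/20141025/simplewiki-20141025-pages-articles.xml.bz2";
        // Hypothetical parent folder, for illustration only.
        System.out.println(langCodeFromFileName("/home/user/spring-wikibrain/" + file)); // simple
    }
}
```

With this approach a parent folder such as `spring-wikibrain` no longer influences the detected language, since only `simplewiki-20141025-pages-articles.xml.bz2` is seen by the pattern.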