dbpedia/extraction-framework

Minidump for Wikidata -> several issues

kurzum opened this issue · 6 comments

I updated the minidump with ./createMinidump.sh and noticed the following:

  1. Irregular paths:
    URLs from the URI list are downloaded as individual files into subfolders:
LANGUAGE wikidata.org/wiki/Q75135502
TARGET: ../resources/minidumps/wikidata.org/wiki/Q75135502/wiki.xml
  2. Not merged:
    All Wikidata test pages are downloaded into many separate files, but they should be merged into one big file (see the sketch after this list).

  3. Test failure:
    Due to the missing wiki.xml.bz2 in ../resources/minidumps/wikidata.org/, there was a file-not-found exception and the tests failed.

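For reference, here is a rough sketch of how the individual page files could be merged into the single wiki.xml.bz2 the tests expect. It assumes each Q*/wiki.xml is a MediaWiki export file, so only the `<page>` elements are copied and the outer `<mediawiki>` wrapper is kept once; the paths are taken from the output above and may need adjusting.

```bash
#!/usr/bin/env bash
# Sketch only: merge per-page exports into one dump-style file, then compress.
out=../resources/minidumps/wikidata/wiki.xml
first=$(ls ../resources/minidumps/wikidata.org/wiki/Q*/wiki.xml | head -n 1)

# Keep the export header (everything before the first <page>) exactly once.
sed -n '1,/<page>/p' "$first" | sed '$d' > "$out"

# Append the <page>...</page> block from every downloaded file.
for f in ../resources/minidumps/wikidata.org/wiki/Q*/wiki.xml; do
  sed -n '/<page>/,/<\/page>/p' "$f" >> "$out"
done

echo '</mediawiki>' >> "$out"
bzip2 -f "$out"   # produces ../resources/minidumps/wikidata/wiki.xml.bz2
```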
In general I am wondering whether this was working at all. It is so different from the rest, and no care seems to have been taken to make it follow the existing structure.
Please remove the individual wiki.xml files; there should be one big Wikidata file. Note that there is a wikidata folder already, and now a wikidata.org folder has been introduced as well.
It is quite confusing.

Now I know why these problems occurred. During GSoC I was downloading pages from the Wikidata dump manually, without using ./createMinidump.sh. At the time I did not understand how to use it and thought there was no difference between downloading part of the dump, copying a page out of it and pasting it into a wiki.xml file, versus using ./createMinidump.sh with uris.lst to download the page. But now I see that downloading manually is not a good idea. Today I tried to fix it, but the problem is that Wikidata URLs are not covered by the regexes in ./createMinidump.sh. As I understand it, the script only covers URLs with a pattern like {someLanguage}.wikipedia.org, so it works fine for Wikimedia Commons and Wikipedia, but for Wikidata it fails because URLs like wikidata.org/wiki/Q75135502 are not matched. At the moment I do not know how and where to add the necessary regexes in ./createMinidump.sh because I am new to bash.
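For illustration, a pattern match along the following lines might be what is needed; the variable names are only placeholders, not the actual ones used in ./createMinidump.sh:

```bash
# Sketch only: accept wikidata.org URLs in addition to {lang}.wikipedia.org ones.
# $uri, $domain and $page are illustrative variable names.
if [[ "$uri" =~ ^https?://([a-z-]+)\.wikipedia\.org/wiki/(.+)$ ]]; then
  domain="${BASH_REMATCH[1]}.wikipedia.org"
  page="${BASH_REMATCH[2]}"
elif [[ "$uri" =~ ^https?://(www\.)?wikidata\.org/wiki/(Q[0-9]+)$ ]]; then
  domain="wikidata.org"
  page="${BASH_REMATCH[2]}"
fi
```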

Hi @jlareck,
Well, it can't stay like this. There are two options:

  1. You learn bash. In principle bash is the best tool for data science, as you can rapidly prototype anything. You should definitely learn it; otherwise many tasks will become 10 times slower.
  2. We remove the bash script and turn it into a Scala launcher.

@Vehnem I would prefer option 2 in this case, as we would also like to implement advanced features with Jena/SPARQL.

Hi all,
I will try to learn bash, but it will take some time. I can also help with the implementation of the Scala launcher if you work on it.

But at the moment I think we just need to delete the wikidata.org folder so that the tests pass, and keep the wikidata folder with the existing wiki.xml.bz2 that I created manually during GSoC.

Deleting the folder fixes nothing; it will be generated again the next time someone runs ./createMinidump.sh.
I will pull it nevertheless.

We can also delete the Wikidata URLs from uris.lst, so the folder will not be created when someone runs ./createMinidump.sh again. But I think we need a separate list (for example wikidataUris.lst) that contains only Wikidata page URLs. This list could then be used when upgrading ./createMinidump.sh or developing the Scala launcher.
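A minimal sketch of that split, assuming both lists simply contain one URL per line:

```bash
# Sketch: move the Wikidata page URLs into their own list so the existing
# script only sees the URLs its regexes already handle.
grep    'wikidata\.org/wiki/' uris.lst > wikidataUris.lst
grep -v 'wikidata\.org/wiki/' uris.lst > uris.lst.tmp && mv uris.lst.tmp uris.lst
```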