Minidump for Wikidata -> several issues
kurzum opened this issue · 6 comments
I updated the minidump with ./createMinidump.sh and noticed the following:
- irregular paths: URLs from the URI list are downloaded into individual files in subfolders:
  LANGUAGE wikidata.org/wiki/Q75135502
  TARGET: ../resources/minidumps/wikidata.org/wiki/Q75135502/wiki.xml
- not merged: all Wikidata tests are loaded into many separate files, but they should be merged into one big file
- test failure: due to the missing wiki.xml.bz2 in ../resources/minidumps/wikidata.org/, there was a file-not-found exception and the tests failed
In general, I am wondering whether this was working at all. It is so different from the rest, and no care seems to have been taken to match the previous structure.
Please remove the individual wiki.xml files; there should be one big Wikidata file. Note that there is already a wikidata folder, and now a wikidata.org folder has been introduced as well. It is quite confusing.
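For reference, the layout now looks roughly like this (paths taken from the output above):

```
resources/minidumps/
├── wikidata/                 <- pre-existing folder (has the wiki.xml.bz2)
└── wikidata.org/             <- newly introduced by the script
    └── wiki/
        └── Q75135502/
            └── wiki.xml      <- one file per downloaded page
```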
Branch issue-669-Minidump_for_Wikidata_several_issues created!
So I know why these problems occurred. During GSoC I was downloading pages from the Wikidata dump manually, without using ./createMinidump.sh. At that time I didn't understand how to use it and thought there was no difference between downloading part of the dump, copying a page out of it, and pasting it into a wiki.xml file versus using ./createMinidump.sh with uri.list to download the page. But now I see that downloading manually is not a good idea. Today I tried to fix it, but the problem is that Wikidata URLs are not covered by the regexes in ./createMinidump.sh. As I understand it, the script only covers URLs with a pattern like {someLanguage}.wikipedia.org, which works fine for Wikipedia and Wikimedia Commons, but it doesn't work for Wikidata because it doesn't cover URLs like wikidata.org/wiki/Q75135502. And at the moment I don't know how and where to add the necessary regexes in ./createMinidump.sh, because I am new to bash.
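Roughly, the situation is this (a Scala sketch with illustrative patterns, not the script's actual bash regexes, which may differ in detail):

```scala
object UriPatterns {
  // Illustrative approximation of what the script covers today:
  // a subdomain before wikipedia.org / wikimedia.org is required.
  val subdomainPattern = """https?://([a-z-]+)\.(wikipedia|wikimedia)\.org/wiki/(.+)""".r
  // The kind of extra pattern Wikidata would need: no language subdomain,
  // entity pages like Q75135502.
  val wikidataPattern  = """https?://(?:www\.)?wikidata\.org/wiki/(Q\d+)""".r

  def classify(uri: String): String = uri match {
    case subdomainPattern(lang, site, page) => s"$site ($lang): $page"
    case wikidataPattern(entity)            => s"wikidata: $entity"
    case _                                  => "NOT COVERED"
  }

  def main(args: Array[String]): Unit =
    Seq(
      "https://en.wikipedia.org/wiki/Berlin",
      "https://commons.wikimedia.org/wiki/Category:Berlin",
      "https://www.wikidata.org/wiki/Q75135502"
    ).foreach(u => println(s"$u -> ${classify(u)}"))
}
```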
Hi @jlareck,
well, it can't stay like this. I see two options:
1. You learn bash. In principle, bash is the best tool for data science, since you can rapidly prototype anything. You should definitely learn it; otherwise many tasks will become ten times slower.
2. We remove the bash script and make a Scala launcher out of it.

@Vehnem I would prefer 2 in this case, as we would also like to implement advanced features with Jena/SPARQL.
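A minimal sketch of what such a launcher could look like; the object name, the uris.lst layout, the target paths, and the use of MediaWiki's Special:Export are assumptions, not a finished design (and Jena/SPARQL features would come on top):

```scala
import java.io.File
import java.net.URL
import scala.io.Source
import scala.sys.process._

// Sketch: read full page URLs from uris.lst, group them per wiki host, and
// download each wiki's pages into a single per-wiki file, matching the
// "one folder, one merged file per wiki" layout discussed above. Error
// handling, merging the <page> elements under one <mediawiki> root, and
// bz2 compression are left out.
object MinidumpLauncher {

  // Rewrite a page URL into the MediaWiki Special:Export URL for that page.
  def exportUrl(pageUri: String): URL = {
    val url  = new URL(pageUri)
    val page = url.getPath.stripPrefix("/wiki/")
    new URL(s"${url.getProtocol}://${url.getHost}/wiki/Special:Export/$page")
  }

  def main(args: Array[String]): Unit = {
    val uris = Source.fromFile("uris.lst").getLines().map(_.trim).filter(_.nonEmpty).toList
    uris.groupBy(uri => new URL(uri).getHost).foreach { case (host, pageUris) =>
      val targetDir = new File(s"../resources/minidumps/$host")
      targetDir.mkdirs()
      val target = new File(targetDir, "wiki.xml")
      pageUris.foreach { uri =>
        // Append each exported page to the per-wiki file (naive concatenation;
        // a real implementation would merge the XML properly).
        (exportUrl(uri) #>> target).!
      }
    }
  }
}
```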
Hi all,
I will try to learn bash, but it will take some time. I can also help with the implementation of that Scala launcher if you work on it. For the moment, though, I think we just need to delete the wikidata.org folder so that the tests pass, and keep the wikidata folder with the existing wiki.xml.bz2 that I created manually during GSoC.
Deleting the folder fixes nothing; it will be generated again the next time someone runs ./createMinidump.sh. I will pull it nevertheless, though.
We can also delete the Wikidata URLs from uris.lst, so that the folder is not created again when someone runs ./createMinidump.sh. But I think we then need a separate list (for example wikidataUris.lst) containing only Wikidata page URLs. That list could be used later, when upgrading ./createMinidump.sh or developing the Scala launcher.
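For example, wikidataUris.lst would initially contain just the entity page from this issue, in whatever URL form uris.lst already uses:

```
wikidata.org/wiki/Q75135502
```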