itkach/slob

Wikimedia data dumps

opk12 opened this issue · 2 comments

opk12 commented

The Create from MediaWiki sites section of README.md does not mention https://meta.wikimedia.org/wiki/Data_dumps, even though Wikimedia publishes database dumps for all of its wikis, including Wikipedia and Wiktionary, updated once or twice a month. Importing the dumps is faster and lighter on resources than crawling, and crawlers appear to be rate-limited.

The README does not mention it because the tool to convert MediaWiki data dumps into slob hasn't been created, and likely never will be. Tools for slob's predecessor did work with MediaWiki data dumps (using mwlib) and the results were decent, but they never could fully match how MediaWiki renders the same content, and over time the gap only grew, especially with the introduction of Lua-based templates and the move of Infobox and other data elements to a separate database.

The practical reality is that the only software that can render MediaWiki content properly is MediaWiki itself. Don't let this section fool you: I guarantee you none of these are even close to being adequate. Yes, getting rendered articles via MediaWiki takes a while (mwscrape tries to be respectful and is deliberately limited to minimize the burden it may put on Wikipedia servers), but it works well enough and is fast enough for all practical purposes.
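For reference, "getting rendered articles via MediaWiki" boils down to asking the site's Action API to parse a page and return its HTML. A minimal sketch of that kind of request (this is only an illustration, not mwscrape's actual code; the endpoint and User-Agent string are just examples) could look like this:

```python
# Minimal sketch of fetching server-rendered article HTML via the public
# MediaWiki Action API (action=parse). This illustrates the kind of request
# a crawler has to make once per article; it is not mwscrape's internals.
import time
import requests

API_URL = "https://en.wikipedia.org/w/api.php"  # any MediaWiki site works

def fetch_rendered_html(title: str) -> str:
    """Return the rendered HTML for one article title."""
    resp = requests.get(
        API_URL,
        params={
            "action": "parse",
            "page": title,
            "prop": "text",
            "format": "json",
            "formatversion": 2,
        },
        # Identify yourself; example contact address only.
        headers={"User-Agent": "slob-example/0.1 (contact: example@example.org)"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["parse"]["text"]

if __name__ == "__main__":
    for title in ["Dictionary", "Wiktionary"]:
        html = fetch_rendered_html(title)
        print(title, "->", len(html), "characters of HTML")
        time.sleep(1)  # be polite: throttle requests
```

One request per article, throttled, is exactly why a full-site scrape takes days, which is what makes pre-rendered dumps interesting in the first place.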

While browsing https://meta.wikimedia.org/wiki/Data_dumps I stumbled upon a new type of dump that appears to have been published only for the past few months: https://dumps.wikimedia.org/other/enterprise_html/. These include rendered article HTML. This is a different story! These may be a viable alternative; I will look into it.
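To get a feel for what these dumps contain, here is a rough sketch of iterating over the articles in one of them. It assumes the layout described on the dumps site: a .json.tar.gz archive of NDJSON files, one JSON object per article, with the title under "name" and the rendered markup under "article_body" -> "html". The file name and field names are assumptions to verify against the actual dump documentation.

```python
# Sketch of reading articles out of a Wikimedia Enterprise HTML dump.
# Assumed layout: a tar.gz archive of NDJSON files, one JSON object per
# line, with "name" (title) and "article_body" -> "html" (rendered HTML).
import json
import tarfile

DUMP_PATH = "enwiki-NS0-ENTERPRISE-HTML.json.tar.gz"  # example file name

def iter_articles(path):
    """Yield (title, html) pairs from an Enterprise HTML dump archive."""
    with tarfile.open(path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fileobj = tar.extractfile(member)
            if fileobj is None:
                continue
            for line in fileobj:  # NDJSON: one article per line
                record = json.loads(line)
                body = record.get("article_body") or {}
                yield record.get("name"), body.get("html")

if __name__ == "__main__":
    # Peek at the first few records to confirm the structure.
    for i, (title, html) in enumerate(iter_articles(DUMP_PATH)):
        print(title, "->", 0 if html is None else len(html), "characters of HTML")
        if i >= 4:
            break
```

If the HTML in these dumps is close enough to what MediaWiki serves, feeding it into a slob writer would sidestep both the rendering problem and the crawling time entirely.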