macbre/mediawiki-dump

Document how to read a dump from a local file

fenopa opened this issue · 7 comments

Is this not possible?

It would be great to have such an option. It would also allow reading dumps from other wikis, such as the OSM Wiki; see https://wiki.openstreetmap.org/wiki/Wiki#Wiki_Dumps_.2F_Export

Let's introduce a new MediaWikiDumpFile class:

    from mediawiki_dump.dumps import MediaWikiDumpFile

Actually, it's already there :-) The LocalFileDump class is your friend here. I'll update the README to describe that case as well.

https://github.com/macbre/mediawiki-dump/blob/master/mediawiki_dump/dumps.py#L181-L196

@matkoniecz, can you check the following code with the OSM wiki dump that you've mentioned?

    from mediawiki_dump.dumps import LocalWikipediaDump
    from mediawiki_dump.reader import DumpReader

    dump = LocalWikipediaDump(dump_file="path/to/osm.dump.xml.bz2")
    reader = DumpReader()

    pages = reader.read(dump)

Traceback (most recent call last):
  File "/home/mateusz/Documents/install_moje/OSM_software/fetch_osm_wiki/dump_reader.py", line 16, in <module>
    for page in reader.read(dump):
  File "/home/mateusz/.local/lib/python3.10/site-packages/mediawiki_dump/reader.py", line 247, in read
    for chunk in dump.get_content():
  File "/home/mateusz/.local/lib/python3.10/site-packages/mediawiki_dump/dumps.py", line 144, in get_content
    yield decompressor.decompress(chunk)
OSError: Invalid data stream

I also tried unpacking the file separately and pointing to the unpacked file as the input. Looking at the file itself, I see no obvious corruption.
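
A quick way to narrow down an `Invalid data stream` error like the one above is to check whether the file being passed in is actually bzip2-compressed. The sketch below uses only the standard library and relies on the fact that bzip2 streams start with the bytes `BZh`. The helper name `looks_like_bzip2` is made up for illustration and is not part of mediawiki-dump.

```python
def looks_like_bzip2(path: str) -> bool:
    """Return True if the file starts with the bzip2 magic bytes."""
    with open(path, "rb") as handle:
        # Every bzip2 stream begins with the three bytes "BZh".
        return handle.read(3) == b"BZh"
```

If this returns False for a file handed to a class that decompresses its input, the decompressor is being fed plain XML (or some other format), which would be consistent with the error in the traceback.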

@macbre Do you have a known example of such a download working? Maybe I should download a small Wikipedia dump and check whether it can be loaded from a file for me?

Is it working for you?

It does, see #226. Please make sure that the XML file you're reading is not compressed.
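
If the dump at hand is only available as a `.bz2` archive, one option is to decompress it to plain XML first. This is a minimal sketch using only the Python standard library; the helper name `decompress_bz2` and the file paths are made up for illustration. The commented-out follow-up assumes that `LocalFileDump` accepts a `dump_file` keyword like the snippet earlier in this thread, which is not confirmed here.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: str, dst: str) -> Path:
    """Stream-decompress a bzip2 file to dst without loading it all into memory."""
    with bz2.open(src, "rb") as compressed, open(dst, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return Path(dst)


# Hypothetical follow-up (keyword argument is an assumption, see above):
# xml_path = decompress_bz2("path/to/osm.dump.xml.bz2", "path/to/osm.dump.xml")
# dump = LocalFileDump(dump_file=str(xml_path))
```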