attardi/wikiextractor

fails on the first file

vsraptor opened this issue · 2 comments

INFO: Preprocessed 22100000 pages
INFO: Preprocessed 22200000 pages
INFO: Loaded 738901 templates in 4795.6s
INFO: Starting page extraction from enwiki-latest-pages-articles.xml.bz2.
INFO: Using 7 extract processes.
Process ForkProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/my/py38/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 494, in reduce_process
output.write(ordering_buffer.pop(next_ordinal))
File "/my/py38/lib/python3.8/site-packages/wikiextractor/WikiExtractor.py", line 173, in write
self.file.write(data)
File "/usr/lib/python3.8/bz2.py", line 245, in write
compressed = self._compressor.compress(data)
TypeError: a bytes-like object is required, not 'str'
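In Python 3 a bz2.BZ2File opened for writing is a binary stream, so its write() accepts only bytes-like objects, which is exactly what this TypeError is about. A minimal sketch that reproduces the error in isolation (the filename is just an example):

    import bz2

    f = bz2.BZ2File('demo.bz2', 'w')
    f.write(b'bytes are accepted\n')          # OK
    f.write('a str raises the error above')   # TypeError: a bytes-like object is required, not 'str'
    f.close()

The write method that the traceback points at currently reads: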


    def write(self, data):
        self.reserve(len(data))
        if self.compress:
            self.file.write(data)
        else:
            self.file.write(data)

should be:

    def write(self, data):
        self.reserve(len(data))
        if self.compress:
            self.file.write(data.encode('utf8'))
        else:
            self.file.write(data)
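
Another option, sketched here only as an alternative to the encode() patch above, is to open the compressed output in text mode so that write() can keep passing str unchanged in both branches:

    def open(self, filename):
        if self.compress:
            # bz2 is already imported in WikiExtractor.py; 'wt' gives a
            # text-mode handle that encodes to UTF-8 internally
            return bz2.open(filename + '.bz2', mode='wt', encoding='utf-8')
        else:
            return open(filename, 'w')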
rxm commented

I had the same issue and @vsraptor's modification fixed it for me. Thank you for posting.

While you are in that file, you could also replace the Bzip2 output compression with Gzip (adding an import gzip under the import bz2 line). I try to work with compressed files downstream, and Gzip files are significantly faster to deal with, at a small price in size.

    def open(self, filename):
        if self.compress:
            # return bz2.BZ2File(filename + '.bz2', 'w')
            return gzip.GzipFile(filename + '.gz', mode='w')
        else:
            return open(filename, 'w')
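
For the downstream side, the resulting .gz files can then be read back in text mode with the standard library; a rough sketch (the path is just an example of where the extractor drops its output):

    import gzip

    # each output file holds articles wrapped in <doc ...> ... </doc> blocks
    with gzip.open('text/AA/wiki_00.gz', 'rt', encoding='utf-8') as f:
        for line in f:
            if line.startswith('<doc'):
                print(line.strip())   # one header line per extracted article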
rxm commented

I noticed that the last compressed file created (as given by NextFile) when using the --compressed flag is incomplete. I have tried flushes, closes, and scattered sleeps, but I have not yet found where the problem is (this is using bz2 compression). Any ideas?

I instrumented the OutputSplitter class and found that OutputSplitter.close() is not called for the last file. There are also a few extra writes to the last file. Wikiextractor is a multiprocess script that has several processes reading the dump and one reduce_process writing the results. When the reduce process runs out of things to write it terminates, leaving it to the calling process to close the OutputSplitter object, but by that point each process holds its own copy of the object, so the close never reaches the file the reduce process was writing. Adding an output.close() at the bottom of reduce_process closes the currently open file.

The output.close() fix also works when using gzip.GzipFile.
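
For reference, the effect can be reproduced outside wikiextractor: a compressed file opened in a worker process has to be closed in that same process, since after the fork the parent only holds its own copy of the object. A small sketch of that pattern (names and paths are arbitrary):

    import bz2
    import multiprocessing

    def writer(path):
        out = bz2.BZ2File(path, 'w')
        out.write(b'last chunk of extracted text\n')
        # without this close the compressor is never flushed and the
        # final .bz2 file is left truncated, as described above
        out.close()

    if __name__ == '__main__':
        p = multiprocessing.Process(target=writer, args=('wiki_last.bz2',))
        p.start()
        p.join()
        print(bz2.BZ2File('wiki_last.bz2').read())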