attardi/wikiextractor

Codec encoding errors in OutputSplitter

cBog opened this issue · 0 comments

cBog commented

I was trying to extract from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 on Ubuntu.

I got an error in the OutputSplitter:

INFO: Using 64 extract processes.
Process ForkProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "{redacted_path}/lib/python3.6/site-packages/wikiextractor/WikiExtractor.py", line 494, in reduce_process
    output.write(ordering_buffer.pop(next_ordinal))
  File "{redacted_path}/lib/python3.6/site-packages/wikiextractor/WikiExtractor.py", line 175, in write
    self.file.write(data)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 2809: ordinal not in range(256)

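As I understand it, when `open()` is called in text mode without an `encoding` argument, Python uses the locale's preferred encoding, which can vary between systems (here it resolved to latin-1). I can force UTF-8 by setting LC_CTYPE=C.UTF-8, but I found that the following change also works: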

diff --git a/wikiextractor/WikiExtractor.py b/wikiextractor/WikiExtractor.py
index ff63783..f6aea70 100755
--- a/wikiextractor/WikiExtractor.py
+++ b/wikiextractor/WikiExtractor.py
@@ -181,7 +181,7 @@ class OutputSplitter():
         if self.compress:
             return bz2.BZ2File(filename + '.bz2', 'w')
         else:
-            return open(filename, 'w')
+            return open(filename, 'w', encoding='utf-8')


 # ----------------------------------------------------------------------

I don't have particularly high confidence in my understanding of this. Is there a reason not to make that change to prevent such issues for others?
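
For reference, here is a minimal sketch that reproduces the failure outside wikiextractor. It assumes the failing environment behaves like an explicit `encoding='latin-1'`, which is what the traceback suggests; `try_write` is a hypothetical helper name used only for this illustration, not something from the wikiextractor code.

```python
import locale
import os
import tempfile

# Encoding that open() falls back to in text mode when none is passed.
print("locale default:", locale.getpreferredencoding(False))

# Contains an en dash (U+2013), the character named in the traceback.
text = "1914\u20131918"


def try_write(encoding):
    """Write `text` to a temp file with the given encoding; True on success."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        return True
    except UnicodeEncodeError:
        return False
    finally:
        os.remove(path)


# Assumption: latin-1 stands in for the locale default on the failing machine.
print("latin-1:", try_write("latin-1"))
print("utf-8:", try_write("utf-8"))
```

If the first line prints something other than UTF-8 on the affected machine, that would be consistent with the explanation above, and passing `encoding='utf-8'` explicitly should make the output independent of the locale.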