attardi/wikiextractor

File output does not match stdout output in v3.0.6

adrianeboyd opened this issue · 0 comments

I noticed that the file output does not match the stdout output. It looks like the final article is missing in the file output, possibly due to buffering in a child process.

You can see the difference with -b 0 --json (wikiextractor v3.0.6, Linux or macOS, Python 3.8-3.10):

# 920 lines
wikiextractor --json --no-templates -b 0 -q tnwiki-20220301-pages-articles.xml.bz2 -o - | wc -l
# 919 lines, 921 files (first and last files are empty)
wikiextractor --json --no-templates -b 0 -q tnwiki-20220301-pages-articles.xml.bz2 -o output_dir
wc -l output_dir/*/*

The first empty file is just a minor bug in OutputSplitter, but the final empty file is a missing article. Even with other values of -b, the final article seems to be missing from the final file.

I'm not exactly sure what's going on with the output buffering, but it looks like a minimal fix is to flush or close the file at the end of reduce_process.
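To illustrate the kind of buffering effect I suspect, here is a minimal standalone sketch. It is not the wikiextractor code: `write_lines` and its `close_at_end` parameter are hypothetical names made up for this example. It only shows that small writes to a buffered Python file are not on disk until the file is flushed or closed, which is why an explicit close at the end of the writer seems like a plausible fix.

```python
import os
import tempfile


def write_lines(path, lines, close_at_end):
    """Write lines to path and return the file size seen on disk
    at the point where the writer function would return.

    Hypothetical helper for illustration only; not part of wikiextractor.
    """
    # Explicit buffer size and newline so the behaviour is the same everywhere.
    out = open(path, "w", buffering=8192, newline="\n")
    for line in lines:
        out.write(line + "\n")
    if close_at_end:
        # The suggested fix: flush/close before the writer finishes.
        out.close()
    size_on_disk = os.path.getsize(path)
    if not close_at_end:
        # Clean up for the demo; a process that exits abruptly
        # would never reach this point and the data would be lost.
        out.close()
    return size_on_disk


if __name__ == "__main__":
    articles = ['{"id": "1"}', '{"id": "2"}']
    with tempfile.TemporaryDirectory() as tmp:
        unflushed = write_lines(os.path.join(tmp, "a"), articles, close_at_end=False)
        flushed = write_lines(os.path.join(tmp, "b"), articles, close_at_end=True)
        print("bytes on disk without close:", unflushed)
        print("bytes on disk with close:", flushed)
```

If the process holding the file object exits without running normal cleanup, whatever is still in the buffer never reaches the file, which would match the missing final article.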