attardi/wikiextractor

Question: Cirrus Extractor vs. "normal" Extractor - which creates cleaner text?

PhilipMay opened this issue · 1 comment

Hi,

Which extractor do you think creates cleaner text: the Cirrus Extractor or the "normal" Extractor?

I am asking because I want to use the Wikipedia texts to train language models on them; see https://en.wikipedia.org/wiki/BERT_(language_model)

Thanks
Philip

adno commented

Hi,

I just finished (a first version of) a word list project based on the "normal" extractor and XML dumps. I managed to do a reasonably good job by adding additional cleanup, but if I were to start from scratch, I would use the cirrus dumps instead.

The output of the "normal" extractor is a mess (see #300) – you just cannot use it as is if you want clean text.

The cirrus dumps are already cleaned up, so only minimal processing is needed. That said, the current cirrus-extract.py script in this project doesn't work with current cirrus dumps, where articles have "_type":"_doc" (the script requires "_type":"page"). Also, even though the cirrus dump is already relatively clean (compared to wikiextractor output), it would be reasonable to do a little more cleanup than cirrus-extract.py does. This seems like a good start for doing that: https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py (Note that it's specifically for Japanese, so one would need to adjust it based on the target language.)
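To illustrate the "_type" issue, here is a minimal sketch (not the actual cirrus-extract.py code) of a reader that accepts both the old "_type":"page" and the newer "_type":"_doc" metadata. It assumes the usual layout of cirrus dumps (gzipped, alternating Elasticsearch bulk metadata and document lines); the function name and path handling are made up for the example:

```python
import gzip
import json

# Accept both the old and the new metadata type used in cirrus dumps.
ACCEPTED_TYPES = {"page", "_doc"}

def iter_cirrus(path):
    """Yield document JSON objects from a gzipped cirrus search dump."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        while True:
            meta_line = f.readline()
            if not meta_line.strip():
                break  # end of file (or trailing blank line)
            doc_line = f.readline()
            # Metadata lines look like {"index": {"_type": "_doc", "_id": ...}}
            meta = json.loads(meta_line).get("index", {})
            if meta.get("_type", "_doc") not in ACCEPTED_TYPES:
                continue
            yield json.loads(doc_line)
```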

There are also minor differences; for example, headings (and perhaps some other parts of the text included in the XML dumps/wikiextractor output) are omitted from the cirrus search dump text.

To sum up: If you need clean text, you can choose from the following options:

  1. Modify my script for word lists, wikipedia-word-frequency-clean, to clean up wikiextractor output. It should be super easy: just process the return values of remove_markup(line) as you need. (Note that the original English BERT language model by Google was trained on wikiextractor output with additional cleanup too.)

  2. Modify the script for Japanese BERT to clean up cirrus search dumps: https://github.com/cl-tohoku/bert-japanese/blob/main/make_corpus_wiki.py (Slightly more work, but I would bet on better results.)

  3. As a last resort, just use the "text":… of each page from the cirrus search dumps as-is. It will still be cleaner than wikiextractor output. (See the sketch after this list.)
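For option 3, here is a minimal sketch building on the iter_cirrus() helper from the snippet above. It assumes the documents carry "namespace" and "text" fields, as current cirrus content dumps do; the file names are placeholders:

```python
def dump_plain_text(dump_path, out_path):
    """Write the raw "text" of every main-namespace article to a plain-text file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for doc in iter_cirrus(dump_path):
            if doc.get("namespace") != 0:   # keep only articles (namespace 0)
                continue
            text = doc.get("text", "").strip()
            if text:
                out.write(text + "\n\n")    # blank line between articles

# Placeholder file names; point these at a downloaded cirrus content dump.
dump_plain_text("enwiki-YYYYMMDD-cirrussearch-content.json.gz", "wiki_text.txt")
```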

The good thing about wikiextractor is that you can modify it for custom processing of various Wikipedia markup (templates, links, etc.). But if all you need is clean text, it just doesn't cut it.