attardi/wikiextractor

Get all revisions content


Hi!
As far as I understand (and from trying the code), the current implementation assumes that the input dump file contains a single revision per page ID.
The historical dump files contain all revisions of each page, and when one of them is given as input, the code generates one long block of text per page without splitting it into revisions.
Is there a simple way to "force" the code to handle the different revisions of each page ID separately?
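
To illustrate the behaviour I am after, here is a minimal sketch (not wikiextractor code) that walks a history dump one revision at a time using only the Python standard library. The function names `iter_revisions` and `local_name` and the sample XML are made up for illustration, and I am assuming the usual MediaWiki export layout, where each `<page>` holds its own `<id>` followed by one or more `<revision>` elements.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed history dump: one page with two revisions.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>100</id>
      <contributor><username>A</username><id>7</id></contributor>
      <text>first version</text>
    </revision>
    <revision>
      <id>101</id>
      <text>second version</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream):
    """Yield (page_id, revision_id, text) for every <revision> in the dump."""
    page_id = None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        name = local_name(elem.tag)
        if event == "start":
            if name == "page":
                page_id = None  # a new page begins, forget the previous id
            continue
        if name == "id" and page_id is None:
            # Assumption: the page's own <id> closes before any <revision> starts.
            page_id = elem.text
        elif name == "revision":
            rev_id, text = None, ""
            for child in elem:  # direct children only, so contributor ids are skipped
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = child.text
                elif child_name == "text":
                    text = child.text or ""
            yield page_id, rev_id, text
            elem.clear()  # drop the revision text once it has been emitted
        elif name == "page":
            elem.clear()


if __name__ == "__main__":
    for page_id, rev_id, text in iter_revisions(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(page_id, rev_id, text)
```

Each yielded tuple could then go through the existing cleaning step separately, so the output would be keyed by (page ID, revision ID) rather than by page ID alone. The sketch does not try to bound memory for very large dumps beyond clearing each revision after use.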

Thank you!