This repo is a tool for building a dataset of training samples consisting of pairs of machine-generated and human-generated text.
Output samples are saved as JSON objects, one per line, and would look like this:
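A hypothetical example line (the exact field names are assumptions; each sample pairs the human-written article text with a machine-generated counterpart):

```json
{"id": "12345", "title": "Example article", "human_text": "First 1000 chars of the Wikipedia article ...", "machine_text": "Text generated by the model for the same title ..."}
```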
(NOT IMPLEMENTED YET) To download the database dumps, run `python download.py`.
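Since `download.py` is not implemented yet, here is a minimal sketch of what it might do, assuming the standard Wikimedia dumps mirror and the English Wikipedia pages-articles dump (the URL and output filename are assumptions):

```python
# download.py -- minimal sketch; the dump URL below is an assumption, adjust to the dumps actually used
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

def download(url: str = DUMP_URL, out_path: str = "enwiki-latest-pages-articles.xml.bz2") -> None:
    # Stream the compressed dump to disk (the dumps used here total ~17GB compressed).
    urllib.request.urlretrieve(url, out_path)
    print(f"Saved {url} to {out_path}")

if __name__ == "__main__":
    download()
```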
Download MediaWiki dataset dumps -> extract (title, text, id) from the dumps for articles whose text length (in chars) is > 1000 -> keep only the first 1000 chars of the text ->
save each extracted article as a JSON object on its own line (.jsonl format). A sketch of this extraction step is shown below.
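A minimal sketch of the extraction step, assuming the dump has already been parsed into (title, text, id) records (for example with a WikiExtractor-style preprocessor); function and file names here are assumptions:

```python
import json

MIN_CHARS = 1000   # keep only articles longer than this
KEEP_CHARS = 1000  # truncate kept articles to this many chars

def extract(records, out_path="articles.jsonl"):
    """records: iterable of dicts with 'title', 'text', and 'id' keys."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for rec in records:
            text = rec["text"]
            if len(text) > MIN_CHARS:
                sample = {"id": rec["id"], "title": rec["title"], "text": text[:KEEP_CHARS]}
                out.write(json.dumps(sample, ensure_ascii=False) + "\n")
                kept += 1
    return kept
```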
The extracted files are currently 3.08 GB in total (total compressed dump size ~17 GB).
Article count before filtering: 1,177,440 (text length > 200)
Article count after filtering: 1,061,896 (text length > 1000)
Time on GPU (large): 11 secs
Just for giggles, CPU time: 48 min