lajanugen/zeshel

Preprocessing method of dataset

ledw opened this issue · 3 comments

ledw commented

Hi @lajanugen,
Thanks for releasing the data and code! Great work.
I would like to ask about the preprocessing used to produce the dataset: how was the raw text from Wikia cleaned? Is there any chance that the raw text can be released? Thanks.

lajanugen commented

Hi @ledw,
Thank you for your interest!
We used wikiextractor (https://github.com/attardi/wikiextractor) to clean up the wikias and performed whitespace + punctuation tokenization.
Unfortunately, we don't plan to release the raw data because of license issues.
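
For readers unsure what "whitespace + punctuation tokenization" means in practice, here is a minimal Python sketch. It is an illustration only, not the script used to build the dataset: the function name `tokenize` and the exact regular expression are assumptions. It splits text on whitespace and emits each punctuation character as its own token.

```python
import re

# Assumed rule: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text on whitespace and separate punctuation into its own tokens."""
    return _TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(" ".join(tokenize("Hello, world!")))  # Hello , world !
```

Note that a scheme like this is lossy: once the tokens are joined with spaces, the original spacing around punctuation cannot be recovered exactly.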

ledw commented

Hi @lajanugen,
Thanks for your prompt reply!

ledw commented

Hi @lajanugen,
A follow-up question: when I asked for the raw text, I meant a version of the data in which the documents are untokenized, i.e. without the whitespace + punctuation tokenization applied. Would releasing that be feasible? Thanks!