Preprocessing method of dataset

Question

ledw opened this issue 5 years ago · 3 comments

Hi @lajanugen,
Thanks for releasing the data and code! Great work.
I would like to ask about the preprocessing used to produce the dataset: what method was used to clean the raw text from Wikia? Is there any chance that the raw text can be released? Thanks.

Answer 1 · 2019-11-20T13:52:38.000Z

Hi @ledw
Thank you for your interest!
We used wikiextractor (https://github.com/attardi/wikiextractor) to clean up the wikias and performed whitespace + punctuation tokenization.
We don't have plans to release the raw data unfortunately due to license issues.
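
For readers trying to reproduce the second step, here is a minimal sketch of what whitespace + punctuation tokenization could look like. The exact rules used for the dataset are not stated in this thread, so the regex and the `tokenize` helper below are assumptions for illustration, not the original preprocessing code.

```python
import re

# Assumed rule (not confirmed by the dataset authors): a token is either
# a run of word characters or a single non-space punctuation character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text on whitespace and separate punctuation into its own tokens."""
    return _TOKEN_RE.findall(text)


if __name__ == "__main__":
    # Joining tokens with spaces gives the whitespace-delimited form
    # that tokenized corpora are commonly stored in.
    print(" ".join(tokenize("Hello, world!")))  # prints: Hello , world !
```

Note that a rule like this is lossy: once punctuation is split off and whitespace is normalized, the original spacing cannot be recovered exactly from the tokenized text.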

Answer 2 · 2019-11-20T23:01:40.000Z

Hi @lajanugen,
Thanks for your prompt reply!

Answer 3 · 2019-11-21T19:11:02.000Z

Hi @lajanugen,
A follow-up question: when I asked for raw text, I meant releasing a form of the data in which the documents are the raw input, without whitespace + punctuation tokenization applied. Is that feasible? Thanks!