
wikiextractor bug


Thank you for releasing a useful dataset!

I also created a wikification dataset from Japanese Wikipedia and found two bugs in wikiextractor.
First, articles with a colon in the title, such as 未来日記-ANOTHER:WORLD-, are ignored. Second, some articles get the wrong page id, e.g. the page_id of 華麗なるファンタジア should be 3688400, but it comes out as 3688399.
Does this happen in your dataset, too?

If so, I can share my fixed code if you need it! I sent pull requests to wikiextractor, but they haven't been merged yet...
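For illustration only (this is not the code from the pull requests, and the actual cause inside wikiextractor is not confirmed in this thread): one way a title like 未来日記-ANOTHER:WORLD- could get dropped is a filter that treats any colon as a namespace separator. The sketch below contrasts that with a check against a list of known namespaces. The function names and the `KNOWN_NAMESPACES` set are hypothetical.

```python
# Hypothetical namespace prefixes, chosen for illustration.
# A real dump declares its own list of namespaces.
KNOWN_NAMESPACES = {"Template", "Category", "File", "Wikipedia", "Help"}


def naive_is_article(title):
    """Treat every title containing a colon as a non-article page."""
    return ":" not in title


def namespace_aware_is_article(title, namespaces=KNOWN_NAMESPACES):
    """Reject a title only if the text before the first colon is a known namespace."""
    prefix, sep, _rest = title.partition(":")
    # No colon at all, or a prefix that is not a namespace: keep the page.
    return not sep or prefix not in namespaces


title = "未来日記-ANOTHER:WORLD-"
print(naive_is_article(title))            # dropped by the naive check
print(namespace_aware_is_article(title))  # kept by the namespace-aware check
```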

By the way, can I write issues in Japanese?

Hi, thanks for your interest!
I just added a preprocessed dataset built from ja-wiki. Please check it out.
If I have time, I would like to create a dataset with a version of wikiextractor that includes your pull requests.
Currently, page ids are not dumped to the final dataset, but I'll check later.

It's fine to write in Japanese if you want.

Thank you for sharing the preprocessed data!

I confirmed that doc_title2sents in preprocessed_jawiki.zip does not contain any articles with a colon in the title.
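A check like this can be sketched as follows, assuming doc_title2sents is a mapping from article title to a list of sentences (the exact file format inside the zip is not described here, so loading is left out). The helper name `titles_with_colon` and the sample data are hypothetical.

```python
def titles_with_colon(doc_title2sents):
    """Return the titles that contain a half-width or full-width colon, sorted."""
    return sorted(
        title for title in doc_title2sents
        if ":" in title or "：" in title
    )


# Tiny stand-in for the real mapping, just to show the call.
sample = {
    "華麗なるファンタジア": ["..."],
    "未来日記-ANOTHER:WORLD-": ["..."],
}
print(titles_with_colon(sample))
```

An empty result on the real mapping would match the observation that no colon titles made it into the dataset.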
I'll inform you when my PR is merged. Thanks.