/wiki2txt

A tool to extract plain (unformatted) multilingual text, redirects, links, and categories from Wikipedia backups (dumps). Designed to prepare clean training data for AI and machine learning software.
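To make the four kinds of output concrete, here is a minimal sketch of what extracting text, redirects, links, and categories from a dump page can look like. It is not wiki2txt's own code or interface: the function names (`parse_page`, `parse_dump`), the regular expressions, and the tiny hand-written `SAMPLE` are all assumptions made for illustration. Real dumps also use an XML namespace and far richer markup (templates, tables, references), which this sketch ignores.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written sample in the style of a MediaWiki XML export (not real dump data).
SAMPLE = """<mediawiki>
  <page>
    <title>Foo</title>
    <revision><text>'''Foo''' is a [[Bar|kind of bar]] and a [[Baz]].
[[Category:Examples]]</text></revision>
  </page>
  <page>
    <title>Fu</title>
    <revision><text>#REDIRECT [[Foo]]</text></revision>
  </page>
</mediawiki>"""

# Hypothetical patterns covering only the simplest wikitext cases.
REDIRECT_RE = re.compile(r"^#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)
LINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]*))?\]\]")


def parse_page(title, wikitext):
    """Split one page into plain text, redirect target, links, and categories."""
    page = {"title": title, "redirect": None, "links": [], "categories": [], "text": ""}

    redirect = REDIRECT_RE.match(wikitext.strip())
    if redirect:
        page["redirect"] = redirect.group(1).strip()
        return page

    def replace_link(match):
        target, label = match.group(1).strip(), match.group(2)
        if target.lower().startswith("category:"):
            # Category tags are collected and removed from the text.
            page["categories"].append(target.split(":", 1)[1])
            return ""
        # Ordinary links are collected and replaced by their visible label.
        page["links"].append(target)
        return label if label else target

    text = LINK_RE.sub(replace_link, wikitext)
    text = re.sub(r"'{2,}", "", text)          # drop bold/italic quote markup
    page["text"] = re.sub(r"\s+", " ", text).strip()
    return page


def parse_dump(xml_string):
    """Parse every <page> element in an un-namespaced XML string."""
    root = ET.fromstring(xml_string)
    return [
        parse_page(p.findtext("title"), p.findtext("revision/text") or "")
        for p in root.iter("page")
    ]


if __name__ == "__main__":
    for page in parse_dump(SAMPLE):
        print(page)
```

Full dumps are far too large to load with `ET.fromstring`; a streaming parser such as `xml.etree.ElementTree.iterparse` would be the usual choice at that scale.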

Primary language: Python

License: GNU General Public License v2.0 (GPL-2.0)
