common-voice/cv-sentence-extractor

WikiExtractor doesn't extract text for bn, hi

arijitx opened this issue · 1 comments

Hi, I tried WikiExtractor on the Wikisource dumps for bn, hi and es. For bn and hi it doesn't work: it only extracts one or two words per page, for example:

{"id": "5", "url": "https://bn.wikisource.org/wiki?curid=5", "title": "সানাই/গানের জাল", "text": "সানাই/গানের জাল\n\n<pages index=\"সানাই-রবীন্দ্রনাথ ঠাকুর.djvu\" from=88 to=88 header=1/>"}

For es, however, it seems to be working.
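To gauge how many pages are affected, something like the following could scan the extractor output for records whose text still contains a raw `<pages .../>` tag, as in the bn sample above. This is only a rough sketch: it assumes the output is one JSON object per line with `id`, `title` and `text` fields (as in the record shown), and `find_unexpanded` is a hypothetical helper name, not part of WikiExtractor.

```python
import json
import re

# Matches a leftover <pages ...> tag like the one in the sample record.
PAGES_TAG = re.compile(r"<pages\b[^>]*>", re.IGNORECASE)


def find_unexpanded(lines):
    """Yield (id, title) for records whose text still contains a <pages> tag.

    `lines` is any iterable of JSON strings, one record per line
    (assumed output format; adjust if your extractor output differs).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if PAGES_TAG.search(record.get("text", "")):
            yield record.get("id"), record.get("title")


if __name__ == "__main__":
    # Record rebuilt from the sample in this report, plus a made-up clean one.
    broken = json.dumps({
        "id": "5",
        "url": "https://bn.wikisource.org/wiki?curid=5",
        "title": "সানাই/গানের জাল",
        "text": "সানাই/গানের জাল\n\n<pages index=\"সানাই-রবীন্দ্রনাথ ঠাকুর.djvu\" "
                "from=88 to=88 header=1/>",
    }, ensure_ascii=False)
    clean = json.dumps({"id": "6", "title": "ok", "text": "plain extracted text"})
    for page_id, title in find_unexpanded([broken, clean]):
        print(page_id, title)
```

Running the same check over the bn, hi and es outputs would show whether the es dump simply has fewer pages of this kind.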

Which version of WikiExtractor are you using locally? The extraction here uses an older version. Can you update your version and try again locally? If the problem persists on the latest version, I would say the bug should be reported at https://github.com/attardi/wikiextractor/issues. If it works with the latest version, I will need to look into updating the version we use in the extraction process.