WikiExtractor doesn't extract text for bn, hi

Question

arijitx opened this issue 2 years ago · 1 comment

Hi, I tried WikiExtractor on the Wikisource dumps for bn, hi and es. For bn and hi it doesn't work: it extracts only one or two words per page, for example:

{"id": "5", "url": "https://bn.wikisource.org/wiki?curid=5", "title": "সানাই/গানের জাল", "text": "সানাই/গানের জাল\n\n<pages index=\"সানাই-রবীন্দ্রনাথ ঠাকুর.djvu\" from=88 to=88 header=1/>"}

For es, it seems to work.
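To see how many records in a dump are affected, the symptom in the record above (a title followed by a raw `<pages ... />` tag instead of the page text) can be checked mechanically. This is only a minimal sketch: it assumes the extractor output is one JSON object per line with a `text` field, as in the sample, and the helper name `has_unexpanded_pages_tag` is made up for illustration.

```python
import json
import re

# Matches a leftover <pages ... /> tag like the one in the sample record.
PAGES_TAG = re.compile(r"<pages\b[^>]*/?>", re.IGNORECASE)


def has_unexpanded_pages_tag(record_line: str) -> bool:
    """Return True if the record's text still contains a raw <pages .../> tag.

    Hypothetical helper: assumes one JSON record per line with a "text" key.
    """
    record = json.loads(record_line)
    return bool(PAGES_TAG.search(record.get("text", "")))


# Build a record shaped like the one in the report.
sample = json.dumps(
    {
        "id": "5",
        "title": "সানাই/গানের জাল",
        "text": 'সানাই/গানের জাল\n\n<pages index="সানাই-রবীন্দ্রনাথ ঠাকুর.djvu" from=88 to=88 header=1/>',
    },
    ensure_ascii=False,
)
print(has_unexpanded_pages_tag(sample))
```

Running this over each line of the bn, hi and es output files would show whether the es dump really is free of such records or just has fewer of them.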

Answer 1 · 2022-04-04T18:06:57.000Z

Which version of WikiExtractor are you using locally? Our extraction process uses an older version. Can you update your local version and try again? If the problem persists on the latest version, I would say the bug should be reported at https://github.com/attardi/wikiextractor/issues. If it works with the latest version, I will need to look into updating what we use in the extraction process.
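One way to check the local version, as a rough sketch: this assumes WikiExtractor was installed with pip under the distribution name `wikiextractor`. If you run it as a script from a cloned repository instead, the lookup will find nothing and return `None`. The function name `installed_version` is just for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "wikiextractor") -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent.

    The default distribution name is an assumption about how the tool
    was installed locally.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version())
```

Including that version string in the bug report makes it easier to tell whether the behaviour differs between releases.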