attardi/wikiextractor

Parsing seems to exclude some part of the page

franluca opened this issue · 0 comments

Thanks for the great library!

I noticed that the resulting entries may miss some meaningful content, e.g.

{"id": "75159532", "revid": "39374154", "url": "https://en.wikipedia.org/wiki?curid=75159532", "title": "Tyszko", "text": "Tyszko is a surname. Notable people with the surname include: "}

is missing the list of notable people.
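
To get a rough idea of how many entries are affected, here is a minimal sketch that flags entries whose text ends with a colon, as in the example above. It assumes the output holds one JSON object per line with the `id`, `title` and `text` fields shown; the helper name `find_truncated_entries` is just an illustration and is not part of wikiextractor.

```python
import json


def find_truncated_entries(lines):
    """Yield (id, title) for entries whose text ends with a colon.

    Heuristic only: a trailing colon suggests that a list following
    the sentence was dropped during extraction.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if entry.get("text", "").rstrip().endswith(":"):
            yield entry["id"], entry["title"]


# The entry from this report, used as sample input.
sample = [
    '{"id": "75159532", "revid": "39374154", '
    '"url": "https://en.wikipedia.org/wiki?curid=75159532", '
    '"title": "Tyszko", '
    '"text": "Tyszko is a surname. Notable people with the surname include: "}',
]

print(list(find_truncated_entries(sample)))  # [('75159532', 'Tyszko')]
```

The same function can be fed an open output file instead of `sample`, since it only needs an iterable of lines.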

I'm using the standard command

python -m wikiextractor.WikiExtractor <dump name> --json -o <output folder>

Am I missing something?

Thanks again,
Luca