attardi/wikiextractor

Bullet points are missing in the final extracted text

miguelwon opened this issue · 0 comments

I found this issue while analysing the extracted output for the page Diffraction (ID: 8603).
The "Patterns" section contains three bullet points:

  • The angular spacing of the features...
    ...

These bullet points are ignored and not included in the final cleaned text. I think this is because of the asterisk that marks each list item in the wikitext.
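To illustrate the suspected cause, here is a minimal sketch in Python. It is not WikiExtractor's actual code: both `clean_dropping_lists` and `clean_keeping_lists` are hypothetical helpers, and the sample text is a shortened stand-in for the real section. It only shows how a cleaner that discards every line starting with a list marker would lose the bullet text, while one that strips just the marker would keep it.

```python
import re

# Wikitext list markers at the start of a line:
# '*' (bullet), '#' (numbered), ';' and ':' (definition / indent).
LIST_MARKER = re.compile(r"^[*#;:]+\s*")


def clean_dropping_lists(wikitext: str) -> str:
    """Hypothetical cleaner that discards list lines entirely (the reported symptom)."""
    return "\n".join(
        line for line in wikitext.splitlines() if not LIST_MARKER.match(line)
    )


def clean_keeping_lists(wikitext: str) -> str:
    """Hypothetical cleaner that removes only the marker and keeps the item text."""
    return "\n".join(LIST_MARKER.sub("", line) for line in wikitext.splitlines())


# Shortened stand-in for the section; the real article text is longer.
sample = (
    "Intro line.\n"
    "* The angular spacing of the features...\n"
    "Closing line."
)

dropped = clean_dropping_lists(sample)  # bullet line is gone
kept = clean_keeping_lists(sample)      # bullet text survives without the '*'
```

If the extractor behaves like the first helper, that would explain the missing text; this is only a guess until the relevant code path is checked.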

To replicate:

I extracted the page with extractPage, created a new file containing the single page from its output, and then ran WikiExtractor on that file.

python -m wikiextractor.extractPage --id 8603 enwiki-latest-pages-articles-multistream.xml.bz2

python -m wikiextractor.WikiExtractor page_8603.xml --json -o teste
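To confirm the symptom after running the commands above, a small check along these lines may help. It is a sketch under assumptions: that the `--json` output is written as one JSON object per line with `id` and `text` fields, in files named `wiki_*` somewhere under the output directory. The helper name `pages_containing` is made up for this example.

```python
import json
from pathlib import Path


def pages_containing(output_dir, phrase):
    """Return the ids of extracted pages whose text contains `phrase`.

    Assumes one JSON object per line with "id" and "text" keys, stored in
    files matching wiki_* below output_dir (an assumption about the layout).
    """
    hits = []
    for path in sorted(Path(output_dir).rglob("wiki_*")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                page = json.loads(line)
                if phrase in page.get("text", ""):
                    hits.append(page.get("id"))
    return hits
```

Calling `pages_containing("teste", "The angular spacing of the features")` should return an empty list while the bug is present, and a list containing the page id once the bullet points are kept.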