attardi/wikiextractor

Various tags such as q, br, ins, del are not filtered out

adno opened this issue · 1 comment

adno commented

Many elements/tags are left in wikiextractor's output, such as poem, q, ins, del, br, section, onlyinclude, includeonly, and math. Mathematical equations (with commands such as \mathbf) also appear without any enclosing tags.

  1. Download this dump: https://dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2
  2. Invoke the following command to list lines that contain the opening tags of these elements:

wikiextractor --no-templates --html-safe '' -o - dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2 | grep '<\(poem\|q\|section\|ins\|del\|math\|onlyinclude\|br\|chem\)\b'

Examples from the output:

<poem>
<poem style="margin-left:2em">
<br>"domestic:" good automatic telephone system
…
Benzene, <chem>C6H6</chem>, …
…
<section end="Big Brother series" />
…
<onlyinclude>
…
<chem>O2{} + 4H+(aq){} + 4 Fe^{2+}(cyt\,c) -> 2H2O{} + 4 Fe^{3+}(cyt\,c) </chem> formula_1
…
</includeonly><section end=Lineups />

(Not all of the tags appear in this particular dump.)
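For anyone who wants per-tag counts rather than a list of matching lines, the grep above can be approximated in Python. This is only a sketch: the tag list is copied from the grep pattern, and the function name `count_leftover_tags` is made up for illustration and is not part of wikiextractor.

```python
import re
from collections import Counter

# Tag names copied from the grep pattern in the report.
TAGS = ("poem", "q", "section", "ins", "del", "math", "onlyinclude", "br", "chem")

# "<name" followed by a word boundary, mirroring '<\(poem\|q\|...\)\b'.
# Closing tags such as "</chem>" are not matched because of the slash.
OPENING_TAG = re.compile(r"<(" + "|".join(TAGS) + r")\b")


def count_leftover_tags(lines):
    """Count opening tags per tag name (hypothetical helper, not wikiextractor API)."""
    counts = Counter()
    for line in lines:
        for match in OPENING_TAG.finditer(line):
            counts[match.group(1)] += 1
    return counts
```

Passing `sys.stdin` as `lines` should allow the extractor output to be piped into a script built around this function instead of into grep.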

adno commented

There are similar issues with mapframe and score elements (#301) and table formatting (#298).