gambolputty/wikitable2csv

CSV contains citation link text

ptrstn opened this issue · 1 comment

Hello,

When a wiki table contains a citation (e.g. [2]), the generated CSV includes the citation marker as plain text. This is probably not desired.

Example: https://de.wikipedia.org/wiki/Liste_traditioneller_Radikale#Tabelle_der_Radikale


Output:

Nr.,Zeichen (Varianten),Pīnyīn,Bedeutung und Anmerkungen,Häufig-keit,Kurz-zeichen,Beispiele
147,.mw-parser-output .Hant{font-size:110%}見,jiàn,sehen,161,见[2],規親覺觀
148,角,jiǎo,"Horn, Ecke",158,,觚解觕觥觸
149,言 (訁 links),yán,"sprechen, Wort",861,讠[2]links,誁詋詔評詗詥試詧

(The [2] is the undesired text: it is meaningless on its own.)

The HTML responsible for this is:

<td>
   <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r184932629">
   <span lang="zh-Hans" class="Hans"></span>
   <sup id="cite_ref-s_2-1" class="reference">
      <a href="#cite_note-s-2">[2]</a>
   </sup>
</td>

Can the citation links (hyperlinks whose text is a number in square brackets) be removed when generating the CSV?
In other words, drop every <a> tag that is wrapped in a <sup> tag with class="reference".
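
As a rough illustration of the requested behaviour, here is a minimal Python sketch that extracts the text of a table cell while skipping everything inside `<sup class="reference">`. It is not the app's code: the names `CellTextExtractor` and `cell_text` are hypothetical, it uses only the standard library's `html.parser`, and the sample cell is a simplified version of the one above.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect cell text, skipping <sup class="reference"> subtrees.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a citation <sup>

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            # Already inside a citation: only track nested <sup> tags.
            if tag == "sup":
                self.skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "sup" and "reference" in classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if self.skip_depth and tag == "sup":
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside any citation <sup>.
        if not self.skip_depth:
            self.parts.append(data)


def cell_text(html):
    """Return the visible text of an HTML fragment without citation markers."""
    parser = CellTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


cell = (
    '<td><span lang="zh-Hans" class="Hans">见</span>'
    '<sup id="cite_ref-s_2-1" class="reference">'
    '<a href="#cite_note-s-2">[2]</a></sup></td>'
)
print(cell_text(cell))
```

The class check splits the `class` attribute on whitespace, so a `<sup>` carrying additional classes alongside `reference` is still skipped.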

Hey, I rewrote the app. There is now an option to exclude elements from parsing by class name. It is set to “reference” by default, which excludes those citation links, and the list can be extended with more class names separated by commas. This fixes the issue.
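
To make the idea of a comma-separated exclusion list concrete, here is a minimal Python sketch. It is an assumption about how such an option could work, not the rewritten app's implementation: `parse_class_list`, `ExcludingTextExtractor`, `extract_text`, and the `noprint` class are all hypothetical, and only the standard library is used.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they cannot open a skipped subtree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_class_list(option):
    """Turn 'reference, noprint' into {'reference', 'noprint'}."""
    return {name.strip() for name in option.split(",") if name.strip()}


class ExcludingTextExtractor(HTMLParser):
    """Collect text, skipping any element that carries an excluded class."""

    def __init__(self, excluded_classes):
        super().__init__()
        self.excluded = set(excluded_classes)
        self.parts = []
        self.skip_tag = None   # tag name of the element being skipped
        self.skip_depth = 0    # nesting depth of that tag name

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            # Inside a skipped element: count nested tags of the same name.
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & self.excluded:
            self.skip_tag = tag
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag:
            self.skip_depth -= 1
            if self.skip_depth == 0:
                self.skip_tag = None

    def handle_data(self, data):
        if self.skip_tag is None:
            self.parts.append(data)


def extract_text(html, exclude="reference"):
    """Return the text of an HTML fragment, minus elements with excluded classes."""
    parser = ExcludingTextExtractor(parse_class_list(exclude))
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(extract_text('见<sup class="reference"><a href="#cite_note-s-2">[2]</a></sup>'))
```

Passing `exclude="reference, noprint"` would also drop elements with the (hypothetical) `noprint` class, while an empty string disables the filtering entirely.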