appledora/mwparserfromhtml

add function to extract templates to library

appledora opened this issue · 2 comments

Templates are likely the most complex element to extract. They appear mainly in two forms:

  1. as the href of wikilinks: this resembles the case where wikilink hrefs point to Category pages, which we do not count as Categories. We need to decide how to handle such Template links.
  2. nested inside data-mw: this is complicated to process because the structure of the data-mw dictionary varies between elements, and JSON decoding can fail on escape characters, malformed strings, and inconsistent use of single and double quotation marks.
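The decoding problems in the second case could be contained with a defensive loader along these lines. This is only a sketch: `load_data_mw` is a hypothetical helper, not an existing function in the library, and the HTML-unescape fallback is an assumption about one way escaped quotes might show up in the attribute string.

```python
import html
import json


def load_data_mw(raw):
    """Return the decoded data-mw dict, or None if it cannot be decoded.

    Hypothetical helper: the name and the fallback strategy are assumptions.
    """
    if not raw:
        return None
    # Try the string as-is first, then an HTML-unescaped copy
    # (covers attribute values that still contain entities such as &quot;).
    for candidate in (raw, html.unescape(raw)):
        try:
            decoded = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        # Only a JSON object is useful here; lists or scalars are rejected.
        if isinstance(decoded, dict):
            return decoded
    return None
```

Returning None instead of raising lets the caller skip a malformed node and keep processing the rest of the article.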

In example article 4, I found 9 elements with a data-mw dictionary (expected template count: 6):

  • <style about="#mwt20" data-mw='{"name":"templatestyles","attrs":{"src":"Module:Citation/CS1/styles.css"},"body":{"extsrc":""}}' data-mw-deduplicate="TemplateStyles:r1067248974" typeof="mw:Extension/templatestyles">

  • <link about="#mwt23" data-mw='{"parts":[{"template":{"target":{"wt":"authority control","href":"./Template:Authority_control"},"params":{},"i":0}}]}' href="./Category:AC_with_0_elements" id="mwOQ" rel="mw:PageProp/Category" typeof="mw:Transclusion"/> -> Template

  • <style about="#mwt1" data-mw='{"parts":[{"template":{"target":{"wt":"Other people","href":"./Template:Other_people"},"params":{"1":{"wt":"William Clark"}},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1033289096" id="mwAg" typeof="mw:Extension/templatestyles mw:Transclusion"> -> Template

  • <span about="#mwt5" data-mw='{"parts":[{"template":{"target":{"wt":"Use dmy dates","href":"./Template:Use_dmy_dates"},"params":{"date":{"wt":"November 2021"}},"i":0}}]}' id="mwBQ" typeof="mw:Nowiki mw:Transclusion"> </span> -> Template

  • <span about="#mwt8" data-mw='{"parts":[{"template":{"target":{"wt":"Use British English","href":"./Template:Use_British_English"},"params":{"date":{"wt":"July 2012"}},"i":0}}]}' id="mwBw" typeof="mw:Nowiki mw:Transclusion"> -> Template

  • <sup about="#mwt12" class="mw-ref reference" data-mw='{"name":"ref","attrs":{},"body":{"id":"mw-reference-text-cite_note-1"}}' id="cite_ref-1" rel="dc:references" typeof="mw:Extension/ref">

  • <style about="#mwt13" data-mw='{"parts":[{"template":{"target":{"wt":"reflist","href":"./Template:Reflist"},"params":{},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1011085734" id="mwLw" typeof="mw:Extension/templatestyles mw:Transclusion"> -> Template

  • <div about="#mwt19" class="mw-references-wrap" data-mw='{"name":"references","attrs":{"group":"","responsive":"1"},"body":{"html":""}}' id="mwMQ" typeof="mw:Extension/references">

  • <span about="#mwt20" class="noviewer" data-mw='{"parts":[{"template":{"target":{"wt":"DNB","href":"./Template:DNB"},"params":{"wstitle":{"wt":"Clark, William (1821-1880)"}},"i":0}}]}' id="mwNw" typeof="mw:Transclusion mw:Image"> -> Template

In GitLab by @geohci on Jul 7, 2022, 23:01

The attribute names can be changed, and every Template object would also carry the raw HTML string as an attribute in case the user wants to extract other information. As discussed, we can identify templates by the presence of a template key in the entries of the data-mw['parts'] list. Thoughts on these nine nodes:

<style about="#mwt20" data-mw='{"name":"templatestyles","attrs":{"src":"Module:Citation/CS1/styles.css"},"body":{"extsrc":""}}' data-mw-deduplicate="TemplateStyles:r1067248974" typeof="mw:Extension/templatestyles">

Not a template -- don't need to do anything with it!

<link about="#mwt23" data-mw='{"parts":[{"template":{"target":{"wt":"authority control","href":"./Template:Authority_control"},"params":{},"i":0}}]}' href="./Category:AC_with_0_elements" id="mwOQ" rel="mw:PageProp/Category" typeof="mw:Transclusion"/> -> Template

Template(id='mwt23', name='authority control', href='./Template:Authority_control')

We can ignore the Category here -- that'll be covered by a Category object.

<style about="#mwt1" data-mw='{"parts":[{"template":{"target":{"wt":"Other people","href":"./Template:Other_people"},"params":{"1":{"wt":"William Clark"}},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1033289096" id="mwAg" typeof="mw:Extension/templatestyles mw:Transclusion"> -> Template

Template(id='mwt1', name='Other people', href='./Template:Other_people')

<span about="#mwt5" data-mw='{"parts":[{"template":{"target":{"wt":"Use dmy dates","href":"./Template:Use_dmy_dates"},"params":{"date":{"wt":"November 2021"}},"i":0}}]}' id="mwBQ" typeof="mw:Nowiki mw:Transclusion"> </span> -> Template

Template(id='mwt5', name='Use dmy dates', href='./Template:Use_dmy_dates')

<span about="#mwt8" data-mw='{"parts":[{"template":{"target":{"wt":"Use British English","href":"./Template:Use_British_English"},"params":{"date":{"wt":"July 2012"}},"i":0}}]}' id="mwBw" typeof="mw:Nowiki mw:Transclusion"> -> Template

Template(id='mwt8', name='Use British English', href='./Template:Use_British_English')

<sup about="#mwt12" class="mw-ref reference" data-mw='{"name":"ref","attrs":{},"body":{"id":"mw-reference-text-cite_note-1"}}' id="cite_ref-1" rel="dc:references" typeof="mw:Extension/ref">

Not a template -- can skip.

<style about="#mwt13" data-mw='{"parts":[{"template":{"target":{"wt":"reflist","href":"./Template:Reflist"},"params":{},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1011085734" id="mwLw" typeof="mw:Extension/templatestyles mw:Transclusion"> -> Template

Template(id='mwt13', name='reflist', href='./Template:Reflist')

<div about="#mwt19" class="mw-references-wrap" data-mw='{"name":"references","attrs":{"group":"","responsive":"1"},"body":{"html":""}}' id="mwMQ" typeof="mw:Extension/references">

Not a template -- can skip.

<span about="#mwt20" class="noviewer" data-mw='{"parts":[{"template":{"target":{"wt":"DNB","href":"./Template:DNB"},"params":{"wstitle":{"wt":"Clark, William (1821-1880)"}},"i":0}}]}' id="mwNw" typeof="mw:Transclusion mw:Image"> -> Template

Template(id='mwt20', name='DNB', href='./Template:DNB')
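The per-node decisions above could be reproduced with a sketch like the following. It uses only the standard library so that it runs on its own; `Template`, `extract_templates`, and `_DataMwCollector` are hypothetical names chosen for illustration, and taking the id from the `about` attribute is an assumption based on the `Template(id='mwt23', ...)` examples in this thread.

```python
import json
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Template:
    # Fields mirror the Template(...) examples above; the class is a stand-in.
    id: str
    name: str
    href: str


class _DataMwCollector(HTMLParser):
    """Collect the attributes of every start tag that carries data-mw."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "data-mw" in attr_map:
            self.nodes.append(attr_map)


def extract_templates(html_text):
    collector = _DataMwCollector()
    collector.feed(html_text)
    templates = []
    for attrs in collector.nodes:
        try:
            data_mw = json.loads(attrs["data-mw"])
        except (TypeError, json.JSONDecodeError):
            continue  # malformed data-mw: skip the node
        if not isinstance(data_mw, dict):
            continue
        # Extension nodes (ref, references, templatestyles) have no "parts".
        for part in data_mw.get("parts", []):
            if isinstance(part, dict) and "template" in part:
                target = part["template"].get("target", {})
                templates.append(Template(
                    id=(attrs.get("about") or "").lstrip("#"),
                    name=target.get("wt", ""),
                    href=target.get("href", ""),
                ))
    return templates
```

Because the check looks only at the `template` key inside `parts`, the `href="./Category:..."` on the `<link>` node is ignored here and left for a separate Category extractor.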