add function to extract templates to library
appledora opened this issue · 2 comments
Likely the most complex element to extract. Templates appear mainly in two forms:
- as the href of Wikilinks: Category has a similar case, where Wikilink hrefs point to Category links, in which case we don't consider those elements Categories. So we have to decide what to do about such Template links.
- nested inside data-mw: this is complicated to process because of the asymmetric nature of the data-mw dictionary and JSON decoding errors caused by escape characters, malformed strings, inconsistent use of single and double quotation marks, etc.
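The decoding problems mentioned above could be contained in one defensive helper. This is a minimal sketch, not the library's implementation: the name `load_data_mw` is hypothetical, and it assumes the attribute value has already been entity-unescaped by whatever HTML parser is in use.

```python
import json
from typing import Optional


def load_data_mw(raw: Optional[str]) -> Optional[dict]:
    """Decode a data-mw attribute value, returning None instead of raising.

    Hypothetical helper for illustration only.
    """
    if not raw:
        return None
    try:
        decoded = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed strings, stray escapes, or single-quoted pseudo-JSON end up here.
        return None
    # data-mw is expected to be a JSON object; anything else is treated as unusable.
    return decoded if isinstance(decoded, dict) else None


good = '{"parts":[{"template":{"target":{"wt":"reflist","href":"./Template:Reflist"},"params":{},"i":0}}]}'
bad = "{'parts': [{'template': {}}]}"  # single quotes are not valid JSON

print(load_data_mw(good) is not None)
print(load_data_mw(bad))
```

Returning None rather than raising lets the caller skip a bad node and keep processing the rest of the article; whether silently skipping is the right policy is a design decision for the library.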
From example article 4, found 9 elements with the data-mw dictionary (expected template count: 6):
- <style about="#mwt20" data-mw='{"name":"templatestyles","attrs":{"src":"Module:Citation/CS1/styles.css"},"body":{"extsrc":""}}' data-mw-deduplicate="TemplateStyles:r1067248974" typeof="mw:Extension/templatestyles">
- <link about="#mwt23" data-mw='{"parts":[{"template":{"target":{"wt":"authority control","href":"./Template:Authority_control"},"params":{},"i":0}}]}' href="./Category:AC_with_0_elements" id="mwOQ" rel="mw:PageProp/Category" typeof="mw:Transclusion"/> -> Template
- <style about="#mwt1" data-mw='{"parts":[{"template":{"target":{"wt":"Other people","href":"./Template:Other_people"},"params":{"1":{"wt":"William Clark"}},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1033289096" id="mwAg" typeof="mw:Extension/templatestyles mw:Transclusion"> -> Template
- <span about="#mwt5" data-mw='{"parts":[{"template":{"target":{"wt":"Use dmy dates","href":"./Template:Use_dmy_dates"},"params":{"date":{"wt":"November 2021"}},"i":0}}]}' id="mwBQ" typeof="mw:Nowiki mw:Transclusion"> </span> -> Template
- <span about="#mwt8" data-mw='{"parts":[{"template":{"target":{"wt":"Use British English","href":"./Template:Use_British_English"},"params":{"date":{"wt":"July 2012"}},"i":0}}]}' id="mwBw" typeof="mw:Nowiki mw:Transclusion"> -> Template
- <sup about="#mwt12" class="mw-ref reference" data-mw='{"name":"ref","attrs":{},"body":{"id":"mw-reference-text-cite_note-1"}}' id="cite_ref-1" rel="dc:references" typeof="mw:Extension/ref">
- <style about="#mwt13" data-mw='{"parts":[{"template":{"target":{"wt":"reflist","href":"./Template:Reflist"},"params":{},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1011085734" id="mwLw" typeof="mw:Extension/templatestyles mw:Transclusion"> -> Template
- <div about="#mwt19" class="mw-references-wrap" data-mw='{"name":"references","attrs":{"group":"","responsive":"1"},"body":{"html":""}}' id="mwMQ" typeof="mw:Extension/references">
- <span about="#mwt20" class="noviewer" data-mw='{"parts":[{"template":{"target":{"wt":"DNB","href":"./Template:DNB"},"params":{"wstitle":{"wt":"Clark, William (1821-1880)"}},"i":0}}]}' id="mwNw" typeof="mw:Transclusion mw:Image"> -> Template
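A listing like the one above could be produced by scanning the article HTML for every element that carries a data-mw attribute. The sketch below uses only the standard library's `html.parser`; the class name `DataMwCollector` and the helper `has_template_part` are hypothetical, and the sample is trimmed to three of the nine nodes.

```python
import json
from html.parser import HTMLParser


class DataMwCollector(HTMLParser):
    """Collect (tag, about, decoded data-mw) for every element with data-mw."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link .../> also reach this method,
        # because the default handle_startendtag delegates to it.
        attr_map = dict(attrs)
        raw = attr_map.get("data-mw")
        if raw is None:
            return
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            return  # this sketch simply skips undecodable attributes
        self.found.append((tag, attr_map.get("about"), data))


def has_template_part(data):
    """True if any entry of data['parts'] is a dict with a 'template' key."""
    parts = data.get("parts", [])
    return any(isinstance(p, dict) and "template" in p for p in parts)


sample = '''
<sup about="#mwt12" class="mw-ref reference" data-mw='{"name":"ref","attrs":{},"body":{"id":"mw-reference-text-cite_note-1"}}' id="cite_ref-1" rel="dc:references" typeof="mw:Extension/ref"></sup>
<link about="#mwt23" data-mw='{"parts":[{"template":{"target":{"wt":"authority control","href":"./Template:Authority_control"},"params":{},"i":0}}]}' href="./Category:AC_with_0_elements" id="mwOQ" rel="mw:PageProp/Category" typeof="mw:Transclusion"/>
<span about="#mwt5" data-mw='{"parts":[{"template":{"target":{"wt":"Use dmy dates","href":"./Template:Use_dmy_dates"},"params":{"date":{"wt":"November 2021"}},"i":0}}]}' id="mwBQ" typeof="mw:Nowiki mw:Transclusion"> </span>
'''

collector = DataMwCollector()
collector.feed(sample)
for tag, about, data in collector.found:
    print(tag, about, has_template_part(data))
```

The two shapes of the dictionary show up directly: extension nodes carry top-level `name`/`attrs`/`body` keys, while transclusions carry a `parts` list.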
In GitLab by @geohci on Jul 7, 2022, 23:01
The attribute names can be changed, and all Template objects would also have a raw HTML string attribute in case the user wants to extract other information. As discussed, we can identify templates by the presence of the template key in the data-mw['parts'] dictionary. Thoughts on these nine nodes:
<style about="#mwt20" data-mw='{"name":"templatestyles","attrs":{"src":"Module:Citation/CS1/styles.css"},"body":{"extsrc":""}}' data-mw-deduplicate="TemplateStyles:r1067248974" typeof="mw:Extension/templatestyles">
Not a template -- don't need to do anything with it!
<link about="#mwt23" data-mw='{"parts":[{"template":{"target":{"wt":"authority control","href":"./Template:Authority_control"},"params":{},"i":0}}]}' href="./Category:AC_with_0_elements" id="mwOQ" rel="mw:PageProp/Category" typeof="mw:Transclusion"/>
-> Template
Template(id='mwt23', name='authority control', href='./Template:Authority_control')
We can ignore the Category here -- that'll be covered by a Category object.
<style about="#mwt1" data-mw='{"parts":[{"template":{"target":{"wt":"Other people","href":"./Template:Other_people"},"params":{"1":{"wt":"William Clark"}},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1033289096" id="mwAg" typeof="mw:Extension/templatestyles mw:Transclusion">
-> Template
Template(id='mwt1', name='Other people', href='./Template:Other_people')
<span about="#mwt5" data-mw='{"parts":[{"template":{"target":{"wt":"Use dmy dates","href":"./Template:Use_dmy_dates"},"params":{"date":{"wt":"November 2021"}},"i":0}}]}' id="mwBQ" typeof="mw:Nowiki mw:Transclusion"> </span>
-> Template
Template(id='mwt5', name='Use dmy dates', href='./Template:Use_dmy_dates')
<span about="#mwt8" data-mw='{"parts":[{"template":{"target":{"wt":"Use British English","href":"./Template:Use_British_English"},"params":{"date":{"wt":"July 2012"}},"i":0}}]}' id="mwBw" typeof="mw:Nowiki mw:Transclusion">
-> Template
Template(id='mwt8', name='Use British English', href='./Template:Use_British_English')
<sup about="#mwt12" class="mw-ref reference" data-mw='{"name":"ref","attrs":{},"body":{"id":"mw-reference-text-cite_note-1"}}' id="cite_ref-1" rel="dc:references" typeof="mw:Extension/ref">
Not a template -- can skip.
<style about="#mwt13" data-mw='{"parts":[{"template":{"target":{"wt":"reflist","href":"./Template:Reflist"},"params":{},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1011085734" id="mwLw" typeof="mw:Extension/templatestyles mw:Transclusion">
-> Template
Template(id='mwt13', name='reflist', href='./Template:Reflist')
<div about="#mwt19" class="mw-references-wrap" data-mw='{"name":"references","attrs":{"group":"","responsive":"1"},"body":{"html":""}}' id="mwMQ" typeof="mw:Extension/references">
Not a template -- can skip.
<span about="#mwt20" class="noviewer" data-mw='{"parts":[{"template":{"target":{"wt":"DNB","href":"./Template:DNB"},"params":{"wstitle":{"wt":"Clark, William (1821-1880)"}},"i":0}}]}' id="mwNw" typeof="mw:Transclusion mw:Image">
-> Template
Template(id='mwt20', name='DNB', href='./Template:DNB')
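The rule applied to the nine nodes above (a node is a template when an entry of data-mw['parts'] has a template key) could be sketched as follows. This is an illustration under assumptions, not the final API: the function name `templates_from_node` is hypothetical, the field names follow the `Template(id=..., name=..., href=...)` form used in this comment, and the raw HTML string attribute is left out for brevity.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Template:
    id: str    # value of the about attribute without the leading '#'
    name: str  # target.wt from data-mw
    href: str  # target.href from data-mw


def templates_from_node(about: str, data_mw: str) -> List[Template]:
    """Build Template objects from one node's about and data-mw attributes."""
    try:
        data = json.loads(data_mw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    templates = []
    for part in data.get("parts", []):
        # Entries without a 'template' key are not templates and are skipped.
        if not isinstance(part, dict) or "template" not in part:
            continue
        target = part["template"].get("target", {})
        templates.append(
            Template(
                id=about.lstrip("#"),
                name=target.get("wt", ""),
                href=target.get("href", ""),
            )
        )
    return templates


link_data_mw = '{"parts":[{"template":{"target":{"wt":"authority control","href":"./Template:Authority_control"},"params":{},"i":0}}]}'
ref_data_mw = '{"name":"ref","attrs":{},"body":{"id":"mw-reference-text-cite_note-1"}}'

print(templates_from_node("#mwt23", link_data_mw))
print(templates_from_node("#mwt12", ref_data_mw))
```

Nodes such as the ref and references extensions have no parts key, so they fall through to an empty list, which matches the "Not a template" verdicts above.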