wpoa/JATS-to-Mediawiki

Taxonomic names using `<named-content>` need spaces when more than one term

Daniel-Mietchen opened this issue · 6 comments

hmm, this is strange, any thoughts on this issue?

The issue seems not to be <italic> per se, for which the XSLT is written in the same way as for <bold> (both preserve spaces successfully), but rather that there are already no spaces separating the taxonomic elements that wrap the animal name, as you can see here in the PMC NXML:

<italic><named-content content-type="taxon-name"><named-content content-type="genus">Johngarthia</named-content>
<named-content content-type="species">planata</named-content></named-content></italic>

The original article XML is written thusly:

<tp:taxon-name-part taxon-name-part-type="genus">Johngarthia</tp:taxon-name-part>
<tp:taxon-name-part taxon-name-part-type="species">planata</tp:taxon-name-part>

from http://bdj.pensoft.net/lib/ajax_srv/article_elements_srv.php?action=download_xml&item_id=1161

Perhaps it would be prudent to handle the <named-content> tag explicitly, specifically when the content-type attribute has a taxon-name or species value. Right now, it seems these named-content tags are simply removed, but where several of them occur in a row, a space character could be inserted between them instead.

FWIW, it shows up in Entrez search results with the same problem: http://www.ncbi.nlm.nih.gov/pmc/?term=PMC4092324.

I'll take a look to see if I can fix it.

@wrought, the NXML looks okay to me. It has a newline between the two elements, which should be preserved and normalized, I think. In other words, I think newlines should be converted into spaces in the wikitext.

The problem is this: <xsl:strip-space elements="*"/>, which causes every whitespace-only text node inside elements in the input to be stripped in a pretty draconian way, including the newline between the two <named-content> elements. Unfortunately, I don't see an easy solution ... changing named-content to preserve whitespace breaks the wikitext.
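To make the effect concrete, here is a minimal sketch in Python (standard library only, not the project's XSLT). It assumes that dropping whitespace-only text nodes is a fair emulation of what strip-space does to this fragment; the `flatten` helper is hypothetical and exists only for illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the PMC NXML quoted above, including the newline
# between the genus and species elements.
NXML = (
    '<italic><named-content content-type="taxon-name">'
    '<named-content content-type="genus">Johngarthia</named-content>\n'
    '<named-content content-type="species">planata</named-content>'
    '</named-content></italic>'
)


def flatten(elem, strip_whitespace_only):
    """Concatenate all text under elem in document order.

    With strip_whitespace_only=True, text nodes that consist solely of
    whitespace are dropped (an emulation of xsl:strip-space, not XSLT itself).
    """
    parts = []
    for piece in elem.itertext():
        if strip_whitespace_only and not piece.strip():
            continue
        parts.append(piece)
    return "".join(parts)


root = ET.fromstring(NXML)

# Whitespace-only nodes removed: the two name parts run together.
stripped = flatten(root, strip_whitespace_only=True)

# Whitespace kept, then collapsed to single spaces: the newline becomes a space.
kept = " ".join(flatten(root, strip_whitespace_only=False).split())

print(stripped)  # Johngarthiaplanata
print(kept)      # Johngarthia planata
```

The second variant is the "convert newlines into spaces" behaviour suggested above; whether it can be applied across the whole stylesheet without breaking the wikitext is a separate question.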

Okay, I think I fixed this one special case, but I'm very worried that I broke something else.

Whitespace handling is one of the truly hard problems in document processing, and right now what we have is a pretty bad hack job.

We really need a test framework, where we can put some regression tests in, so we can be sure we're not breaking other things as we work on this.
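As a rough idea of what such regression tests might look like, here is a small table-driven sketch in Python. Everything in it is an assumption for illustration: `convert` is only a stand-in that flattens text and collapses whitespace, where the real harness would run the stylesheet through an XSLT processor, and the cases and the `run_cases` helper are hypothetical.

```python
import xml.etree.ElementTree as ET


def convert(nxml):
    """Stand-in for the real conversion (hypothetical).

    Flattens the text content and collapses runs of whitespace to one space.
    """
    text = "".join(ET.fromstring(nxml).itertext())
    return " ".join(text.split())


# (description, input fragment, expected text); illustrative cases only.
CASES = [
    (
        "taxon name parts separated by a newline",
        '<italic><named-content content-type="genus">Johngarthia</named-content>\n'
        '<named-content content-type="species">planata</named-content></italic>',
        "Johngarthia planata",
    ),
    (
        "spaces around bold are kept",
        "<p>a <bold>b</bold> c</p>",
        "a b c",
    ),
]


def run_cases(convert_fn):
    """Run every case through convert_fn and return a list of failure messages."""
    failures = []
    for description, nxml, expected in CASES:
        actual = convert_fn(nxml)
        if actual != expected:
            failures.append(f"{description}: expected {expected!r}, got {actual!r}")
    return failures


failures = run_cases(convert)
print(failures or "all cases pass")
```

Each whitespace bug that gets fixed would add one more entry to the table, so a later change that reintroduces it shows up as a named failure.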