wpoa/JATS-to-Mediawiki

what to do about equations relying on images

notconfusing opened this issue · 11 comments

What's the PMCID? If the XML contains the source TeX or MathML, then it should be rendered with MathJax on the wiki.

PMCID: PMC3166320

https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Ranking_Candidate_Disease_Genes_from_Gene_Expression_and_Protein_Interaction_A_Katz-Centrality_Based_Approach

Max Klein
http://notconfusing.com/

On Fri, Jul 25, 2014 at 5:04 PM, Chris Maloney notifications@github.com
wrote:

What's the PMCID? If the XML contains the source TeX or MathML, then it
should be rendered with MathJax on the wiki.



http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3166320
does not have anything other than the images, it seems. Example:

Then we drop the superscript and write Eq. (2) on matrix format as
<disp-formula><graphic xlink:href="pone.0024306.e003"/><label>(3)</label></disp-formula>
where <bold>d</bold>  =  (1,…,1)<italic><sup>T</sup></italic>. Which gives
<disp-formula><graphic xlink:href="pone.0024306.e004"/><label>(4)</label></disp-formula>

The same goes for the XML available from PLOS directly. Seems to be a clear case for JATS4R. Will open a ticket there and have it point here.

If this is detectable, which it seems to be from
<disp-formula><graphic..>, then we can detect it and warn on upload.
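A minimal sketch of that detection, assuming the article XML is already in hand as a string: it flags every <disp-formula> that carries a <graphic> but no <tex-math> or MathML source. The function name find_image_only_formulas and the equation (5) in the sample are hypothetical, added only for illustration; equations (3) and (4) are the ones quoted above.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
MATHML_MATH = "{http://www.w3.org/1998/Math/MathML}math"


def find_image_only_formulas(xml_text):
    """Return (label, href) pairs for display formulas that have only an image."""
    root = ET.fromstring(xml_text)
    hits = []
    for formula in root.iter("disp-formula"):
        # A formula counts as having source if TeX or MathML appears anywhere inside it.
        has_tex = next(formula.iter("tex-math"), None) is not None
        has_mathml = next(formula.iter(MATHML_MATH), None) is not None
        if has_tex or has_mathml:
            continue
        label = formula.findtext("label", default="")
        for graphic in formula.iter("graphic"):
            hits.append((label, graphic.get(XLINK_HREF, "")))
    return hits


# Sample built from the fragment quoted in this thread; formula (5) is a
# made-up case that has MathML next to its image and so should not be flagged.
SAMPLE = """<body xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:mml="http://www.w3.org/1998/Math/MathML">
<p>Then we drop the superscript and write Eq. (2) on matrix format as
<disp-formula><graphic xlink:href="pone.0024306.e003"/><label>(3)</label></disp-formula>
where <bold>d</bold> = (1,...,1)<italic><sup>T</sup></italic>. Which gives
<disp-formula><graphic xlink:href="pone.0024306.e004"/><label>(4)</label></disp-formula>
<disp-formula><mml:math><mml:mi>x</mml:mi></mml:math><graphic xlink:href="hypothetical.e005"/><label>(5)</label></disp-formula>
</p></body>"""

for label, href in find_image_only_formulas(SAMPLE):
    print(f"WARNING: equation {label} is image-only ({href})")
```

The list it returns could feed whatever warning the upload step ends up printing.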

Max Klein
http://notconfusing.com/

On Fri, Jul 25, 2014 at 11:36 PM, Daniel Mietchen notifications@github.com
wrote:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3166320
does not have anything other than the images, it seems. Example:

Then we drop the superscript and write Eq. (2) on matrix format as [equation image] (3) where d = (1,…,1)^T. Which gives [equation image] (4)



Here is another example, in which <disp-formula><graphic..> is not used:

<p>Haplotype diversity was estimated as
</p>
<p><inline-graphic xlink:href="1471-2156-5-26-i1.gif"/>
</p>

(from https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Most_of_the_extant_mtDNA_boundaries_in_South_and_Southwest_Asia_were_likely_shaped_during_the_initial_settlement_of_Eurasia_by_anatomically_moder#Data_analysis ).
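For this second pattern, one possible heuristic is to flag any <p> whose only content is a single <inline-graphic>. This is a sketch, not a reliable classifier: a standalone inline graphic is not necessarily an equation, so the result is best treated as a list of candidates for manual review. The function name find_graphic_only_paragraphs and the second href in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def find_graphic_only_paragraphs(xml_text):
    """Return hrefs of <inline-graphic> elements that are alone in their <p>."""
    root = ET.fromstring(xml_text)
    hrefs = []
    for para in root.iter("p"):
        children = list(para)
        if len(children) != 1 or children[0].tag != "inline-graphic":
            continue
        # Text before the graphic lives in para.text, text after it in the tail.
        surrounding = (para.text or "") + (children[0].tail or "")
        if surrounding.strip() == "":
            hrefs.append(children[0].get(XLINK_HREF, ""))
    return hrefs


# First two paragraphs are the fragment quoted above; the third is a made-up
# case where the graphic sits inside running text and should be ignored.
SAMPLE_PARAGRAPHS = """<sec xmlns:xlink="http://www.w3.org/1999/xlink">
<p>Haplotype diversity was estimated as
</p>
<p><inline-graphic xlink:href="1471-2156-5-26-i1.gif"/>
</p>
<p>where <inline-graphic xlink:href="hypothetical-i2.gif"/> is a frequency.</p>
</sec>"""

print(find_graphic_only_paragraphs(SAMPLE_PARAGRAPHS))
```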

I have not found any relevant open source tools for this.

Using OCR on the PDF of such articles (via peerlibrary, which is stable) is very inaccurate (e.g. an upper-case Sigma character is matched with an upper-case "X" character): https://peerlibrary.org/p/ycXY3dk2LFGHsDfE2

There is one extraction library, but it requires the image files to contain original TeX or LaTeX in the file metadata, otherwise it won't work. It doesn't work with images from the named PLOS article above. http://www-cdf.fnal.gov/~cplager/latex/#png
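To check quickly whether an equation image carries any embedded text at all, the PNG text chunks can be read directly. The sketch below reads only uncompressed tEXt chunks (zTXt and iTXt are ignored) and does not verify checksums. Whether the library linked above stores its TeX in a tEXt chunk is an assumption here, as are the names png_text_chunks and make_chunk and the "Comment" keyword in the demo.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_text_chunks(data):
    """Return {keyword: text} for every tEXt chunk in raw PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    found = {}
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte length, 4-byte type, body, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            keyword, _, text = body.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
        elif chunk_type == b"IEND":
            break
        pos += 12 + length
    return found


def make_chunk(chunk_type, body):
    """Build one PNG chunk (demo helper for constructing test input)."""
    crc = zlib.crc32(chunk_type + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


# Demo on a synthetic, pixel-free PNG skeleton; a real check would read the
# downloaded equation image with open(path, "rb").read() instead.
header = make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
with_text = (PNG_SIGNATURE + header
             + make_chunk(b"tEXt", b"Comment\x00E = mc^2")
             + make_chunk(b"IEND", b""))
print(png_text_chunks(with_text))
```

An empty result means there is nothing to recover from the file's tEXt chunks, which would be consistent with the PLOS images mentioned above.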

Converting this many equations into accurate TeX, MathML, or equivalent source text could be approached in three ways:

  • Manually rewrite the equations based on the PNGs (feasible, but time-consuming).
  • Obtain the source text for the math from publishers on a per-journal basis (not impossible, but probably impractical).
  • Extract the equations from the raster images automatically (seems unlikely).

Hmm, just had a clever thought.

Can we upload the equation images to Wikisource? In effect, these raster images are non-source text that need to be manually transcribed because the academic record does not currently preserve these data in a "free as in freedom" (i.e. plain text, machine-readable, re-usable) format.

In a way, this brings our project also very close to the common use case for Wikisource -- transcription!

There is not a better place for manual (or assisted or programmatic) transcription of license-compatible academic text that is stored in a non-usable format.

You get the same thing from scanning the first issues of Nature, as you would from including bitmap (PNG) files along with a digital open access article.

I think it qualifies under the guidelines and we should move in this direction: https://en.wikisource.org/wiki/Wikisource:Image_guidelines

Thanks for this one. I am so used to putting every media file up on Commons that I hadn't even considered the possibility of putting the equation images on Wikisource, but I agree that this sounds like a good solution.

Great!

And to address Max's initial concern, the Wikipedia guidelines show that vertical-align:middle; is the preferred display CSS for inline math, and the fact that equations may increase line-height (aka leading) is to be expected:
https://en.wikipedia.org/wiki/Wikipedia:Math#Alignment_with_normal_text_flow

So I think displaying these images inline as-is and, if necessary, wrapping each one in <span style="vertical-align:middle;"> and </span> should be sufficient.
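As a rough sketch of what the converter could emit for each equation image, assuming the image has been uploaded under a file name derived from its xlink:href (the ".png" extension and the function name inline_equation_wikitext are assumptions, not something the converter currently does):

```python
def inline_equation_wikitext(file_name, alt=""):
    """Wrap an uploaded equation image in a vertically centred span."""
    if alt:
        image = f"[[File:{file_name}|alt={alt}]]"
    else:
        image = f"[[File:{file_name}]]"
    return f'<span style="vertical-align:middle;">{image}</span>'


print(inline_equation_wikitext("pone.0024306.e003.png", alt="Equation (3)"))
```

Passing the equation label as alt text keeps the equation number available to readers who cannot see the image.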

Does this path work then?