what to do about equations relying on images

Question

notconfusing opened this issue 10 years ago · 11 comments

They don't need to go on Commons,
but it also won't work to upload them locally, because they won't look right if they are not inline,
as in
https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Ranking_Candidate_Disease_Genes_from_Gene_Expression_and_Protein_Interaction_A_Katz-Centrality_Based_Approach

what to do @Daniel-Mietchen ?

Answer 1 · 2014-07-25T15:04:55.000Z

What's the PMCID? If the XML contains the source TeX or MathML, then it should be rendered with MathJax on the wiki.
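
A minimal sketch of that check, assuming the article XML follows JATS conventions (TeX in `<tex-math>`, MathML in the standard MathML namespace). The function name `has_math_source` is hypothetical, and the sample string is a trimmed stand-in for real efetch output:

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def has_math_source(jats_xml: str) -> bool:
    """Return True if the XML contains any TeX or MathML markup."""
    root = ET.fromstring(jats_xml)
    # JATS stores TeX in <tex-math>; MathML elements live in their own namespace.
    has_tex = root.find(".//tex-math") is not None
    has_mathml = root.find(f".//{{{MATHML_NS}}}math") is not None
    return has_tex or has_mathml


# Trimmed example: a display formula that only references an image.
sample = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<disp-formula><graphic xlink:href="pone.0024306.e003"/>'
    '<label>(3)</label></disp-formula></article>'
)
print(has_math_source(sample))  # False
```

If this returns False for an article with equations, there is no source for MathJax to render and only the images remain.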

Answer 2 · 2014-07-25T15:31:09.000Z

PMCID: PMC3166320

https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Ranking_Candidate_Disease_Genes_from_Gene_Expression_and_Protein_Interaction_A_Katz-Centrality_Based_Approach

Max Klein
‽ http://notconfusing.com/

(In reply to Chris Maloney, Fri, Jul 25, 2014 at 5:04 PM.)

Answer 3 · 2014-07-25T21:36:35.000Z

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC3166320
does not have anything other than the images, it seems. Example:

Then we drop the superscript and write Eq. (2) on matrix format as
<disp-formula><graphic xlink:href="pone.0024306.e003"/><label>(3)</label></disp-formula>
where <bold>d</bold>  =  (1,…,1)<italic><sup>T</sup></italic>. Which gives
<disp-formula><graphic xlink:href="pone.0024306.e004"/><label>(4)</label></disp-formula>

The same goes for the XML available from PLOS directly. Seems to be a clear case for JATS4R. Will open a ticket there and have it point here.

Answer 4 · 2014-07-28T12:22:38.000Z

If this is detectable, which it seems to be from
<disp-formula><graphic..>, then we can detect it and warn on upload.
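
A rough sketch of what such a check could look like, assuming JATS-style XML with the usual xlink and MathML namespaces. The names `image_only_formulas` and `warn_on_upload` are made up for illustration and are not part of the existing import code:

```python
import warnings
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
MATHML_MATH = "{http://www.w3.org/1998/Math/MathML}math"


def image_only_formulas(jats_xml: str) -> list[str]:
    """Return the graphic hrefs of <disp-formula> elements with no TeX/MathML."""
    root = ET.fromstring(jats_xml)
    hrefs = []
    for formula in root.iter("disp-formula"):
        graphic = formula.find(".//graphic")
        has_source = (
            formula.find(".//tex-math") is not None
            or formula.find(f".//{MATHML_MATH}") is not None
        )
        if graphic is not None and not has_source:
            hrefs.append(graphic.get(XLINK_HREF, ""))
    return hrefs


def warn_on_upload(jats_xml: str) -> list[str]:
    """Emit a warning if any display formula is available only as an image."""
    hrefs = image_only_formulas(jats_xml)
    if hrefs:
        warnings.warn(
            f"{len(hrefs)} equation(s) are image-only: {', '.join(hrefs)}"
        )
    return hrefs


# Trimmed version of the PMC3166320 excerpt quoted earlier in this thread.
sample = (
    '<p xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<disp-formula><graphic xlink:href="pone.0024306.e003"/>'
    '<label>(3)</label></disp-formula>'
    '<disp-formula><graphic xlink:href="pone.0024306.e004"/>'
    '<label>(4)</label></disp-formula></p>'
)
flagged = warn_on_upload(sample)
```

Here `flagged` holds the two hrefs from the excerpt, and a warning is emitted once for the article.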

(In reply to Daniel Mietchen, Fri, Jul 25, 2014 at 11:36 PM.)

Answer 5 · 2014-07-28T23:41:03.000Z

Here is another example, in which <disp-formula><graphic..> is not used:

<p>Haplotype diversity was estimated as
</p>
<p><inline-graphic xlink:href="1471-2156-5-26-i1.gif"/>
</p>

(from https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Most_of_the_extant_mtDNA_boundaries_in_South_and_Southwest_Asia_were_likely_shaped_during_the_initial_settlement_of_Eurasia_by_anatomically_moder#Data_analysis ).
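
One possible heuristic for this second pattern, sketched under the assumption that an equation image shows up as an `<inline-graphic>` that is the only content of its paragraph. This is a guess and will also flag non-math images laid out the same way; the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def standalone_inline_graphics(jats_xml: str) -> list[str]:
    """Return hrefs of <inline-graphic> elements that are alone in a <p>."""
    root = ET.fromstring(jats_xml)
    hrefs = []
    for para in root.iter("p"):
        children = list(para)
        if len(children) != 1 or children[0].tag != "inline-graphic":
            continue
        # Ignore whitespace around the graphic; any real text disqualifies it.
        surrounding_text = (para.text or "") + (children[0].tail or "")
        if not surrounding_text.strip():
            hrefs.append(children[0].get(XLINK_HREF, ""))
    return hrefs


# The excerpt above, wrapped in a root element that declares the xlink prefix.
sample = (
    '<sec xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<p>Haplotype diversity was estimated as\n</p>\n'
    '<p><inline-graphic xlink:href="1471-2156-5-26-i1.gif"/>\n</p>'
    '</sec>'
)
print(standalone_inline_graphics(sample))  # ['1471-2156-5-26-i1.gif']
```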

Answer 6 · 2014-07-29T00:57:19.000Z

An example with loads of formula images is at https://en.wikisource.org/wiki/Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/In_Silico_Gene_Prioritization_by_Integrating_Multiple_Data_Sources .

Answer 7 · 2014-07-31T07:16:01.000Z

I have not found any relevant open source tools for this.

Using OCR on the PDF of such articles (via peerlibrary, which is stable) is very inaccurate (e.g. an upper-case Sigma character is matched with an upper-case "X" character): https://peerlibrary.org/p/ycXY3dk2LFGHsDfE2

There is one extraction library, but it requires the image files to contain original TeX or LaTeX in the file metadata, otherwise it won't work. It doesn't work with images from the named PLOS article above. http://www-cdf.fnal.gov/~cplager/latex/#png

Converting that many equations into accurate TeX, MathML or equivalent source text could be approached in three ways:

Manually rewrite the equations based on the PNGs (feasible, time-consuming).
Somehow receive the source text for the math from publishers on a per-journal basis (not impossible, but probably impractical).
Somehow extract the equations from the raster images (seems unlikely).

Answer 8 · 2014-07-31T07:26:56.000Z

Hmm, just had a clever thought.

Can we upload the equation images to Wikisource? In effect, these raster images are non-source text that needs to be manually transcribed, because the academic record does not currently preserve these data in a "free as in freedom" (i.e. plain text, machine-readable, re-usable) format.

In a way, this brings our project also very close to the common use case for Wikisource -- transcription!

There is not a better place for manual (or assisted or programmatic) transcription of license-compatible academic text that is stored in a non-usable format.

You get the same thing from scanning the first issues of Nature, as you would from including bitmap (PNG) files along with a digital open access article.

Answer 9 · 2014-07-31T07:33:51.000Z

I think it qualifies under the guidelines and we should move in this direction: https://en.wikisource.org/wiki/Wikisource:Image_guidelines

Answer 10 · 2014-07-31T09:46:23.000Z

Thanks for this one. I am so used to putting every media file up on Commons that I hadn't even considered the possibility of putting the equation images on Wikisource, but I agree that this sounds like a good solution.

Answer 11 · 2014-07-31T12:10:57.000Z

Great!

And to address Max's initial concern, the Wikipedia guidelines show that vertical-align:middle; is the preferred display CSS for inline math, and the fact that equations may increase line-height (aka leading) is to be expected:
https://en.wikipedia.org/wiki/Wikipedia:Math#Alignment_with_normal_text_flow

So I think displaying these images inline as-is and, if necessary, wrapping each one in <span style="vertical-align:middle;"> ... </span> should be sufficient.
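
As a sketch of how the importer might emit that markup: the helper name `inline_equation_wikitext` and the file name below are hypothetical, and how the result renders on Wikisource has not been checked here.

```python
def inline_equation_wikitext(file_name: str, wrap: bool = True) -> str:
    """Build wikitext for an inline equation image uploaded to the wiki."""
    image = f"[[File:{file_name}]]"
    if not wrap:
        return image
    # Optional wrapper, as suggested above, to centre the image on the text line.
    return f'<span style="vertical-align:middle;">{image}</span>'


# Hypothetical upload name for equation (3) of PMC3166320.
markup = inline_equation_wikitext("Pone.0024306.e003.png")
```

Passing `wrap=False` returns the bare image link, which covers the "as-is" case.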

Does this path work then?