danfickle/openhtmltopdf

MathML Support

m-a-t opened this issue · 14 comments

m-a-t commented

MathML seems to be the standard technology for embedding formulas in web pages.

Consider adding support in openhtmltopdf.

There is a project based on flyingsaucer which adds this support: https://mvnrepository.com/artifact/com.github.rjolly/flying-saucer/9.1.1
(It is GPL licensed, however.) Maybe @rjolly can contribute it to openhtmltopdf.

Hi mat,

I would be pleased to do so, however there are some caveats. The license is not an issue. I chose GPL because my fork is mainly a browser and not a library. But I do not even know if I have the right to do this, as flyingsaucer is LGPL. So do not hesitate to borrow anything from my project. The commit relevant to the MathML addition is rjolly/flying-saucer@f4c4a9d

However, at the moment it works only for screen rendering, not PDF. The reason is that I am using JEuclid, which itself uses FOP, whereas flying-saucer uses iText. So getting it to work in openhtmltopdf with PDFBox is a little more work than it seems, I think.

@rjolly Only having "screen" rendering, i.e. drawing to a Graphics2D, is perfectly fine. This is what PDFBox-Graphics2D is for. Even integrating JEuclid into openhtmltopdf for a quick test is easy using object drawers (#78):

        objectDrawerFactory.registerDrawer("custom/mathml", { e, x, y, width, height, outputDevice, ctx, dotsPerPixel
            ->
            val dummy: CustomEvent? = null
            var src = ""
            for (i in 0 until e.childNodes.length) {
                val item = e.childNodes.item(i)
                if (item is org.w3c.dom.CharacterData)
                    src = item.data
            }
            if (src.startsWith("<["))
                src = src.substring(2)
            val node = net.sourceforge.jeuclid.parser.Parser.getInstance().parse(StreamSource(StringReader(src)))
            val jeuclidDom = DOMBuilder.getInstance().createJeuclidDom(node)
            val realWidth = width / dotsPerPixel
            val realHeight = height / dotsPerPixel
            outputDevice.drawWithGraphics(x.toFloat(), y.toFloat(), realWidth.toFloat(), realHeight.toFloat(), { gfx ->
                /*
                 * Scale
                 */
                val viewer = jeuclidDom.defaultView
                //gfx.scale(realWidth / viewer.width, realHeight / (viewer.ascentHeight + viewer.descentHeight))

                /*
                 * And paint
                 */
                viewer.draw(gfx, 0f, 20f)
            });
        })
	<object type="custom/mathml" content="" class="mathml">
		<![CDATA[
		<math xmlns="http://www.w3.org/1998/Math/MathML">
			<mi>W</mi>
			<mo>&#x2009;</mo>
			<mo>=</mo>
			<mo>&#x2009;</mo>
			<mfrac>
				<mrow>
					<mi>Q</mi>
					<mo>&#x2009;</mo>
					<mo>x</mo>
					<mo>&#x2009;</mo>
					<mn>100</mn>
				</mrow>
				<mi>G</mi>
			</mfrac>
		</math>
		]]>
	</object>

This code is written in Kotlin, but you should get the idea. The work needed to cleanly integrate it into OpenHTMLToPDF to handle the math tag is not that much. It would work similarly to the SVG integration.
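For readers following along in Java rather than Kotlin, the step of the drawer that collects the MathML source from the object element's children could be sketched as below. This is only an illustration using the JDK's own DOM API; the class and method names (`MathMlSourceExtractor`, `extractSource`, `parseRoot`) are made up for this sketch and are not part of openhtmltopdf or JEuclid.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of openhtmltopdf or JEuclid.
class MathMlSourceExtractor {

    /** Joins all text and CDATA children of the element and trims surrounding whitespace. */
    static String extractSource(Element e) {
        StringBuilder src = new StringBuilder();
        NodeList children = e.getChildNodes();
        // getLength() is exclusive, so the loop stops at the last real child.
        for (int i = 0; i < children.getLength(); i++) {
            Node item = children.item(i);
            // Text covers plain text and CDATA sections, but not comments.
            if (item instanceof Text) {
                src.append(((Text) item).getData());
            }
        }
        return src.toString().trim();
    }

    /** Parses a small XML string and returns its root element (for the demo only). */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse sample markup", ex);
        }
    }

    public static void main(String[] args) {
        Element object = parseRoot(
                "<object type=\"custom/mathml\">\n"
                + "<![CDATA[\n<math><mi>W</mi></math>\n]]>\n"
                + "</object>");
        System.out.println(extractSource(object));
    }
}
```

Unlike the Kotlin snippet, which keeps only the last character-data child, this sketch concatenates all text children and trims the result, so whitespace around the CDATA section does not replace the formula.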

BUT: JEuclid does not work on JDK 9. It does some strange reflection stuff to register its elements. I only had this working on JDK 8. As nearly all my projects are on JDK 9 now (or at least targeting JDK 9), this is a show stopper for me. For the time being I faked the fractions I needed using a table...

So to integrate MathML support into OpenHTMLToPDF you would first need to:

  • Fork JEuclid and fix it for JDK9
  • Move the classes to your own domain packages (i.e. moving net.sourceforge.jeuclid to e.g. com.github.rjolly.jeuclid)
  • Release it on maven central (which is only possible if you release it with your own domain prefix).

As I don't really need MathML support, I have not pursued this further.

I just started working on this. However, in addition to the Java 9 problem, JEuclid depends on Batik 1.7 while our SVG support requires Batik 1.9. This would make it difficult or impossible to build a project that requires both. We really need a fork as @rototor suggests. Anyone game?

@danfickle I'll give it a try, I will fork jeuclid and try to get it working with Batik 1.9 and Java 9.

@danfickle I got it working in this fork: https://github.com/rototor/jeuclid - the main problem with JDK 9 is that Batik ships some additional org.w3c.dom.events classes, and JDK 9 has the XML module, which also defines classes in this package. With Java 9 you can no longer add classes to a package that is defined in a module. So the additional event classes Batik defines are not found any more on Java 9 ....
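The module ownership described above can be checked directly on a Java 9+ runtime. The sketch below assumes the package in question is `org.w3c.dom.events` (the package in which Batik's extra event classes such as `CustomEvent` live) and asks the JVM which module defines the JDK's own `Event` interface. The class name `SplitPackageCheck` and its helper method are made up for this illustration.

```java
// Hypothetical demo class; requires Java 9 or newer (Class.getModule()).
class SplitPackageCheck {

    /** Returns the name of the module defining the class, or null for the unnamed (class path) module. */
    static String owningModule(Class<?> type) {
        return type.getModule().getName();
    }

    public static void main(String[] args) {
        // The JDK's DOM event interfaces are defined by the java.xml module.
        System.out.println(owningModule(org.w3c.dom.events.Event.class));
    }
}
```

Because the package already belongs to a named module, classes for the same package that sit in a jar on the class path are not visible, which matches the behaviour reported above.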

You can check out the source and run mvn install. You could then use version 3.1.10-SNAPSHOT as a dependency.

I'll make a de.rototor.jeuclid release, but I must first clean up and rename some stuff in JEuclid for that. I don't think that @maxberger would like to release JEuclid under the old namespace with these changes, as it loses some features (mainly dynamic DOM change support). I'll also remove FOP and SWT support. So it can take some time until there is an official Maven Central artefact.

@danfickle I've just released version 3.1.11 of de.rototor.jeuclid (had a test driver problem with 3.1.10, so the release:perform did not finish ...)

Feel free to start the integration work.

@maxberger Sorry, I won't take over maintainership of JEuclid, as I only need the subset required for the integration in openhtmltopdf. I plan to add a small (La-)TeX -> MathML converter to allow entering formulas as TeX math in OpenHTMLToPDF, because MathML is - sorry - a tag mess, especially if you auto-format the HTML sources in IntelliJ. So my usage of JEuclid will be purely as a renderer.

@danfickle Are you going to work on this? Otherwise I will try to integrate this. For LaTeX integration I found SnuggleTeX, which is not yet available on Maven Central and uses JEuclid under the hood. I will likely fork it and release it on Maven Central, similar to JEuclid.

Thanks @m-a-t @maxberger @rjolly @rototor

@rototor - sorry I missed your offer of help on this, however the good news is that there is plenty to do! I've uploaded something that works on one example, but still to do are:

  • fonts
  • sizing
  • whether to include the STIX maths font
  • Java 2D support
  • MathML XML entities
  • Documentation

@rototor - I have been investigating fonts for MathML and there are a couple of issues:

  • The DefaultFontFactory picks up fonts from the environment. This is a recipe for a mess as it may work on development environments and not on servers, etc. The simplest solution would probably be to move the two constructor methods to separate methods.
  • The font factory is set once per JVM to a DefaultFontFactory instance. This means that in theory one can not use different font setups for different runs. This is probably a theoretical issue, but if you thought it was worth fixing, the easiest way might be to add a method on FontFactory to set a thread FontFactory and store it in a thread local. Then on getInstance, return the thread local copy or the default if the thread local is null.

Thanks, and let me know if you need a pull-request for this.

@danfickle Feel free to send me a pull request for this. It should be possible to change the font factory per thread, as otherwise this is only going to be a big mess in a container (e.g. Tomcat, JBoss, ...) environment.

So FontFactory.getInstance() should be backed by a ThreadLocal and should also be settable. Of course it should default to the DefaultFontFactory. And when doing the cleanup in the PdfBoxRenderer it should be set back to null if possible to avoid memory/class loader leaks. But this is something which I can do later on.
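The thread-local lookup with a default fallback described above could be sketched as follows. This is only an illustration of the pattern: `FontSource` is a made-up stand-in for JEuclid's FontFactory, and `ThreadLocalFontSetup`, `setThreadInstance` and `nameSeenByNewThread` are hypothetical names, not JEuclid's actual API.

```java
// Hypothetical sketch of a per-thread factory override; not JEuclid's real classes.
class ThreadLocalFontSetup {

    /** Made-up stand-in for a font factory; only the lookup pattern matters here. */
    interface FontSource {
        String name();
    }

    private static final FontSource DEFAULT = () -> "default";
    private static final ThreadLocal<FontSource> PER_THREAD = new ThreadLocal<>();

    /** Returns this thread's override if one is set, otherwise the shared default. */
    static FontSource getInstance() {
        FontSource override = PER_THREAD.get();
        return override != null ? override : DEFAULT;
    }

    /** Sets the override for the current thread; passing null removes it again. */
    static void setThreadInstance(FontSource source) {
        if (source == null) {
            PER_THREAD.remove(); // avoids keeping the instance (and its class loader) alive
        } else {
            PER_THREAD.set(source);
        }
    }

    /** Reports which factory a freshly started thread sees. */
    static String nameSeenByNewThread() {
        String[] seen = new String[1];
        Thread t = new Thread(() -> seen[0] = getInstance().name());
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        setThreadInstance(() -> "bundled-stix");
        try {
            System.out.println(getInstance().name());   // this thread's override
            System.out.println(nameSeenByNewThread());  // other threads fall back to the default
        } finally {
            setThreadInstance(null);                     // cleanup, as a renderer would do when done
        }
        System.out.println(getInstance().name());
    }
}
```

Clearing the override in a `finally` block mirrors the cleanup step mentioned above, which matters in containers where threads are pooled and reused.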

I won't have time to look into this myself until next weekend. But if you send me a pull request I can release a new version, as this does not take that long.

The STIX 2.0.0 math fonts were only available in OTF format, so I ran them through FontSquirrel to generate TrueType versions and uploaded them here together with a stylesheet to use them.

Note: Entity support was not yet baked in when this was first posted. MathML now has entity support.

Please advise here whether this font package is useful or has issues, so we can consider adding it as resources in a future version.
stix-fonts.zip

<!DOCTYPE html PUBLIC
"-//OPENHTMLTOPDF//MATH XHTML Character Entities With MathML 1.0//EN"
"">
<html>
<head>
<link rel="stylesheet" href="stix-fonts/stylesheet.css" />
<style>
body {
 font-family: sans-serif;
}
math {
  width: 100%;
}
</style>
</head>
<body>
<h1>MathML</h1>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<mrow>
  <mi>x</mi>
  <mo>=</mo>
  <mfrac>
    <mrow>
      <mrow>
        <mo>-</mo>
        <mi>b</mi>
      </mrow>
      <mo>&#x000b1;</mo>
      <msqrt>
        <mrow>
          <msup>
            <mi>b</mi>
            <mn>2</mn>
          </msup>
          <mo>-</mo>
          <mrow>
            <mn>4</mn>
            <mo>&#8290;</mo>
            <mi>a</mi>
            <mo>&#8290;</mo>
            <mi>c</mi>
          </mrow>
        </mrow>
      </msqrt>
    </mrow>
    <mrow>
      <mn>2</mn>
      <mo>&#8290;</mo>
      <mi mathvariant="bold-italic">a</mi>
    </mrow>
  </mfrac>
</mrow>
</math>

<h2>End</h2>

</body>
</html>

Result:
mathml-support

MathML rendering support is now finished. Please open a new issue if you have any issues with it. Thanks.