mathjax/MathJax-demos-node

tex2svg-page: error handling non-math symbols

dkumor opened this issue · 5 comments

Running tex2svg-page on the following html file succeeds:

<html>
  <body>
    <div>$$ x^2 $$</div>
  </body>
</html>

However, once you add the copyright symbol, anywhere in the page, it fails:

<html>
  <body>
    <div>$$ x^2 $$</div>
    &copy;
  </body>
</html>

with the following error:

MathJax(?): Can't find handler for document
(node:2116796) UnhandledPromiseRejectionWarning: Error: Cannot find module 'mathjax-full/es5/util/entities/c.js'

This behavior doesn't change even if the symbol is placed inside a footer tag and the footer tag is added to the list of ignored tags in the MathJax configuration. If my understanding here is correct, this suggests two possible issues:

  1. There are missing files in the node mathjax distribution
  2. MathJax processes areas of html that it should ignore

My use case is pre-rendering math for a static website. I'm specifically looking for a script that will output an html file with math rendered in the same way as if the MathJax javascript were included on the page, and tex2svg-page seemed like a perfect fit here. I'd really appreciate any guidance you might have on these issues.

dpvc commented

Thanks for the report. I've looked into it and here is what is going on:

Because the number of entities (like &copy;) is fairly large, MathJax only has a limited number of them built in, and those are the ones designed for use in mathematics (e.g., &CircleDot; and &GreaterEqual;). The others are stored in external files that are loaded as needed. In your case, &copy; is not on the default list, so MathJax will try to load it when you use it.

In a browser, this loading is asynchronous, so it involves some careful hand-shaking to manage the delay while waiting for the file to arrive. The code surrounding the call that translates the entity must be aware that the delay can occur (and so must any code surrounding that, and so on).

In node applications, since there is no browser DOM, MathJax uses a light-weight implementation of the browser DOM called the LiteDOM. When MathJax tries to determine which handler to use for a particular document, it asks the handlers it knows about to see if they can handle the document. For the LiteDOM handler, it tries to parse the document to see if it can do so, and if it can, then MathJax uses that handler.

But during the parsing of the document containing &copy;, the LiteDOM has to load the external file for the entities beginning with "c" (the file referenced in the error message you cite). That involves the asynchronous calls for loading files described above; but the code that is looking for the handler doesn't expect the check to be asynchronous, and that leads to the crash that you are seeing.
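The failure mode described above can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not MathJax's actual code: every name in it (`known`, `externalFiles`, `translate`, `handlesDocument`, `parseWithRetry`) is hypothetical, and it assumes that the synchronous check treats any exception during parsing as "this handler cannot handle the document".

```javascript
// Simplified model only; every name here is hypothetical, not a MathJax API.

// Entities available immediately (the "built in" set).
const known = new Map([['lt', '<'], ['gt', '>'], ['amp', '&']]);

// Stand-in for the external per-letter entity files.
const externalFiles = { c: { copy: '\u00A9' } };
const loadedLetters = new Set();

// Asynchronously "load" the file for one letter and merge it into the table.
function loadEntityFile(letter) {
  return Promise.resolve().then(() => {
    for (const [name, char] of Object.entries(externalFiles[letter] || {})) {
      known.set(name, char);
    }
    loadedLetters.add(letter);
  });
}

// Translate one entity name. If its file has not been loaded yet, start the
// load and throw an error that carries the pending promise.
function translate(name) {
  if (known.has(name)) return known.get(name);
  const letter = name.charAt(0).toLowerCase();
  if (loadedLetters.has(letter)) return '&' + name + ';'; // unknown name: keep as is
  const error = new Error('entity file not loaded yet');
  error.retry = loadEntityFile(letter);
  throw error;
}

// Parse by translating named entities; may throw the retry error above.
function parse(text) {
  return text.replace(/&([a-zA-Z][a-zA-Z0-9]*);/g, (match, name) => translate(name));
}

// Synchronous probe (assumption): any exception means "cannot handle".
function handlesDocument(text) {
  try {
    parse(text);
    return true;
  } catch (err) {
    return false;
  }
}

// An async-aware caller waits for the pending load and then tries again.
async function parseWithRetry(text) {
  for (;;) {
    try {
      return parse(text);
    } catch (err) {
      if (!err.retry) throw err;
      await err.retry;
    }
  }
}
```

In this model, `handlesDocument('&copy;')` reports failure on its first call because the load has only just been started, while `parseWithRetry('&copy;')` succeeds because it waits for the load. If the table is already fully populated before the probe runs, the probe never hits the asynchronous path at all.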

That certainly is a bug, and I will have to think about how best to handle it.

In the meantime, one way to work around the problem is to load all the entities up front so that the definition of &copy; will be there when you need it. That can be done by adding

require('mathjax-full/js/util/entities/all.js');

to the tex2svg-page program after the line that loads AllPackages. That will allow the program to process the file. (I'm assuming you are using the copy in the direct subdirectory, here.)

This does lead to a second issue, however: the output of the program will no longer contain the entity, but will have the actual unicode character instead. This is because, just as in a browser, the LiteDOM translates entities to characters while it is parsing the file. This is necessary because you could have something like

When $x &lt; y$, we have ...

in your file, and the &lt; needs to be converted to <. It is even possible that you have

When &#x24;x &lt; y&#x24;, we have...

and the &#x24; must be converted to $ before MathJax looks for math delimiters. So the entity translation is an important step in the processing of the page. That means &copy; will be translated to © during the parsing of the document, and will be output as a unicode character, not a named entity, in the final result.
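To see why the order matters, here is a minimal stand-alone sketch. It does not use MathJax; `decodeEntities`, `findInlineMath`, and the tiny `NAMED` table are hypothetical names invented for illustration, and the delimiter search is deliberately naive.

```javascript
// Hypothetical helpers for illustration; the entity table is deliberately tiny.
const NAMED = new Map([['lt', '<'], ['gt', '>'], ['amp', '&'], ['copy', '\u00A9']]);

// Replace numeric (&#x24; or &#36;) and known named entities with characters.
function decodeEntities(text) {
  const pattern = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g;
  return text.replace(pattern, (match, body) => {
    if (body.charAt(0) === '#') {
      const isHex = body.charAt(1).toLowerCase() === 'x';
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range references untouched rather than throwing.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    return NAMED.has(body) ? NAMED.get(body) : match;
  });
}

// Naive search for $...$ inline math; returns the TeX between the delimiters.
function findInlineMath(text) {
  const found = [];
  const pattern = /\$([^$]+)\$/g;
  let m;
  while ((m = pattern.exec(text)) !== null) {
    found.push(m[1]);
  }
  return found;
}

const source = 'When &#x24;x &lt; y&#x24;, we have...';
console.log(findInlineMath(source));                 // no delimiters visible yet
console.log(findInlineMath(decodeEntities(source))); // the math is found after decoding
```

Searching the raw text finds nothing, because the dollar signs are still hidden inside `&#x24;`; only after decoding does the expression `x < y` become visible to the delimiter search.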

If you want to have &copy; in the final result, you will have to convert back from unicode characters to entities. That can be done by adding

function toEntity(c) {
  return '&#x' + c.charCodeAt(0).toString(16).toUpperCase() + ';';
}

const LiteParser = require('mathjax-full/js/adaptors/lite/Parser.js').LiteParser;
LiteParser.prototype.protectHTML = function (text) {
  return text.replace(/&/g, '&amp;')
             .replace(/</g, '&lt;')
             .replace(/>/g, '&gt;')
             .replace(/[^\u0000-\u007E]/g, toEntity);
}

just after the require() statement I gave you above. That will cause all non-ASCII characters to be rendered as entities. But they will be numeric entities (like &#xA9;), not named entities (like &copy;).

To get named entities, you can do something like this:

const entityName = {
  0xA9 : 'copy'
};

function toEntity(c) {
  const n = c.charCodeAt(0);
  return '&' + (entityName[n] || '#x' + n.toString(16).toUpperCase()) + ';';
}

const LiteParser = require('mathjax-full/js/adaptors/lite/Parser.js').LiteParser;
LiteParser.prototype.protectHTML = function (text) {
  return text.replace(/&/g, '&amp;')
             .replace(/</g, '&lt;')
             .replace(/>/g, '&gt;')
             .replace(/[^\u0000-\u007E]/g, toEntity);
}

where you list the entities that you want to turn back into named entities. It would also be possible to generate the entityName list from the original name-to-character mapping using

const entities = require('mathjax-full/js/util/Entities.js').entities;
const entityName = {};
Object.keys(entities).forEach((name) => entityName[entities[name].codePointAt(0)] = name);

rather than giving the list yourself. If you want to include (unencoded) unicode in your document, then you might need to adjust the regex in the last replace() in the protectHTML function to exclude the characters you don't want encoded, or make the toEntity() function more sophisticated so that it only encodes the characters that you want it to.
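As one possible shape for a "more sophisticated" toEntity(), here is a stand-alone sketch that leaves a chosen set of characters unencoded. The `keep` set and its contents are an arbitrary, hypothetical choice, and `protectHTML` is written as a plain function so it can be tried outside MathJax; to use it in tex2svg-page it would be assigned to LiteParser.prototype.protectHTML as in the snippets above. The `u` flag and codePointAt() are additions so that characters outside the Basic Multilingual Plane are encoded as one entity rather than as two surrogate halves.

```javascript
// Characters to turn back into named entities (extend as needed).
const entityName = {
  0xA9: 'copy'
};

// Hypothetical allow-list: non-ASCII characters that should stay unencoded.
const keep = new Set(['\u00E9', '\u00FC']); // e-acute and u-umlaut

// Encode one character, unless it is on the allow-list.
function toEntity(c) {
  if (keep.has(c)) return c;
  const n = c.codePointAt(0);
  return '&' + (entityName[n] || '#x' + n.toString(16).toUpperCase()) + ';';
}

// Same replacement chain as above, with the "u" flag so that each match
// is a whole code point.
function protectHTML(text) {
  return text.replace(/&/g, '&amp;')
             .replace(/</g, '&lt;')
             .replace(/>/g, '&gt;')
             .replace(/[^\u0000-\u007E]/gu, toEntity);
}

console.log(protectHTML('\u00A9 caf\u00E9 <b>'));
```

An alternative with the same effect would be to exclude the kept characters directly in the character class of the last replace(), which avoids the extra set lookup but is harder to maintain as the list grows.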

In any case, I think you can get the results you want this way.

dkumor commented

Thanks for the detailed response! I am happy to allow unicode characters in the output, so the replacement code is not necessary for me.

However, I have tried the suggestion of adding require('mathjax-full/js/util/asyncLoad/node.js'); here: https://github.com/mathjax/MathJax-demos-node/blob/master/direct/tex2svg-page#L36 , but this did not fix the issue of crashing on &copy;. Here is the exact code used:

tex2svg-page modified code
#! /usr/bin/env -S node -r esm

/*************************************************************************
 *
 *  direct/tex2svg-page
 *
 *  Uses MathJax v3 to convert all TeX in an HTML document.
 *
 * ----------------------------------------------------------------------
 *
 *  Copyright (c) 2018 The MathJax Consortium
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */

//
//  Load the packages needed for MathJax
//
const mathjax = require('mathjax-full/js/mathjax.js').mathjax;
const TeX = require('mathjax-full/js/input/tex.js').TeX;
const SVG = require('mathjax-full/js/output/svg.js').SVG;
const liteAdaptor = require('mathjax-full/js/adaptors/liteAdaptor.js').liteAdaptor;
const RegisterHTMLHandler = require('mathjax-full/js/handlers/html.js').RegisterHTMLHandler;

const AllPackages = require('mathjax-full/js/input/tex/AllPackages.js').AllPackages;
require('mathjax-full/js/util/asyncLoad/node.js');

//
//  Get the command-line arguments
//
var argv = require('yargs')
    .demand(1).strict()
    .usage('$0 [options] file.html > converted.html')
    .options({
        em: {
            default: 16,
            describe: 'em-size in pixels'
        },
        ex: {
            default: 8,
            describe: 'ex-size in pixels'
        },
        packages: {
            default: AllPackages.sort().join(', '),
            describe: 'the packages to use, e.g. "base, ams"'
        },
        fontCache: {
            default: 'global',
            describe: 'cache type: local, global, none'
        }
    })
    .argv;

//
//  Read the HTML file
//
const htmlfile = require('fs').readFileSync(argv._[0], 'utf8');

//
//  Create DOM adaptor and register it for HTML documents
//
const adaptor = liteAdaptor({fontSize: argv.em});
RegisterHTMLHandler(adaptor);

//
//  Create input and output jax and a document using them on the content from the HTML file
//
const tex = new TeX({packages: argv.packages.split(/\s*,\s*/)});
const svg = new SVG({fontCache: argv.fontCache, exFactor: argv.ex / argv.em});
const html = mathjax.document(htmlfile, {InputJax: tex, OutputJax: svg});

//
//  Typeset the document
//
html.render();

//
//  Output the resulting HTML
//
console.log(adaptor.outerHTML(adaptor.root(html.document)));

The error message is:

node_modules/mathjax-full/js/core/HandlerList.js:1
Error: Can't find handler for document

Unfortunately I am not too familiar with the MathJax code - is something more than the require needed?

dpvc commented

Sorry, my fault. I copied the wrong line. It should have been

require('mathjax-full/js/util/entities/all.js');

I will change it in the original message, in case anyone else looks for the solution here.

dkumor commented

Wonderful! This worked great! One little annoyance that was easy to work around is that tex2svg-page makes the <!DOCTYPE html> declaration disappear, which makes Chrome go into quirks mode. I just made it prepend that declaration to the output file.
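For anyone wanting the same workaround, a minimal sketch might look like the following. The helper name `withDoctype` is hypothetical, and it assumes the HTML5 doctype is the one wanted.

```javascript
// Hypothetical helper: add an HTML5 doctype unless the text already starts with one.
function withDoctype(html) {
  return /^\s*<!doctype\s/i.test(html) ? html : '<!DOCTYPE html>\n' + html;
}

console.log(withDoctype('<html><body></body></html>'));
```

In tex2svg-page, this would presumably wrap the string passed to the final console.log() call, i.e. the result of adaptor.outerHTML(adaptor.root(html.document)).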

For my purposes, this issue is solved. I will leave it open since, as I understand it, it does expose a bug, but feel free to close it once it is no longer useful.

Thank you very much for your help!

dpvc commented

Thanks for confirming that it worked for you.

Good luck with your project.