mathjax/MathJax-demos-node

Rendering Multiple Pages

swamidass opened this issue · 4 comments

I have https://github.com/mathjax/MathJax-demos-node/blob/master/simple/tex2svg-page up and running, but I'd like to know how to modify it to process multiple pages.

Do I need to call "init" for each new page? How do I start up the model for each page? Do I need to avoid concurrency?

dpvc commented

Do I need to call "init" for each new page?

It is probably possible to do so, but you don't need to. Instead, I'd call MathJax.startup.getComponents() when you need to start the new file in order to set up a new document with clean input and output jax (so that, for example, you don't have definitions or equation numbers and labels from one page bleeding into the next page).

How do I start up the model for each page?

I give an example script below.

Do I need to avoid concurrency?

Well, since JavaScript is single-threaded, there isn't much concurrency possible. MathJax will only pause (and allow other JavaScript to run) if it needs to load an extension, which is generally asynchronous in the browser, and mediated by promises. So if you want to allow autoloading of extensions or use of the \require macro, you want to use the promise-based calls, and use those to coordinate your processing. With the "simple" approach, as you are using here, you do need to be concerned about calling getDocument() again before the previous one is completed, so proper handling of the promises is important. With the "direct" approach, you could be more flexible with that.


Here's an example of how to process multiple files:

#! /usr/bin/env -S node -r esm

//
//  The default TeX packages to use
//
const PACKAGES = 'base, autoload, require, ams, newcommand';

//
//  Get the command-line arguments
//
var argv = require('yargs')
    .demand(0).strict()
    .usage('$0 [options] < filelist')
    .options({
        em: {
            default: 16,
            describe: 'em-size in pixels'
        },
        ex: {
            default: 8,
            describe: 'ex-size in pixels'
        },
        packages: {
            default: PACKAGES,
            describe: 'the packages to use, e.g. "base, ams"; use "*" to represent the default packages, e.g, "*, bbox"'
        },
        fontCache: {
            default: 'global',
            describe: 'cache type: local, global, or none'
        },
        dist: {
            boolean: true,
            default: false,
            describe: 'true to use webpacked version, false to use MathJax source files'
        }
    })
    .argv;

//
//  Read the HTML file names
//
const fs = require('fs');
const htmlfiles = fs.readFileSync(0, 'utf8').split(/[\n\r]+/);

//
//  Load and initialize MathJax
//
require('./components/src/node-main/node-main.js').init({
    //
    //  The MathJax configuration
    //
    loader: {
        source: (argv.dist ? {} : require('./components/src/source.js').source),
        load: ['adaptors/liteDOM', 'tex-svg']
    },
    tex: {
        packages: argv.packages.replace('\*', PACKAGES).split(/\s*,\s*/)
    },
    svg: {
        fontCache: argv.fontCache,
        exFactor: argv.ex / argv.em
    },
    'adaptors/liteDOM': {
        fontSize: argv.em
    },
    startup: {
        document: '',
        typeset: false
    }
}).then(async (MathJax) => {
  const startup = MathJax.startup;
  //
  // Loop through the file names...
  //
  for (const filename of htmlfiles) {
    if (!filename) continue;

    //
    //  Get the contents of the file
    //    and re-initialize MathJax for it
    //
    const lines = fs.readFileSync(filename, 'utf8');
    MathJax.config.startup.document = lines;
    startup.getComponents();
    //
    //  Typeset the math (allowing for asynchronous loading of packages, if needed)
    //
    await MathJax.typesetPromise();
    //
    //  Remove the SVG global cache, if there is no math on the page
    //
    const adaptor = MathJax.startup.adaptor;
    const html = MathJax.startup.document;
    if (Array.from(html.math).length === 0) {
      adaptor.remove(html.outputJax.svgStyles);
      const cache = adaptor.elementById(adaptor.body(html.document), 'MJX-SVG-global-cache');
      if (cache) adaptor.remove(cache);
    }
    //
    //  Write the page to a new HTML file
    //
    const nlines = adaptor.doctype(html.document) + '\n' + adaptor.outerHTML(adaptor.root(html.document));
    // Escape the dot and anchor at the end so only a trailing ".html" is replaced
    fs.writeFileSync(filename.replace(/\.html$/, '-new.html'), nlines, 'utf8');
  }
}).catch(err => console.log(err));

Thanks for the detailed answer! I found that using the "direct" examples as a starting point worked well, with cleaner code...


const {mathjax} = require('mathjax-full/js/mathjax.js');
const {TeX} = require('mathjax-full/js/input/tex.js');
const {SVG} = require('mathjax-full/js/output/svg.js');
const {liteAdaptor} = require('mathjax-full/js/adaptors/liteAdaptor.js');
const {RegisterHTMLHandler} = require('mathjax-full/js/handlers/html.js');
const {AllPackages} = require('mathjax-full/js/input/tex/AllPackages.js');
require('mathjax-full/js/util/entities/all.js');


const adaptor = liteAdaptor({fontSize: 16});
RegisterHTMLHandler(adaptor);

const tex = new TeX({inlineMath: [['$', '$'], ['\\(', '\\)']]});
const svg = new SVG({fontCache: "local", exFactor: 0.5});

async function render_mathjax(html) {
  const mj = mathjax.document(html, {InputJax: tex, OutputJax: svg});
  mj.render();
  html = adaptor.doctype(mj.document) + "\n" ;
  html += adaptor.outerHTML(adaptor.root(mj.document));
  return html;
}

That last function, render_mathjax, works in parallel on multiple documents.

Are there any side effects I should be wary of here?

dpvc commented

Are there any side effects I should be wary of here?

Yes, there are a number of issues here.

Since you are reusing the input and output jax, they will retain any state from the previous document that you used (this is useful if you are building a common CSS file that covers several HTML files, for example). That means that things like macro definitions and label definitions (and other similar values) from one file will still be in place when a second file is processed. That can cause unwanted results in the second file (like duplicate label errors, for example, if the second file uses the same label as the first).

There is also some state in the output jax (more so for CHTML, but still some for SVG). In particular, once the output jax adds its style sheet to the document, it won't do so again, and so the styles will not be present in the second and later files.

Some of this could be reset using mj.reset({all: true}) before mj.render(), but that won't reset the macros. So it is better if you re-instantiate the input and output jax in between files, so that you are sure to have a clean version of each.
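To make the state problem concrete, here is a toy model of it. This is not MathJax code: ToyInputJax and its process() method are hypothetical stand-ins that only remember \label names, which is enough to show why a shared instance reports a duplicate label on the second page while a fresh instance per page does not.

```javascript
// Toy stand-in for an input jax that remembers labels between documents.
// NOT MathJax code: ToyInputJax and process() are hypothetical.
class ToyInputJax {
  constructor() {
    this.labels = new Set();
  }

  // Record every \label{...} found in the text and report repeats.
  process(text) {
    const errors = [];
    for (const match of text.matchAll(/\\label\{([^}]*)\}/g)) {
      const name = match[1];
      if (this.labels.has(name)) {
        errors.push(`duplicate label: ${name}`);
      } else {
        this.labels.add(name);
      }
    }
    return errors;
  }
}

// Two pages that happen to use the same label name.
const pages = ['$$x \\label{eq:1}$$', '$$y \\label{eq:1}$$'];

// Shared instance: the second page trips over the first page's label.
const shared = new ToyInputJax();
const sharedErrors = pages.map((page) => shared.process(page));

// Fresh instance per page: each page starts from a clean slate.
const freshErrors = pages.map((page) => new ToyInputJax().process(page));

console.log(sharedErrors);
console.log(freshErrors);
```

The real input jax holds much more state than this (macros, equation numbers, and so on), but the pattern is the same: creating the objects per document is the simplest way to keep pages independent.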

Another problem is that, although you have loaded the AllPackages file, you have not configured the input jax to use any of the packages, so only the base package is used, and many macros will not be defined. You need to include packages: AllPackages in the tex configuration section to do that.

Since you are processing whole pages at a time, I would recommend using fontCache: 'global', since that will reduce the size of the files (characters will only need to be stored once for the whole page rather than once per expression). Unless you are planning to extract the individual SVG files, this will be much more efficient.

Finally, there are no asynchronous actions in your render_mathjax() function, so it does not need to be async. You say it works "in parallel" with multiple documents, but that is not really what happens, as the function runs atomically (no other JavaScript runs until it completes), so even if you end up calling it multiple times via promises, you just get a lot of pending calls that each run serially. That just uses resources needlessly stacking up the promises, so you might as well just run them serially right off the bat. That will be more efficient. (And if you were trying to handle asynchronous loading issues in MathJax, which you aren't since you have loaded all the packages and entities directly, you would not be able to share input and output jax among the different documents.)
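A small experiment shows the "runs atomically" point without involving MathJax at all. In this sketch, renderSync is a hypothetical stand-in for an async function whose body contains no await (the loop merely simulates work); launching three calls through Promise.all still produces strictly sequential start/end pairs.

```javascript
// Records the order in which work starts and finishes.
const log = [];

// Declared async, but the body has no await, so each call runs from
// start to finish without yielding to any other JavaScript.
async function renderSync(name) {
  log.push(`start ${name}`);
  let total = 0;
  for (let i = 0; i < 1e5; i++) total += i; // stand-in for mj.render()
  log.push(`end ${name}`);
  return total;
}

// "Parallel" launch: each call completes before the next one begins.
const done = Promise.all(['a', 'b', 'c'].map(renderSync)).then(() => {
  console.log(log.join(', '));
  // prints: start a, end a, start b, end b, start c, end c
});
```

The calls never interleave, so wrapping them in promises adds bookkeeping but no overlap.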

So I'd recommend the following:

function render_mathjax(html) {
  const tex = new TeX({
    inlineMath: [['$', '$'], ['\\(', '\\)']],
    packages: AllPackages
  });
  const svg = new SVG({fontCache: "global", exFactor: 0.5});
  const mj = mathjax.document(html, {InputJax: tex, OutputJax: svg});
  mj.render();
  html = adaptor.doctype(mj.document) + "\n" ;
  html += adaptor.outerHTML(adaptor.root(mj.document));
  return html;
}

Just call render_mathjax() on the contents of each file, one at a time. There is no concurrency to be had here.
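One way to drive that, as a minimal sketch: the helper names renderFiles and outputName are made up for illustration, and the "-new" suffix simply mirrors the earlier script. The renderer is passed in as a parameter, so this assumes only that it is a synchronous function from an HTML string to an HTML string, as render_mathjax() above is.

```javascript
const fs = require('fs');
const path = require('path');

// Derive the output name: page.html -> page-new.html (naming is illustrative).
function outputName(filename) {
  const ext = path.extname(filename);
  return filename.slice(0, filename.length - ext.length) + '-new' + ext;
}

// Render each file in turn. `render` is any synchronous function from an
// HTML string to an HTML string, such as render_mathjax() above.
function renderFiles(filenames, render) {
  const written = [];
  for (const filename of filenames) {
    if (!filename) continue; // skip blank entries in a file list
    const html = fs.readFileSync(filename, 'utf8');
    const target = outputName(filename);
    fs.writeFileSync(target, render(html), 'utf8');
    written.push(target);
  }
  return written;
}
```

With the MathJax setup from the earlier snippet in scope, a call such as renderFiles(process.argv.slice(2), render_mathjax) would process the files named on the command line in order.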

Thank you. Very much appreciated.

I'd suggest making something like this function available in the main API, as it does seem to be a common use case.