mathjax/MathJax-demos-node

How to get the input TeX character ranges corresponding to SVG nodes?

mmmkkaaayy opened this issue · 5 comments

Hi, I'm rendering tex2svg like so (TypeScript, Electron, React)

import {mathjax} from 'mathjax-full/js/mathjax';
import {TeX} from 'mathjax-full/js/input/tex';
import {CHTML} from 'mathjax-full/js/output/chtml';
import {SVG} from 'mathjax-full/js/output/svg';
import {liteAdaptor} from 'mathjax-full/js/adaptors/liteAdaptor';
import {RegisterHTMLHandler} from 'mathjax-full/js/handlers/html';

componentDidMount() {
  const tex = new TeX({});
  const adaptor = liteAdaptor();
  RegisterHTMLHandler(adaptor);
  const svg = new SVG({});
  const html = mathjax.document('', {InputJax: tex, OutputJax: svg});
  const node = html.convert(this.props.tex, {});
  this.ref!.current!.innerHTML = adaptor.outerHTML(node);
}

I would like to find the corresponding character ranges in the input TeX for each SVG node that's rendered. I believe I should be using getMetrics() for this, but I'm not sure how. I tried console.log(html.getMetrics().math.toArray()); but it logs an empty array.

Side question: Is this the best way to render TeX in a React component? Which of the variables that I created can be used across multiple instances of the component?

dpvc commented

I'm not sure what "the corresponding character ranges" means. The getMetrics() function is for determining the em and ex sizes of the surrounding font, and the width of the container element (things that MathJax needs during its typesetting process), so I don't think that is what you want here. But I'm really not sure what you do want. Can you be more explicit about your needs?

Is this the best way to render TeX in a React component?

You might consider the mathjax-react package, which implements React components for MathJax. See if that works for you out of the box; if not, you could look at its implementation and see if it gives you any ideas.

Sorry for the imprecise wording!

If I render the TeX input string a \cdot b, I want to have a mapping between the output rendered nodes and their indices in the original TeX string.

Something like:

'a': (0,1)
'\cdot': (2,7)
'b': (8,9)

But the keys 'a', '\cdot', and 'b' should actually be pointers to the rendered SVG DOM nodes.
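
To make the desired ranges concrete, here is a minimal sketch for macro-free input only. It assumes a naive tokenizer and half-open [start, end) indices; tokenRanges and TokenRange are hypothetical names, not MathJax APIs, and the sketch does not connect the ranges to rendered SVG nodes.

```typescript
// Hypothetical helper, not part of MathJax: a naive tokenizer for macro-free TeX.
interface TokenRange {
  text: string;   // the token as written in the input
  start: number;  // index of the first character
  end: number;    // index one past the last character (half-open range)
}

function tokenRanges(tex: string): TokenRange[] {
  const ranges: TokenRange[] = [];
  // A control word (\cdot), a control symbol (\{), or any single non-space character.
  const pattern = /\\[a-zA-Z]+|\\.|\S/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(tex)) !== null) {
    ranges.push({
      text: match[0],
      start: match.index,
      end: match.index + match[0].length,
    });
  }
  return ranges;
}

// 'a': (0,1), '\cdot': (2,7), 'b': (8,9)
const ranges = tokenRanges('a \\cdot b');
```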

I can see the fields start and end in the MathItem class, which I believe is what I'm looking for. I thought I could access them through the MathDocument returned by getMetrics(), but that didn't work out. Now I realize that the html variable is a MathDocument itself, so I tried html.math.toArray() but get an empty list still.

dpvc commented

The start and end properties in the MathItem object are references to the location of the original math expression within the document, and refer to the expression as a whole, not to individual pieces of an expression. Note that the math items in the html.math list are only those expressions that are in the document originally as TeX expressions with math delimiters like \(...\) and $$...$$; math that is converted by hand via html.convert() is not part of the document, and so there are no MathItem elements for it in html.math. That is why that list is empty for you. It only gets populated via html.render(), which typesets the page.

The mapping you are looking for is not maintained by MathJax, and would not be as straight-forward as you seem to suggest, as the elements produced are not in a one-to-one correspondence with the original LaTeX. Because TeX is a macro language, things can get more complicated. For example, if you use

$$\def\abc#1{\sin(#1)+#1} \abc{y}$$
$$\abc{z}$$

you will get two displayed equations, the first being sin(y) + y and the second being sin(z) + z. In the first, where should the "sin" point back to? Should it be (11, 14), for the \sin within the definition of \abc? Or should it be (24, 27), for the \abc call that generated it (and if so, is it to just \abc, or to \abc{y} as a whole)? Should the element for the first "y" point to the #1 inside the \sin(#1), or to the y in \abc{y}? What about the + in the output (is it to the + in the definition of \abc or to the call to \abc)?

Things get worse in the second expression, since it doesn't include the definition of \abc, so there is nothing in that expression to map the + back to other than the \abc itself. Of course, macros can be much more complicated than this, such as the \ce macro for chemical expressions from the mhchem package. That takes a string and converts it to a completely different TeX expression, which is then parsed. So \ce{H2O} becomes something like \mathrm{H}_2\mathrm{O}. I assume you would want the output "H" to map back to the H in the \ce call (not to the \ce itself), but this would require the mhchem package to track all the transformations that it makes on the original input and pass those on somehow to the final TeX string that it produces. And what if the \ce call is itself inside some other macro call?

Trying to track all this back to the original LaTeX would require substantial rewriting of the internals of much of MathJax, which is not something that is likely to occur.

There are some techniques you can try, however, to tie the output back to the input. There was a discussion some time ago in the MathJax user's forum that illustrated one approach to tying mouse clicks on the output back to the original TeX. It involves adding \cssId{} macros to set ids on the elements you are interested in. You could do this with all elements in your expression, if you want.

If you are doing this for some sort of editor, then I suggest that you don't want to use LaTeX as the internal format of the expression you are editing. Rather, you probably want to use an internal "abstract syntax tree" (AST) that more naturally represents the expression. Then you could have a method that converts that AST to LaTeX (or to MathML, or any other needed format). You could use the \cssId approach to tag the items in the output using numbers that are associated with the individual nodes in the internal AST, for example, as a means of mapping the output back to the input. But trying to pull back to LaTeX code is going to be a nightmare, in general.
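
A minimal sketch of that AST idea, under stated assumptions: the Expr type, the node- id prefix, and the helper names toTaggedLatex, indexById, and nodeForCssId are all hypothetical and only illustrate the tagging scheme. The only MathJax feature relied on is the \cssId{id}{math} macro, which comes from the html TeX extension, so that extension would presumably need to be enabled in the TeX input jax.

```typescript
// Hypothetical AST shape for illustration; a real editor's AST will differ.
type Expr =
  | {id: number; kind: 'symbol'; name: string}
  | {id: number; kind: 'binary'; op: string; left: Expr; right: Expr};

// Emit LaTeX, wrapping every node in \cssId{<prefix><id>}{...}.
function toTaggedLatex(node: Expr, prefix = 'node-'): string {
  const body = node.kind === 'symbol'
    ? node.name
    : `${toTaggedLatex(node.left, prefix)} ${node.op} ${toTaggedLatex(node.right, prefix)}`;
  return `\\cssId{${prefix}${node.id}}{${body}}`;
}

// Index nodes by id so an id read from the rendered output identifies its AST node.
function indexById(node: Expr, index = new Map<number, Expr>()): Map<number, Expr> {
  index.set(node.id, node);
  if (node.kind === 'binary') {
    indexById(node.left, index);
    indexById(node.right, index);
  }
  return index;
}

// Map an element id such as "node-2" back to the AST node, if any.
function nodeForCssId(cssId: string, index: Map<number, Expr>, prefix = 'node-'): Expr | undefined {
  return cssId.startsWith(prefix) ? index.get(Number(cssId.slice(prefix.length))) : undefined;
}

const product: Expr = {
  id: 0, kind: 'binary', op: '\\cdot',
  left: {id: 1, kind: 'symbol', name: 'a'},
  right: {id: 2, kind: 'symbol', name: 'b'},
};

// \cssId{node-0}{\cssId{node-1}{a} \cdot \cssId{node-2}{b}}
const latex = toTaggedLatex(product);
```

In this sketch the operator is not tagged on its own; it is covered only by the id of the enclosing binary node. A click handler could then walk up from the clicked element to the nearest ancestor whose id starts with the prefix and pass that id to nodeForCssId. Whether the extra grouping changes spacing in the rendered output is worth checking for your expressions.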


Note: this issue tracker is intended for reporting actual bugs in the demo code in this repository. General questions are better asked in the MathJax users forum, where you are likely to have more people see your question.

Thank you for the detailed response, and pointing me to the correct place to ask these sorts of questions in the future. The \cssId approach sounds really promising. I did not know this was even possible! I already have an AST which gets converted to LaTeX source, so it's pretty simple to include a \cssId wrapper while converting.

Is the discussion thread you were referring to this one? Your link doesn't work for me.

dpvc commented

No, the discussion was this one. Sorry for the wrong link.