Unibeautify/sparser

Extend token preservation to script and style lexers via comments

Closed this issue · 4 comments

For beautification, the parsing output should provide sufficient information to round-trip from input file to parsed data to output file with no changes to the displayed text.

This means that each parsed token needs to retain two things: the number of newlines between it and the previous parsed token, and the specific whitespace characters that occur before it. That whitespace runs from the start of the line to the token if it is the first token on the line, or from the previous token to the current token otherwise.
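To make the requirement concrete, here is a minimal sketch of the idea, not sparser's actual data format. The `tokenize` and `render` functions and the `text`, `lines`, and `before` field names are hypothetical, and tokens are simply runs of non-whitespace characters rather than real language tokens.

```javascript
// Hypothetical token shape (an assumption, not sparser's record format):
//   text   - the token itself
//   lines  - number of newlines between the previous token and this one
//   before - whitespace between the line start (or previous token) and this token
function tokenize(source) {
    const tokens = [];
    const pattern = /(\s*)(\S+)/g;
    let match = pattern.exec(source);
    while (match !== null) {
        const gap = match[1];
        const lastBreak = gap.lastIndexOf("\n");
        tokens.push({
            text: match[2],
            lines: gap.split("\n").length - 1,
            // keep only the whitespace after the final newline of the gap
            before: gap.slice(lastBreak + 1)
        });
        match = pattern.exec(source);
    }
    return tokens;
}

// Rebuild the text from the recorded newline counts and leading whitespace.
function render(tokens) {
    return tokens
        .map((token) => "\n".repeat(token.lines) + token.before + token.text)
        .join("");
}
```

With only these two fields, trailing whitespace, whitespace on empty lines, and the specific newline characters are not preserved.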

Round-tripping with full fidelity, including whitespace after text, whitespace on empty lines, and specific or mixed newline characters, would be nice to have but is not required.

To help me better appreciate the nature of this problem, could you provide a use case example? In most languages the specific white space immediately preceding a token isn't really important. In markdown, though, the specific white space preceding a particular token, especially at the start of a line, can be critically important, but I am accounting for this in the lexer logic.

@prettydiff - The whitespace isn't important to the language but it is often important to the user and for formatting.

Consider the case where the user has formatted a JavaScript array literal in some non-standard but useful way (see the preserve-array-indentation setting in js-beautify). In that case, tracking the whitespace before the token matters.
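As an illustration of that case (the data is made up for this example), consider an array laid out by hand as a grid:

```javascript
// Hypothetical data: a 3x3 identity matrix, aligned one row per line.
const identity = [
    1, 0, 0,
    0, 1, 0,
    0, 0, 1
];
```

A beautifier that discards the original whitespace would likely collapse this onto one line or put each element on its own line. Either way the grid layout the user chose is lost.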

I think the way to handle that would be to specify preservation around a block of code so that the code structure in question is parsed as a single token with all of its white space. I already have this for markup and I could extend it to the script and style lexers using an opening and closing comment pair.

The solution is published with documentation (though I just realized I copy/pasted a JavaScript code sample into the CSS documentation). It works like this:

  1. There must be a comment that matches this pattern: /^(\/(\/|\*)\s*parse-ignore-start)/ which is essentially a line or block comment that opens with optional white space immediately followed by the word parse-ignore-start.
  2. The opening comment and everything that follows it will be captured as a single parsed token with type ignore and presv set to true.
  3. This ignored token will contain everything from the start of the opening comment through either the end of the closing comment or end of code sample.
  4. The closing comment is nearly identical to the opening comment, except that the word is parse-ignore-end.
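The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the lexer's real code: `readIgnore` is a hypothetical helper, the returned object shape is made up apart from the ignore type and presv flag described above, and the closing pattern is inferred from the opening one.

```javascript
// Opening pattern quoted from step 1.
const startPattern = /^(\/(\/|\*)\s*parse-ignore-start)/;

// Hypothetical helper: if an opening comment begins at `index`, return one
// ignore token that runs through the closing comment or the end of the source.
function readIgnore(source, index) {
    if (startPattern.test(source.slice(index)) === false) {
        return null;
    }
    // Assumed closing pattern: same shape as the opening one, different word.
    const endPattern = /\/(\/|\*)\s*parse-ignore-end/g;
    endPattern.lastIndex = index;
    const closing = endPattern.exec(source);
    let end = source.length;
    if (closing !== null) {
        const after = closing.index + closing[0].length;
        if (closing[1] === "*") {
            // block comment: include the closing "*/" if it is present
            const close = source.indexOf("*/", after);
            end = (close === -1) ? source.length : close + 2;
        } else {
            // line comment: stop at the end of that line
            const newline = source.indexOf("\n", after);
            end = (newline === -1) ? source.length : newline;
        }
    }
    return {
        type: "ignore",
        presv: true,
        token: source.slice(index, end),
        end: end
    };
}
```

A lexer loop could call such a helper whenever it encounters a comment opener and, on a non-null result, push the token and resume scanning at `end`.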

This feature is included in both the script and style lexers with identical effect. Here is a brief code example:

if (b[a] === "\n") {
    /* parse-ignore-start */
    if (options.lang === "apacheVelocity" && lex[0] === "#") {
        a = a - 1;
        break;
    }
    /* parse-ignore-end */
    parse.lineNumber = parse.lineNumber + 1;
}