trishume/syntect

Fix the last few syntax tests

trishume opened this issue ยท 10 comments

After #55 fixed most syntax test exposed bugs, there's a few left. Now the syntax tests for almost all languages pass perfectly. The following languages still fail:

  • ASP: not sure why, think it is some weird thing with with_prototype. Speculation in #51
  • LaTeX: mostly works, but one pattern matches the empty string, should be fixed in Sublime Packages. Notes in #51 by @keith-hall
  • Markdown: one test fails on <a name=\"demo\"></a>. Not sure why. Might be that sublimehq/Packages#877 doesn't actually work correctly. Not sure if @keith-hall ever ran the syntax tests on it. These tests no longer fail because the syntax in the repo was updated, but there's still probably an underlying difference from Sublime here.
  • PHP: a number of failures, not sure why. Might be the same thing as ASP. Fixed in #103

If you plan on attempting to diagnose or fix one of these bugs, create an issue for it and reference it here. I will do the same.

I don't foresee myself personally doing much work on this for the next little while, but I will review code and discuss and may get around to it eventually.

I'm investigating the Markdown test failure under #62

I've logged a new issue at #104 which details why the ASP syntax tests are still failing, and an additional bug which doesn't show up in the official ST default packages repo.

I think the reason why the LaTeX syntax has a failure in syntect is due to how ST and syntect optimize the matching - ST seems to check "does this regex pattern match at this point", and if no patterns match, it advances the "point" to the next character and does all the work again, whereas syntect caches the next match position for each regex. When there are dodgy regex patterns in the syntax definition that consume no characters, syntect doesn't see that actually there is a match further along that will consume characters, and instead just moves on to the next pattern match that "does something".

i.e. we can see from the operations that the pattern that matches the empty string matches before the close brace pattern that pops the context

blank match `[A-Za-z[:digit:]-]*` at column 27 on line `\usepackage[args]{mypackage, anotherpackage}`
\usepackage[args]{mypackage, anotherpackage}
^ +meta.preamble.usepackage.latex
\usepackage[args]{mypackage, anotherpackage}
^ +keyword.control.preamble.latex
\usepackage[args]{mypackage, anotherpackage}
^ +punctuation.definition.backslash.latex
\usepackage[args]{mypackage, anotherpackage}
 ^ pop 1 [<punctuation.definition.backslash.latex>]
\usepackage[args]{mypackage, anotherpackage}
           ^ pop 1 [<keyword.control.preamble.latex>]
\usepackage[args]{mypackage, anotherpackage}
           ^ +meta.group.bracket.latex
\usepackage[args]{mypackage, anotherpackage}
           ^ +punctuation.definition.group.bracket.begin.latex
\usepackage[args]{mypackage, anotherpackage}
            ^ pop 1 [<punctuation.definition.group.bracket.begin.latex>]
\usepackage[args]{mypackage, anotherpackage}
                ^ +punctuation.definition.group.bracket.end.latex
\usepackage[args]{mypackage, anotherpackage}
                 ^ pop 1 [<punctuation.definition.group.bracket.end.latex>]
\usepackage[args]{mypackage, anotherpackage}
                 ^ pop 1 [<meta.group.bracket.latex>]
\usepackage[args]{mypackage, anotherpackage}
                 ^ +meta.preamble.usepackage.latex
\usepackage[args]{mypackage, anotherpackage}
                 ^ +meta.group.brace.latex
\usepackage[args]{mypackage, anotherpackage}
                 ^ +punctuation.definition.group.brace.begin.latex
\usepackage[args]{mypackage, anotherpackage}
                  ^ pop 1 [<punctuation.definition.group.brace.begin.latex>]
\usepackage[args]{mypackage, anotherpackage}
                  ^ pop 3 [<meta.preamble.usepackage.latex>, <meta.preamble.usepackage.latex>, <meta.group.brace.latex>]
\usepackage[args]{mypackage, anotherpackage}
                  ^ +meta.preamble.usepackage.latex
\usepackage[args]{mypackage, anotherpackage}
                  ^ +meta.group.brace.latex
\usepackage[args]{mypackage, anotherpackage}
                  ^ +support.class.latex
\usepackage[args]{mypackage, anotherpackage}
                           ^ pop 1 [<support.class.latex>]
\usepackage[args]{mypackage, anotherpackage}
                                           ^ +punctuation.definition.group.brace.end.latex (NOTE: the `,` is col 27)
\usepackage[args]{mypackage, anotherpackage}
                                            ^ pop 1 [<punctuation.definition.group.brace.end.latex>]
\usepackage[args]{mypackage, anotherpackage}
                                            ^ pop 2 [<meta.preamble.usepackage.latex>, <meta.group.brace.latex>]

I know we've already agreed that the syntax definition should be fixed and not syntect, and I just want to state that I still strongly agree with this, I just wanted to share more detail about why this difference arises - it seems silly to remove our optimization or have extra code for working around poorly coded regex patterns.

@keith-hall Interesting, thanks for the investigation.

I have a hope that perhaps in a few places Sublime uses a different model than syntect and that if we figure out that model and switch to it our code will get cleaner, more correct, and also possibly faster. For example #124 is much better now with the new comments, but it still makes me uneasy that all that special-case logic is necessary, it makes me wonder if Sublime does things in a way that doesn't require a bunch of special-cases. This will probably take a bunch of thinking and testing though, which I don't have time for yet, probably not till at least late December. The method of using "does this regex match at this point" that you mention sounds to me like it would be slower, but I can see how I might be wrong and it might be either just as fast or faster.

Ideally I think we should still strive for compatibility instead of just fixing the upstream packages, because it increases the likelihood of third party syntaxes working without modification, but it's definitely easier to fix upstream and perfect compatibility is a lot of effort that at least I personally have limited time to work on.

My knowledge of how Sublime works internally is of course not really knowledge at all but educated guesswork, based on my experiences and some posts made by sublimehq - that said, I know that Sublime, when reading syntax definitions, creates a "regex cache" file, which contains all regex patterns in the syntax definition. However, there are some patterns it can never cache - and those are the pop patterns that make use of backrefs. So I'm pretty sure that Sublime doesn't "special case" it - whenever a "target" context is pushed from a match pattern that had capture groups, it most likely stores the text that was captured and at that point in time / for that parse state: "injects" those into any pop patterns in the root level of that context that use backrefs. Therefore it it would be the same code for with_prototype as well as push/set. And because it flattens all regexs into a cache, it maybe doesn't have to worry about includes needing separate logic (#104), as the patterns from those contexts are just seen as being direct children of the context.

I'd also be interested to see how performance would differ for the point by point approach.
Compatibility should of course be an end goal, but for edge cases like this I would personally prioritize clean code and performance for now :)

I haven't looked at Syntect at all, but FWIW Sublime Text does cache the next matching position for each regex (at least when we're using the Onigurama fallback, there's no need for this with sregex).

Only regexes associated with push, set, and embed actions are allowed to match against zero length input, other actions use ONIG_OPTION_FIND_NOT_EMPTY.

Thanks Jon, useful bit of info to know ๐Ÿ‘ that should resolve the differences in how syntect parses LaTeX compared to ST ๐ŸŽ†

I'd like to ask, is this part of fixing these tests:

failures:
    highlighting::highlighter::tests::can_parse
    highlighting::highlighter::tests::can_parse_with_highlight_state_from_cache
    highlighting::highlighter::tests::test_ranges
    highlighting::theme_set::tests::can_parse_common_themes
    html::tests::strings
    html::tests::tricky_test_syntax
    parsing::parser::tests::can_compare_parse_states
    parsing::parser::tests::can_parse_backrefs
    parsing::parser::tests::can_parse_includes
    parsing::parser::tests::can_parse_issue25
    parsing::parser::tests::can_parse_preprocessor_rules
    parsing::parser::tests::can_parse_simple
    parsing::parser::tests::can_parse_yaml
    parsing::syntax_set::tests::can_load

Because these currently fail on main.

"syntax tests" refers to http://www.sublimetext.com/docs/syntax.html#testing, specifically the test files in the sublimehq/Packages repo which syntect has as a submodule. All Rust tests should be working...

Thanks! I believe what happened was that locally, I didn't run

git submodule init
git submodule update

That helped with a bunch of tests; I'm sorry for diverging the discussion here. Although, I think this little instruction should be available somewhere.