jgm/skylighting

Divergences with KDE on the interpretation of some syntax rules

jonathanpoelen opened this issue · 8 comments

Hi, there are some discrepancies on the interpretation of some rules like <Int>. Original bug here: https://bugs.kde.org/show_bug.cgi?id=468394

Skylighting doesn't detect 12 in 12L as an integer. Here's what kate does:

$ kate-syntax-highlighter --output-format=ansi256colors --syntax-trace=format --stdin -s R
12
|
Int
12L
| |
Int
  |
  Normal Text

<Int>, <Float>, <HlCHex> and <HlCOct> check the delimiter on the left, but not on the right. I think the behavior diverges there.

For #156, the issue is related to line 802 of prolog.xml (<DetectIdentifier />), which doesn't specify an attribute and uses the context attribute (Syntax Error). However, we need to take the one from the context which includes, line 654 <IncludeRules context="quoted_last"/>) in the context <context name="dq" lineEndContext="#stay" attribute="&quot;double-quoted&quot;".

I note that #43 and #86 are fixed upstream and in skylighting.

jgm commented

Is the L only allowed as a modifier on an Int, or also on Float, HlCHex, HlCOct?

For all 4, but not specially L, anything after the digits is ignored in these rules.

jgm commented

I'm still confused about this.
Over at the KDE documentation it says this about Int:

Detect an integer number (as the regular expression: \b[0-9]+).

Nothing there indicates that the rule is supposed to consume any characters after the digits...

I probably misspoke. When I say ignored, I mean that the rule stops its evaluation and stops consuming, nothing more. From my point of view and based on -T flag, skylighting does the equivalent of \b[0-9]+\b (the second \b being too much). Also, it's not explained - or very implicitly - but \b is a fusion of deliminators and local deliminators (not a strict equivalent of regex).

I'm the original reporter of issue #468394 that jonathanpoelen linked to (thanks, Jonathan, for opening this here). Just to (hopefully?) clarify: The KDE Int rule does seem to behave as in the documentation (ignoring whatever might follow a sequence of digits), so in KDE, it correctly tags the digits from a sequence like 12L and ignores the L. But (it seems) that same Int rule fails to tag the digits when it's interpreted through skylighting (at least as used by pandoc, specifically with the r.xml syntax highlighting file). I couldn't test elsewhere; but replacing the Int rule with a corresponding RegExpr rule (including the KDE-documentation-suggested regex) worked fine, tagging the digits and leaving following material (the L, etc.) for other rules. The Int rule does work as expected in skylighting/pandoc when there's not an alpha character immediately after the digits.

jgm commented

Thanks for the clarifications.

jgm commented

Yes we have

pDec :: A.Parser ()
pDec = do
  mbMinus
  _ <- A.takeWhile1 (A.inClass "0-9")
  guardWordBoundary

and the guardWordBoundary here is preventing this from being parsed as an int.

Can I just remove that? KDE will parse 12hello as an integer followed by other stuff?

Can I just remove that? KDE will parse 12hello as an integer followed by other stuff?

Yes