2.13 Regression of XML syntax highlighting (md -> html5)
kazalex opened this issue · 11 comments
markdown:
``` {.xml}
<?xml version="1.0" encoding="utf-8"?>
<methodCall>
<methodName>system.listMethods</methodName>
<params/>
</methodCall>
```
2.12 (XML tags have span class "keyword"):
<div class="sourceCode" id="cb10"><pre class="sourceCode xml"><code class="sourceCode xml"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="kw"><?xml</span> version="1.0" encoding="utf-8"<span class="kw">?></span></span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><span class="kw"><methodCall></span></span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> <span class="kw"><methodName></span>system.listMethods<span class="kw"></methodName></span></span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a> <span class="kw"><params/></span></span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a><span class="kw"></methodCall></span></span></code></pre></div>
</div>
2.13 (XML tags have span class "error"):
<div class="sourceCode" id="cb10"><pre class="sourceCode xml"><code class="sourceCode xml"><span id="cb10-1"><a href="#cb10-1" aria-hidden="true" tabindex="-1"></a><span class="fu"><?xml</span><span class="ot"> version=</span><span class="st">"1.0"</span><span class="ot"> encoding=</span><span class="st">"utf-8"</span><span class="fu">?></span></span>
<span id="cb10-2"><a href="#cb10-2" aria-hidden="true" tabindex="-1"></a><<span class="ot">methodCall</span><span class="er">></span></span>
<span id="cb10-3"><a href="#cb10-3" aria-hidden="true" tabindex="-1"></a> <span class="er"><methodName>system.listMethods</methodName></span></span>
<span id="cb10-4"><a href="#cb10-4" aria-hidden="true" tabindex="-1"></a> <span class="er"><params/></span></span>
<span id="cb10-5"><a href="#cb10-5" aria-hidden="true" tabindex="-1"></a><span class="er"></methodCall></span></span></code></pre></div>
</div>
Simple repro:
<a>
<b/>
</a>
Everything from the closing > on the first line onward is an error token.
In fact, just <a> is enough to reproduce.
Here is trace output for the tokenizer:
Trying rule Rule {rMatcher = IncludeRules ("XML","FindXML"), rAttribute = NormalTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
Trying rule Rule {rMatcher = DetectSpaces, rAttribute = NormalTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
Trying rule Rule {rMatcher = StringDetect "<!--", rAttribute = CommentTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("XML","Comment")]}
Trying rule Rule {rMatcher = StringDetect "<![CDATA[", rAttribute = BaseNTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = True, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("XML","CDATAStart")]}
Trying rule Rule {rMatcher = RegExpr (RE {reString = "<!(?=DOCTYPE\\s+)", reCaseSensitive = True}), rAttribute = DataTypeTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("XML","DoctypeTagName")]}
Trying rule Rule {rMatcher = IncludeRules ("XML","FindProcessingInstruction"), rAttribute = NormalTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
Trying rule Rule {rMatcher = RegExpr (RE {reString = "<\\?(?=([\\w:_-]*))", reCaseSensitive = True}), rAttribute = FunctionTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("XML","PI TagName")]}
Trying rule Rule {rMatcher = RegExpr (RE {reString = "<(?=((?![0-9])[\\w_:][\\w.:_-]*))", reCaseSensitive = True}), rAttribute = NormalTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("XML","ElementTagName")]}
RegExpr MATCHED Just (NormalTok,"<")
CONTEXT STACK ["ElementTagName","Start"]
IncludeRules MATCHED Just (NormalTok,"<")
Trying rule Rule {rMatcher = StringDetect "%1", rAttribute = KeywordTok, rIncludeAttribute = False, rDynamic = True, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Pop,Push ("XML","Element")]}
CONTEXT STACK ["Start"]
CONTEXT STACK ["Element","Start"]
Trying rule Rule {rMatcher = Detect2Chars '/' '>', rAttribute = NormalTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Pop]}
Trying rule Rule {rMatcher = DetectChar '>', rAttribute = NormalTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("XML","El Content")]}
Trying rule Rule {rMatcher = RegExpr (RE {reString = "(?:^|\\s+)(?![0-9])[\\w_:][\\w.:_-]*", reCaseSensitive = True}), rAttribute = OtherTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("XML","Attribute")]}
RegExpr MATCHED Just (OtherTok,"a")
CONTEXT STACK ["Attribute","Element","Start"]
Trying rule Rule {rMatcher = DetectChar '=', rAttribute = OtherTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Pop,Push ("XML","Value")]}
Trying rule Rule {rMatcher = RegExpr (RE {reString = "\\S", reCaseSensitive = True}), rAttribute = ErrorTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = []}
RegExpr MATCHED Just (ErrorTok,">")
<a>
Here are the latest changes to the xml.xml syntax definition, which I merged from upstream (KDE) before this release:
diff --git a/skylighting-core/xml/xml.xml b/skylighting-core/xml/xml.xml
index d2cc327..fbefcb6 100644
--- a/skylighting-core/xml/xml.xml
+++ b/skylighting-core/xml/xml.xml
@@ -6,7 +6,7 @@
<!ENTITY name "(?![0-9])[\w_:][\w.:_-]*">
<!ENTITY entref "&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|&name;);">
]>
-<language name="XML" version="11" kateversion="5.0" section="Markup" extensions="*.docbook;*.xml;*.rc;*.daml;*.rdf;*.rss;*.xspf;*.xsd;*.svg;*.ui;*.kcfg;*.qrc;*.wsdl;*.scxml;*.xbel;*.dae;*.sch;*.brd" mimetype="text/xml;text/book;text/daml;text/rdf;application/rss+xml;application/xspf+xml;image/svg+xml;application/x-designer;application/x-xbel;application/xml;application/scxml+xml" casesensitive="1" indenter="xml" author="Wilbert Berendsen (wilbert@kde.nl)" license="LGPL">
+<language name="XML" version="12" kateversion="5.0" section="Markup" extensions="*.docbook;*.xml;*.rc;*.daml;*.rdf;*.rss;*.xspf;*.xsd;*.svg;*.ui;*.kcfg;*.qrc;*.wsdl;*.scxml;*.xbel;*.dae;*.sch;*.brd" mimetype="text/xml;text/book;text/daml;text/rdf;application/rss+xml;application/xspf+xml;image/svg+xml;application/x-designer;application/x-xbel;application/xml;application/scxml+xml" casesensitive="1" indenter="xml" author="Wilbert Berendsen (wilbert@kde.nl)" license="LGPL">
<highlighting>
<contexts>
@@ -17,10 +17,10 @@
<context name="FindXML" attribute="Normal Text" lineEndContext="#stay">
<DetectSpaces />
<StringDetect attribute="Comment" context="Comment" String="<!--" beginRegion="comment" />
- <StringDetect attribute="CDATA" context="CDATA" String="<![CDATA[" beginRegion="cdata" />
- <RegExpr attribute="Doctype" context="Doctype" String="<!DOCTYPE\s+" beginRegion="doctype" />
- <RegExpr attribute="Processing Instruction" context="PI" String="<\?[\w:_-]*" beginRegion="pi" />
- <RegExpr attribute="Element" context="Element" String="<&name;" beginRegion="element" />
+ <StringDetect attribute="CDATA" context="CDATAStart" String="<![CDATA[" lookAhead="true" />
+ <RegExpr attribute="Doctype Symbols" context="DoctypeTagName" String="<!(?=DOCTYPE\s+)" beginRegion="doctype" />
+ <IncludeRules context="FindProcessingInstruction" />
+ <RegExpr attribute="Element Symbols" context="ElementTagName" String="<(?=(&name;))" beginRegion="element" />
<IncludeRules context="FindEntityRefs" />
<DetectIdentifier />
</context>
@@ -45,32 +45,63 @@
<DetectIdentifier />
</context>
+ <context name="CDATAStart" attribute="Other Text" lineEndContext="#pop">
+ <StringDetect attribute="CDATA Symbols" context="#stay" String="<![" beginRegion="cdata" />
+ <StringDetect attribute="CDATA" context="#stay" String="CDATA" />
+ <DetectChar attribute="CDATA Symbols" context="#pop!CDATA" char="[" />
+ </context>
<context name="CDATA" attribute="Other Text" lineEndContext="#stay">
<DetectSpaces />
<DetectIdentifier />
- <StringDetect attribute="CDATA" context="#pop" String="]]>" endRegion="cdata" />
+ <StringDetect attribute="CDATA Symbols" context="#pop" String="]]>" endRegion="cdata" />
<StringDetect attribute="EntityRef" context="#stay" String="]]&gt;" />
</context>
+ <context name="FindProcessingInstruction" attribute="Other Text" lineEndContext="#stay">
+ <RegExpr attribute="PI Symbols" context="PI TagName" String="<\?(?=([\w:_-]*))" beginRegion="pi" />
+ </context>
+ <context name="PI TagName" attribute="Other Text" lineEndContext="#pop!PI" fallthrough="true" fallthroughContext="#pop!PI">
+ <RegExpr attribute="Processing Instruction" context="#pop!PI-XML" String="xml(?=\s|$)" insensitive="true" />
+ <StringDetect attribute="Processing Instruction" context="#pop!PI" String="%1" dynamic="true" />
+ </context>
<context name="PI" attribute="Other Text" lineEndContext="#stay">
- <Detect2Chars attribute="Processing Instruction" context="#pop" char="?" char1=">" endRegion="pi" />
+ <Detect2Chars attribute="PI Symbols" context="#pop" char="?" char1=">" endRegion="pi" />
+ </context>
+ <context name="PI-XML" attribute="Other Text" lineEndContext="#stay">
+ <IncludeRules context="PI" />
+ <RegExpr attribute="Attribute" context="#stay" String="(?:^|\s+)&name;" />
+ <DetectChar attribute="Attribute" context="Value" char="=" />
</context>
+ <context name="DoctypeTagName" attribute="Other Text" lineEndContext="#pop">
+ <StringDetect attribute="Doctype" context="#pop!DoctypeVariableName" String="DOCTYPE" />
+ </context>
+ <context name="DoctypeVariableName" attribute="Other Text" lineEndContext="#pop!Doctype" fallthrough="true" fallthroughContext="#pop!Doctype">
+ <DetectSpaces />
+ <RegExpr attribute="Doctype Name" context="#pop!Doctype" String="&name;" />
+ </context>
<context name="Doctype" attribute="Other Text" lineEndContext="#stay">
- <DetectChar attribute="Doctype" context="#pop" char=">" endRegion="doctype" />
- <DetectChar attribute="Doctype" context="Doctype Internal Subset" char="[" beginRegion="int_subset" />
+ <DetectChar attribute="Doctype Symbols" context="#pop" char=">" endRegion="doctype" />
+ <DetectChar attribute="Doctype Symbols" context="Doctype Internal Subset" char="[" beginRegion="int_subset" />
</context>
<context name="Doctype Internal Subset" attribute="Other Text" lineEndContext="#stay">
- <DetectChar attribute="Doctype" context="#pop" char="]" endRegion="int_subset" />
- <RegExpr attribute="Doctype" context="Doctype Markupdecl" String="<!(?:ELEMENT|ENTITY|ATTLIST|NOTATION)\b" />
+ <DetectChar attribute="Doctype Symbols" context="#pop" char="]" endRegion="int_subset" />
+ <RegExpr attribute="Doctype Symbols" context="Doctype Markupdecl TagName" String="<!(?=(ELEMENT|ENTITY|ATTLIST|NOTATION)\b)" />
<StringDetect attribute="Comment" context="Comment" String="<!--" beginRegion="comment" />
- <RegExpr attribute="Processing Instruction" context="PI" String="<\?[\w:_-]*" beginRegion="pi" />
+ <IncludeRules context="FindProcessingInstruction" />
<IncludeRules context="FindPEntityRefs" />
</context>
+ <context name="Doctype Markupdecl TagName" attribute="Other Text" lineEndContext="#pop">
+ <StringDetect attribute="Doctype" context="#pop!Doctype Markupdecl VariableName" String="%1" dynamic="true" />
+ </context>
+ <context name="Doctype Markupdecl VariableName" attribute="Other Text" lineEndContext="#pop!Doctype Markupdecl" fallthrough="true" fallthroughContext="#pop!Doctype Markupdecl">
+ <DetectSpaces />
+ <RegExpr attribute="Doctype Name" context="#pop!Doctype Markupdecl" String="&name;" />
+ </context>
<context name="Doctype Markupdecl" attribute="Other Text" lineEndContext="#stay">
- <DetectChar attribute="Doctype" context="#pop" char=">" />
+ <DetectChar attribute="Doctype Symbols" context="#pop" char=">" />
<DetectChar attribute="Value" context="Doctype Markupdecl DQ" char=""" />
<DetectChar attribute="Value" context="Doctype Markupdecl SQ" char="'" />
</context>
@@ -85,25 +116,31 @@
<IncludeRules context="FindPEntityRefs" />
</context>
+ <context name="ElementTagName" attribute="Other Text" lineEndContext="#pop!Element" fallthrough="true" fallthroughContext="#pop!Element">
+ <StringDetect attribute="Element" context="#pop!Element" String="%1" dynamic="true" />
+ </context>
<context name="Element" attribute="Other Text" lineEndContext="#stay">
- <Detect2Chars attribute="Element" context="#pop" char="/" char1=">" endRegion="element" />
- <DetectChar attribute="Element" context="El Content" char=">" />
+ <Detect2Chars attribute="Element Symbols" context="#pop" char="/" char1=">" endRegion="element" />
+ <DetectChar attribute="Element Symbols" context="El Content" char=">" />
<RegExpr attribute="Attribute" context="Attribute" String="(?:^|\s+)&name;" />
<RegExpr attribute="Error" context="#stay" String="\S" />
</context>
<context name="El Content" attribute="Other Text" lineEndContext="#stay">
- <RegExpr attribute="Element" context="El End" String="</&name;" />
+ <RegExpr attribute="Element Symbols" context="El End TagName" String="</(?=(&name;))" />
<IncludeRules context="FindXML" />
</context>
+ <context name="El End TagName" attribute="Other Text" lineEndContext="#pop!El End" fallthrough="true" fallthroughContext="#pop!El End">
+ <StringDetect attribute="Element" context="#pop!El End" String="%1" dynamic="true" />
+ </context>
<context name="El End" attribute="Other Text" lineEndContext="#stay">
- <DetectChar attribute="Element" context="#pop#pop#pop" char=">" endRegion="element" />
+ <DetectChar attribute="Element Symbols" context="#pop#pop#pop" char=">" endRegion="element" />
<RegExpr attribute="Error" context="#stay" String="\S" />
</context>
<context name="Attribute" attribute="Other Text" lineEndContext="#stay">
- <DetectChar attribute="Attribute" context="Value" char="=" />
+ <DetectChar attribute="Attribute" context="#pop!Value" char="=" />
<RegExpr attribute="Error" context="#stay" String="\S" />
</context>
@@ -114,29 +151,34 @@
</context>
<context name="Value DQ" attribute="Value" lineEndContext="#stay">
- <DetectChar attribute="Value" context="#pop#pop#pop" char=""" />
+ <DetectChar attribute="Value" context="#pop#pop" char=""" />
<IncludeRules context="FindEntityRefs" />
</context>
<context name="Value SQ" attribute="Value" lineEndContext="#stay">
- <DetectChar attribute="Value" context="#pop#pop#pop" char="'" />
+ <DetectChar attribute="Value" context="#pop#pop" char="'" />
<IncludeRules context="FindEntityRefs" />
</context>
</contexts>
<itemDatas>
- <itemData name="Normal Text" defStyleNum="dsNormal" />
- <itemData name="Other Text" defStyleNum="dsNormal" />
- <itemData name="Comment" defStyleNum="dsComment" spellChecking="false" />
- <itemData name="CDATA" defStyleNum="dsBaseN" bold="1" spellChecking="false" />
- <itemData name="Processing Instruction" defStyleNum="dsKeyword" spellChecking="false" />
- <itemData name="Doctype" defStyleNum="dsDataType" bold="1" spellChecking="false" />
- <itemData name="Element" defStyleNum="dsKeyword" spellChecking="false" />
- <itemData name="Attribute" defStyleNum="dsOthers" spellChecking="false" />
- <itemData name="Value" defStyleNum="dsString" spellChecking="false" />
- <itemData name="EntityRef" defStyleNum="dsDecVal" spellChecking="false" />
- <itemData name="PEntityRef" defStyleNum="dsDecVal" spellChecking="false" />
- <itemData name="Error" defStyleNum="dsError" spellChecking="false" />
+ <itemData name="Normal Text" defStyleNum="dsNormal" />
+ <itemData name="Other Text" defStyleNum="dsNormal" />
+ <itemData name="Comment" defStyleNum="dsComment" spellChecking="false" />
+ <itemData name="CDATA" defStyleNum="dsBaseN" bold="1" italic="0" spellChecking="false" />
+ <itemData name="CDATA Symbols" defStyleNum="dsBaseN" bold="0" italic="0" spellChecking="false" />
+ <itemData name="Processing Instruction" defStyleNum="dsFunction" bold="1" italic="0" spellChecking="false" />
+ <itemData name="PI Symbols" defStyleNum="dsFunction" bold="0" italic="0" spellChecking="false" />
+ <itemData name="Doctype" defStyleNum="dsDataType" bold="1" italic="0" spellChecking="false" />
+ <itemData name="Doctype Name" defStyleNum="dsDataType" bold="0" italic="0" spellChecking="false" />
+ <itemData name="Doctype Symbols" defStyleNum="dsDataType" bold="0" italic="0" spellChecking="false" />
+ <itemData name="Element" defStyleNum="dsKeyword" spellChecking="false" />
+ <itemData name="Element Symbols" defStyleNum="dsNormal" spellChecking="false" />
+ <itemData name="Attribute" defStyleNum="dsOthers" spellChecking="false" />
+ <itemData name="Value" defStyleNum="dsString" spellChecking="false" />
+ <itemData name="EntityRef" defStyleNum="dsDecVal" spellChecking="false" />
+ <itemData name="PEntityRef" defStyleNum="dsDecVal" spellChecking="false" />
+ <itemData name="Error" defStyleNum="dsError" spellChecking="false" />
</itemDatas>
</highlighting>
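The diff changes the context switch on `=` in Attribute from `Value` to `#pop!Value`, and shortens the closing-quote switch in Value DQ and Value SQ from `#pop#pop#pop` to `#pop#pop`. The Python sketch below checks that these two changes balance out. It is a hypothetical helper, not skylighting code: `apply_switch` is an invented name, and it assumes that the Value context (not shown in the diff) pushes Value DQ when it sees an opening double quote. The stack is written top-first, as in the trace.

```python
def apply_switch(stack, switch):
    """Apply a KDE-style context switch ('#stay', '#pop#pop', '#pop!Name', 'Name').

    The stack is a list with the current context at index 0.
    """
    stack = list(stack)
    target = None
    if "!" in switch:
        switch, target = switch.split("!", 1)
    if switch.startswith("#"):
        for op in switch.split("#")[1:]:
            if op == "pop":
                stack.pop(0)
            # '#stay' leaves the stack unchanged
    else:
        target = switch  # a bare name pushes that context
    if target:
        stack.insert(0, target)
    return stack


def run(switches):
    """Start inside an attribute and apply each switch in order."""
    stack = ["Attribute", "Element", "Start"]
    for switch in switches:
        stack = apply_switch(stack, switch)
    return stack


# Old definition: '=' pushes Value, the closing quote pops three contexts.
old = run(["Value", "Value DQ", "#pop#pop#pop"])
# New definition: '=' replaces Attribute with Value, the closing quote pops two.
new = run(["#pop!Value", "Value DQ", "#pop#pop"])
print(old, new)
```

Under that assumption both versions return to the Element context after an attribute value, so this part of the change does not look like the source of the regression.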
I can confirm that reverting xml.xml to the version from 0.10.5 works.
So something in this round of changes caused the problem.
End of trace for the working one:
CONTEXT STACK ["Element","Start"]
IncludeRules MATCHED Just (KeywordTok,"<a")
Trying rule Rule {rMatcher = Detect2Chars '/' '>', rAttribute = KeywordTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Pop]}
Trying rule Rule {rMatcher = DetectChar '>', rAttribute = KeywordTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("XML","El Content")]}
DetectChar MATCHED Just (KeywordTok,">")
CONTEXT STACK ["El Content","Element","Start"]
So the issue in the current version is this:
Trying rule Rule {rMatcher = RegExpr (RE {reString = "(?:^|\\s+)(?![0-9])[\\w_:][\\w.:_-]*", reCaseSensitive = True}), rAttribute = OtherTok, rIncludeAttribute = False, rDynamic = False, rCaseSensitive = True, rChildren = [], rLookahead = False, rFirstNonspace = False, rColumn = Nothing, rContextSwitch = [Push ("XML","Attribute")]}
RegExpr MATCHED Just (OtherTok,"a")
CONTEXT STACK ["Attribute","Element","Start"]
Why does it think we have an attribute?
The relevant xml is
<RegExpr attribute="Attribute" context="Attribute" String="(?:^|\s+)&name;" />
However, note that this was not changed in the latest changes.
I can see what should be happening. First, we should be matching
<RegExpr attribute="Element Symbols" context="ElementTagName" String="<(?=(&name;))" beginRegion="element" />
and the element name should be captured.
Then, we go to ElementTagName context, and match
<StringDetect attribute="Element" context="#pop!Element" String="%1" dynamic="true" />
With %1 = the previously matched element name. But this isn't occurring. Why not?
Answer: %1 is not defined.
Confirmed that the regex application here doesn't produce a captured group.
That is a bug in our regex engine.
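The failure path in the trace can be replayed with a toy model. The sketch below is hypothetical Python, not skylighting code: `tokenize_open_tag` and its `keep_lookahead_captures` flag are invented for illustration, it handles only a single opening tag, and it assumes that a dynamic `%1` rule with no capture to substitute simply fails to match. Each regex is applied to the remaining input, so `^` matches at the current position, which is what the trace suggests is happening.

```python
import re

NAME = r"(?![0-9])[\w_:][\w.:_-]*"  # the &name; entity from xml.xml


def tokenize_open_tag(text, keep_lookahead_captures):
    """Toy walk through FindXML -> ElementTagName -> Element -> Attribute."""
    tokens = []
    # FindXML: "<" followed by a lookahead that captures the element name.
    m = re.match(r"<(?=(" + NAME + r"))", text)
    if m is None:
        return tokens
    tokens.append(("NormalTok", m.group(0)))
    pos = m.end()
    captures = m.groups() if keep_lookahead_captures else ()

    # ElementTagName: dynamic StringDetect "%1"; falls through to Element on failure.
    if captures and text.startswith(captures[0], pos):
        tokens.append(("KeywordTok", captures[0]))
        pos += len(captures[0])

    context = "Element"
    while pos < len(text):
        rest = text[pos:]
        if context == "Element":
            if rest.startswith("/>") or rest.startswith(">"):
                tokens.append(("NormalTok", "/>" if rest.startswith("/>") else ">"))
                break
            attr = re.match(r"(?:^|\s+)" + NAME, rest)
            if attr:
                tokens.append(("OtherTok", attr.group(0)))
                pos += attr.end()
                context = "Attribute"
                continue
        ch = rest[0]
        if context == "Attribute" and ch == "=":
            tokens.append(("OtherTok", "="))
            break  # attribute values are not modelled here
        if not ch.isspace():
            tokens.append(("ErrorTok", ch))  # the catch-all \S rule
        pos += 1
    return tokens


print(tokenize_open_tag("<a>", keep_lookahead_captures=False))
print(tokenize_open_tag("<a>", keep_lookahead_captures=True))
```

With the capture dropped, the element name is consumed by the attribute rule and `>` becomes an error token, as in the 2.13 trace. With the capture kept, the name is tagged as a keyword and `>` is normal text.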
In ghci we can see the root issue:
Prelude Regex.KDE> testRegex False "<(a+)" "<a>"
Just ("<a",[(1,"a")])
Prelude Regex.KDE> testRegex False "<(?=(a+))" "<a>"
Just ("<",[])
Captures are ignored inside the lookahead (?=...).
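For comparison, here is the same pair of patterns run through Python's standard `re` module. This is only a cross-check against a different engine, not a statement about what Regex.KDE should return in every case, but it shows the behaviour the xml.xml rules appear to rely on: the lookahead consumes nothing, yet its capture group is still reported.

```python
import re

plain = re.match(r"<(a+)", "<a>")
look = re.match(r"<(?=(a+))", "<a>")

# The plain group consumes the name; the lookahead version consumes only "<".
print(plain.group(0), plain.groups())  # <a ('a',)
print(look.group(0), look.groups())    # < ('a',)
```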
*Skylighting.Types Regex.KDE> compileRegex False "<(?=(a+))"
Right (MatchConcat (MatchChar <fn>) (MatchConcat (AssertPositive Forward (MatchConcat (MatchCapture 1 (MatchConcat (MatchSome (MatchChar <fn>)) MatchNull)) MatchNull)) MatchNull))
Note MatchCapture 1 is in there, but somehow it seems not to work.
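One way a capture can be compiled correctly and still get lost is if the lookahead evaluator rewinds the whole match state, captures included, instead of rewinding only the input position. The toy matcher below illustrates that idea in Python. It is an assumption about the kind of bug involved, not a description of Regex.KDE: the tuple-based AST, the `match` function, and the `keep` flag are all invented here, and nothing in this thread yet shows how AssertPositive is actually evaluated.

```python
def match(node, text, pos, caps, keep):
    """Return (new_pos, captures) on success, or None on failure.

    keep=True: a lookahead rewinds the position but keeps inner captures.
    keep=False: a lookahead rewinds both position and captures.
    """
    kind = node[0]
    if kind == "char":
        if pos < len(text) and text[pos] == node[1]:
            return pos + 1, caps
        return None
    if kind == "some":  # one or more, greedy, no backtracking (enough here)
        result = match(node[1], text, pos, caps, keep)
        if result is None:
            return None
        while True:
            nxt = match(node[1], text, result[0], result[1], keep)
            if nxt is None or nxt[0] == result[0]:
                return result
            result = nxt
    if kind == "concat":
        for child in node[1:]:
            result = match(child, text, pos, caps, keep)
            if result is None:
                return None
            pos, caps = result
        return pos, caps
    if kind == "capture":
        result = match(node[2], text, pos, caps, keep)
        if result is None:
            return None
        end, inner = result
        return end, {**inner, node[1]: text[pos:end]}
    if kind == "lookahead":
        result = match(node[1], text, pos, caps, keep)
        if result is None:
            return None
        # The position always rewinds; the question is whether captures survive.
        return pos, (result[1] if keep else caps)
    raise ValueError(kind)


# Toy equivalent of "<(?=(a+))"
pattern = ("concat", ("char", "<"),
           ("lookahead", ("capture", 1, ("some", ("char", "a")))))

print(match(pattern, "<a>", 0, {}, keep=False))  # (1, {})
print(match(pattern, "<a>", 0, {}, keep=True))   # (1, {1: 'a'})
```

The `keep=False` result has the same shape as the ghci output above: one character consumed and no captures, even though the capture node is present in the pattern.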