Syntax Development Tips/Advice
wbond opened this issue · 23 comments

If you've spent some time writing syntaxes, take a moment here and share any revelations you've had, or tips on things to test or look for.
## Check for Scope Doubling

The characters `(`, `)`, `{`, `}`, `[`, `]` are very easy to double scopes on via `meta_scope`. I try to add tests for `^ punctuation` and then also `^ - punctuation punctuation` to ensure they aren't there.
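As a sketch of what such tests can look like (the syntax path and scope names here are illustrative, not from a real package):

```
// SYNTAX TEST "Packages/Example/Example.sublime-syntax"
foo(bar)
//     ^ punctuation.section.parens.end
//     ^ - punctuation punctuation
```

The second assertion fails if the closing paren ever picks up a doubled `punctuation` scope, e.g. from a `meta_scope` that repeats the punctuation scope.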
## Stateful Chaining

### Don't Over-Use

`sublime-syntax` makes it very easy to have complex interlocking chains of stateful contexts which transition from one to the other via `set`. This is sometimes necessary to achieve the desired scoping, but it's also very easy to get lost in the mire. Always convince yourself that you need this feature before you use it.

### Push Your First State

While it is absolutely possible to have a `match` in `main` which `set`s into a chain of stateful contexts, and subsequently `set`s back into `main` at the end, it is not recommended. `main` should be a stateless "baseline" context that is always the last element on the stack. Instead, have your `match` in `main` use `push` to get into your first state, then `pop` out of the last state. For example, imagine we wanted to match the sequence `abc` with each character scoped differently, and only when they follow each other. For illustration purposes, we will also match numerics in `main`:
```yaml
contexts:
  main:
    - match: a
      scope: first
      push: expect-b
    - match: \d+
      scope: constant.numeric
  expect-b:
    - match: b
      scope: second
      set: expect-c
  expect-c:
    - match: c
      scope: third
      pop: true
```
Notice how `a` pushes `expect-b`. We don't `set` the first context, only the second one. Once we find the terminator, we `pop` out.

### Lookahead Push for Meta-Scoping

Sometimes you need to apply a meta-scope to an entire stateful chunk. When this is the case, you almost certainly want your `push` rule to be a non-consuming lookahead, rather than a consuming scoped match. We can modify the above:
```yaml
contexts:
  main:
    - match: (?=a)
      push: expect-a
    - match: \d+
      scope: constant.numeric
  expect-a:
    - meta_scope: meta.abc
    - match: a
      scope: first
      set: expect-b
  expect-b:
    - meta_scope: meta.abc
    - match: b
      scope: second
      set: expect-c
  expect-c:
    - meta_scope: meta.abc
    - match: c
      scope: third
      pop: true
```
### Bail Outs

Always remember that you're writing a parser for a set of partially valid syntax fragments. The normal mode of operation is that someone is actively typing new text. For this reason, you need to make sure that any and all stateful contexts you use have aggressive "bail-outs" for when something goes wrong. As a rule of thumb, if there's a case where a compiler's parser would have produced an error, your syntax mode should handle that case by `pop`ing back to `main`.

Consider the example from above. Imagine the user is typing into the following buffer:

```
42
ab
12
```

Even if the user is actively typing `c` following `b`, it would be a terrible experience for the scoping on `12` to shift back and forth as they type in the middle. For this reason, you should always end your mid-state scopes with a lookahead match on `(?=\S)` which pops out of the state chain. Like so:
```yaml
contexts:
  main:
    - match: (?=a)
      push: expect-a
    - match: \d+
      scope: constant.numeric
  bail-out:
    - match: (?=\S)
      pop: true
  expect-a:
    - meta_scope: meta.abc
    - match: a
      scope: first
      set: expect-b
    - include: bail-out
  expect-b:
    - meta_scope: meta.abc
    - match: b
      scope: second
      set: expect-c
    - include: bail-out
  expect-c:
    - meta_scope: meta.abc
    - match: c
      scope: third
      pop: true
    - include: bail-out
```
Now, when the user starts with the following buffer:

```
42
12
```

they can place the cursor on the second line and type `a`, and the scoping on `12` will remain unchanged. Getting this wrong is one of the easiest ways to create a terrible experience for users of your mode without even realizing it yourself.
## Test Partially-Valid Buffers

Don't just test that correctly-written constructs were scoped appropriately. Test that partial fragments also scope in a reasonable way. Test that unrelated constructs which come lexically after these partially-valid constructs are also scoped correctly.
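Reusing the `abc` example from earlier in this thread, a partial-fragment test might look like this (the test header path is illustrative):

```
// SYNTAX TEST "Packages/ABC/ABC.sublime-syntax"
a
// <- first
42
// <- constant.numeric - meta.abc
```

The last assertion checks that a construct after the half-typed `a` is still scoped normally and hasn't been swallowed by the `meta.abc` state.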
## Organize

There are no bonus points (or bonus performance) for brevity. Organize your `sublime-syntax` file the way you would organize any serious bit of code. Use spaces, newline breaks (`(?x)` in your regex patterns can be invaluable!) and comments to your advantage.
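For instance, a dense number pattern becomes self-documenting when split up with extended mode (the scope name and pattern here are illustrative):

```yaml
- match: |-
    (?x)                    # extended mode: whitespace and comments are ignored
    \b \d+                  # integer part
    (?: \. \d+ )?           # optional fraction
    (?: [eE] [-+]? \d+ )?   # optional exponent
    \b
  scope: constant.numeric.example
```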
## Use the Stack

Long chains of `set` contexts can be difficult to follow and impossible to understand at a glance. When you have a construction where you expect a list of elements in sequence, put them all onto the stack at once. The stack will unwind as the elements are recognized. Example:
```yaml
contexts:
  else-pop:
    - match: (?=\S)
      pop: true
  function-body:
    - match: \{
      scope: punctuation.section.braces.begin.js
      set:
        - meta-function-body
        - expect-closing-brace
        - statements
        - directives
  meta-function-body:
    - meta_scope: meta.function.body.js
    - include: else-pop
  expect-closing-brace:
    - match: \}
      scope: punctuation.section.braces.end.js
      pop: true
    - include: else-pop
  statements:
    - match: (?=\})
      pop: true
    # ...
  directives:
    - match: "'use (?:strict|asm)';"
      scope: keyword.other.directive.js
    - include: else-pop
```
As an added benefit, most of these contexts can be reused:
```yaml
  statements:
    # ...
    - match: \{
      scope: punctuation.section.braces.begin.js
      push:
        - meta-block
        - expect-closing-brace
        - statements
    # ...
  meta-block:
    - meta_scope: meta.block.js
    - include: else-pop
```

And they can be easily composed:

```yaml
  expression:
    # ...
    - match: \bfunction\b
      scope: meta.function storage.type.function.js
      set:
        - function-body
        - function-parameters # implementations omitted
        - function-name
    # ...
```
As a bonus, states stacked this way are implicitly optional. If one is omitted, the highlighter will move on to the next without interruption. For instance, in the last example, the construction will be parsed correctly whether or not the author supplies a function name.
(At first, I was concerned about the efficiency of all of the stack manipulation and context switching, but in practice it seems to be just as fast as the traditional method. The JS+Babel+React+Flow syntax I've developed this architecture for runs about 15% faster than the stock JS syntax.)
## Preprocessing with YAML Macros

Syntax definitions can have a lot of repetitive elements to them. Sometimes, these elements are simple enough that you can simply `include` a utility context:

```yaml
contexts:
  else-pop:
    - match: (?=\S)
      pop: true
  order-expression:
    - match: (?i)\b(?:ASC|DESC|NULLS|FIRST|LAST)\b
      scope: keyword.other.sql
    - include: else-pop
```

But this example contains another common SQL idiom: keywords are always case-insensitive and surrounded by word breaks, so the syntax will contain many repetitions of the `(?i)\b(?:...)\b` pattern. A typo could easily slip into one such repetition. So why write it yourself each time? Instead, use YAML tags and a preprocessor:
```yaml
order-expression:
  - match: !word ASC|DESC|NULLS|FIRST|LAST
    scope: keyword.other.sql
  - include: else-pop
```

The macro:

```python
# macros.py
def word(match):
    return r'(?i)\b(?:%s)\b' % match
```
The engine:

```python
# build.py
import sys
from os import path

import yaml

import macros

filename = sys.argv[1]
output_path, extension = path.splitext(path.basename(filename))
if extension != '.source':
    raise ValueError("Not a .source file!")

input_file = open(filename, 'r')
output_file = open(output_path, 'w')

PREAMBLE = '%YAML 1.2\n---\n'

def apply_transform(loader, node, transform):
    try:
        if isinstance(node, yaml.ScalarNode):
            return transform(loader.construct_scalar(node))
        elif isinstance(node, yaml.SequenceNode):
            return transform(*loader.construct_sequence(node))
        elif isinstance(node, yaml.MappingNode):
            return transform(**loader.construct_mapping(node))
    except TypeError as e:
        print('Failed to transform node: {}\n{}'.format(e, node))

def get_constructor(transform):
    return lambda loader, node: apply_transform(loader, node, transform)

# Register every callable in macros.py as a YAML tag constructor
for name, transform in macros.__dict__.items():
    if callable(transform):
        yaml.add_constructor(
            '!' + name.lstrip('_'),
            get_constructor(transform)
        )

syntax = yaml.load(input_file, Loader=yaml.Loader)
output_file.write(PREAMBLE)
yaml.dump(syntax, output_file)
```
Another example:

```python
def meta(name):
    return [
        { 'meta_scope': 'meta.%s.sql' % name },
        { 'match': '', 'pop': True },
    ]
```

```yaml
block:
  - match: !word BEGIN
    scope: keyword.control.sql
    push:
      - !meta block
      - statements
```
We can go further. In Oracle SQL, any identifier may be wrapped in optional double quotes:

```sql
create table mytable ... ;
create table "mytable" ... ;
```

Implementing both versions is possible:

```yaml
expect-table-name:
  - match: \b{{ident}}\b
    scope: entity.name.table.sql
    pop: true
  - match: (")([^"]+)(")
    captures:
      1: punctuation.definition.string.begin.sql
      2: entity.name.table.sql
      3: punctuation.definition.string.end.sql
    pop: true
  - include: else-pop
```
But then we have to do this for every single type of identifier -- procedure names, aliases, variables, etc. So we write a macro for it:

```python
def expect_identifier(scope):
    return [
        { 'match': r'\b{{ident}}\b', 'scope': scope, 'pop': True },
        {
            'match': r'(")([^"]+)(")',
            'scope': 'string.quoted.double.sql',
            'captures': {
                '1': 'punctuation.definition.string.begin.sql',
                '2': scope,
                '3': 'punctuation.definition.string.end.sql',
            },
            'pop': True,
        },
        { 'match': r'(?=\S)', 'pop': True },
    ]
```
Define all of these contexts the same way:

```yaml
expect-table-name: !expect_identifier entity.name.table.sql
expect-alias: !expect_identifier entity.name.alias.sql
```

Or just use the macro inline:

```yaml
declarations:
  # ...
  - match: !word TYPE
    scope: storage.type.sql
    push:
      - !meta declaration.type
      - type-definition-value
      - !expect_keyword IS
      - !expect_identifier entity.name.type.sql
  # ...
```
Using macros, you can define very complicated constructs in a compact fashion that is easy to understand and reasonably robust against invalid input:

```yaml
- match: !word FOR
  scope: keyword.control.sql
  push:
    - !meta control.for
    - !expect_keyword LOOP
    - !expect_keyword END
    - statements
    - !expect_keyword LOOP
    - expression
    - !expect [ \.\., keyword.operator.other ]
    - expression
    - !expect_keyword REVERSE
    - !expect_keyword IN
    - !expect_identifier variable.other.sql
```
## Keep matches concise

Where possible, avoid match patterns like `.*` that match the whole line (i.e. for line comments). Instead, use a `meta_scope` and just match the character sequences that need specific scoping, or the end of the line (`$\n?`) that will pop the context.

Why? Because sometimes the syntax could get embedded in another syntax, and that syntax might want to use `with_prototype` to pop earlier than would otherwise be possible if the match pattern consumes the whole line.
For example, if you have a language that defines line comments like this:

```yaml
comments:
  - match: (//).*$\n?
    scope: comment.line.double-slash.example
    captures:
      1: punctuation.definition.comment.example
```

and then want to include it in HTML, for example using a well-known example of PHP style markers:

```yaml
embed:
  - match: '<\?'
    scope: punctuation.section.embedded.begin.example
    push: [end_embed, scope:base.scope.for.example.above]
    with_prototype:
      - match: '(?=\?>)'
        pop: true
end_embed:
  - match: '\?>'
    scope: punctuation.section.embedded.end.example
    pop: true
```
then you will find code like:

```html
<?php // comment here ?><div>HTML here</div>
```

where the `with_prototype` didn't apply, because it doesn't interrupt a match: the `(//).*` from the comment pattern will match `// comment here ?><div>HTML here</div>`, so there is no way for the `with_prototype` to see the `?>`, which means the whole line will be erroneously scoped as a comment.

So the proper way to declare the `comments` context would be:
```yaml
comments:
  - match: '//'
    scope: punctuation.definition.comment.example
    push:
      - meta_scope: comment.line.double-slash.example
      - match: $\n? # consume the newline so that completions aren't shown at the end of the line when typing comments
        pop: true
```
## Unusual Syntaxes and Their Pitfalls

Lately I've started working on a syntax file for an Interactive Fiction (text adventures) programming language which tries to hide the complexity of programming and look like natural English as much as possible. Because of its scarce use of punctuation, I've found myself facing some unexpected problems - the full story here:

- https://forum.sublimetext.com/t/unusual-syntax-help-needed-with-scopes-naming/36354/12
- https://forum.sublimetext.com/t/syntax-definitions-how-to-force-pop-out-of-the-stack/36376/6

In particular, I was struggling to handle closing block statements of the type `END EVERY identifier.`, where both the identifier and the dot terminator were optional. After running in circles for many hours (and due to lack of experience with ST syntaxes), I've managed to achieve it, and would like to share some lessons learned here - they might not be the best solutions, but surely they represent the problems newbies are going to face.
### Forceful Popping

First of all, I faced the problem of how to pop out of the stack in certain (unavoidable) situations, and I was kindly introduced to the `else_POP` technique mentioned earlier in this thread. Still, the chain of included contexts was not behaving as expected.

The main problem was tied to unconsumed whitespace: the `else_POP` and `immediate_POP` tricks to force your way out of a stacked context can cause a premature pop if there is still some whitespace floating around which isn't captured by the various included contexts. In the end, I had to ensure that the pattern that would `set` the context on the stack would also eat up (and discard) any trailing whitespace.

Another problem was the `END EVERY` construct having two optional trailing elements (identifier and dot terminator). To prevent loose scopes floating about, I had to implement an extra check for an `END EVERY` statement followed by only whitespace (i.e. neither ID nor terminator).

Here is the code of how I've managed to work around the problem:
```yaml
class:
  - match: (?i)\bEVERY\b
    scope: storage.type.class.alan
    set: [class_body, class_identifier]

class_body:
  - meta_scope: meta.class.alan
  - include: class_head
  - include: class_tail

class_head:
  - match: (?i)\bIsA\b
    scope: storage.modifier.extends
    push: inherited_class_identifier
  # TODO: inheritance

class_tail:
  # ===========================
  # END EVERY => no ID & no `.`
  # ===========================
  # When END EVERY is followed by neither ID nor dot, we must capture it
  # separately to avoid stray scopes after it...
  - match: (?i)\b(END\s*EVERY)\b(?=\s*$)
    captures:
      1: keyword.control.alan
    pop: true
  # ==========================
  # END EVERY => ID and/or `.`
  # ==========================
  - match: (?i)\b(END\s*EVERY)\b\s* # <= must consume all whitespace!
    captures:
      1: keyword.control.alan
    set:
      - meta_content_scope: meta.class.alan
      - include: class_tail_identifier
      - include: terminator
      - include: force_POP

terminator:
  - match: '\.'
    scope: punctuation.terminator.alan
```
It might not be the best solution, but it doesn't have to be: I'm in the early stages of creating this syntax, and sometimes you just need to get the job done and carry on drafting - and things can turn out frustrating when you can't pinpoint what is breaking the expected behavior.
### Watchlist of Common Newbie Mistakes

The lessons I've learned from tackling this problem, which might help to better understand which context(s) are causing a problem:

- Always beware of any leftover whitespace from regex match patterns:
  - Try to consume trailing whitespace by capturing it with a discarded group.
  - `captures` is better than `scope` because it allows you to add extra discarded groups for testing whether leading/trailing whitespace is a problem.
  - Lookaheads are your best friends when it comes to handling optional syntax elements at the end of a `meta` scope.
- While working on a syntax's contexts:
  - Annotate the stack level in side comments (it's so easy to lose track of how deep in the stack each context and included statement is).
  - For reusable syntax-element contexts, consider creating both a popless version and another one that pops out of the stack (sometimes you might be including them, other times pushing/setting them).
- In the absence of a context-stack debugger:
  - Add some arbitrary label to scopes in order to be able to track in the highlighted code which context is active (when there are variants) - e.g. by using `keyword.control.NONE.alan` and `keyword.control.IdOrDOT.alan` I was able to uncover the leftover whitespace problem which was causing the wrong context to be used.
I've learned these small lessons the hard way, by running in circles for hours because I wasn't mentally tracking the stack levels correctly. Also, I struggled a lot with `include` vs `push` vs `set` choices, trying to adapt the contexts to my own likings and pre-existing reusables, which turned out to be a very bad approach.

When starting to deal with lots of reusable contexts and context nestings, it can quickly become a complex task to keep a clear mental picture of what is actually going on at the parser level. Unfortunately, we can't escape the unpleasant task of having to mentally track what the regex patterns are capturing, consuming, and discarding, and how the various stacked contexts loop until they pop out.

Surely, as experience in working with syntaxes starts to set in, one eventually develops the right mind frame for laying the foundations of a syntax on the right foot. The problem is that if the whole experience gets frustrating and no solution seems possible, one might just give up and never reach that required experience (after all, experience breeds on successes as well as failures).
### A Syntax CLI Simulator/Debugger Would Be Invaluable

Another lesson I've learned: if ST had a way to expose to the user the syntax parser's stack state, its queues, and some debug info about the text being processed and the regex matches and failures, it would be much easier to trace where our custom syntaxes fail.

Any chance that (somewhere in the future) ST might also ship with a command line tool to debug syntax definitions? A console app that takes a syntax file and a source test file as input and spits out two files: a scoped version of the source file (an XML-like doc tree) and a log file listing all the inner workings of the parser engine. This would be an invaluable tool both for learning how to build syntaxes and for fixing problems.

Learning to create custom syntaxes should be a pleasant experience, not a frustrating one. The official documentation on the topic is not exactly exhaustive (far from it), and most existing syntaxes are usually too large to be used as learning examples to start with.
### Syntect: A Fallback Debugger

@kingkeith (who I believe might be @keith-hall here on GitHub) pointed out to me the availability of syntect, a "Rust library for syntax highlighting using Sublime Text syntax definitions" which offers debugging features via its `--debug` argument.

While there is no assurance that its syntax parser follows that of ST 100% (and small edge cases could cause differences in behavior), it seems to support ST syntax files very well, which means it can be a valuable tool for debugging a syntax's inner workings (pending a dedicated debug tool from ST3).
I know that I ought not to consume space in this thread with comments, but I can't refrain from thanking @djspiewak enough for his "Stateful Chaining" advice - after reading it I've managed to correct my syntax draft to handle code fragments without breaking the user experience (whereas before I was working on the assumption that all code would always be well-formed), and it allowed me to handle the stack and reusable contexts better.

I really wish that I had found a link to this issue in the ST documentation on syntaxes - it would have saved me hours of attempts, and spared me stress-induced psychosomatic complications. I must thank @kingkeith for having brought it to my attention (and for kindly helping me out, along with @ThomSmith, to work my way through the impasses of my first big syntax creation).
I'd also like to add a further tip on stack-popping tricks.

## Stack Popping Tricks

I've found the tricks in this thread on how to pop out of the stack very useful, and I'd like to add another variant which I ended up needing in some contexts, along with some comments on the regexes.

### Force POP

Wherever included, this context will pop immediately:

```yaml
force_POP:
  - match: ''
    pop: true
```

My understanding is that this is a regex that doesn't do anything (no match content, no consumption) and always succeeds, thus forcing the `pop: true` to act right on the spot.
### Else POP

As already mentioned above by @Thom1729 (equivalent to @djspiewak's `bail-out`):

```yaml
else_POP:
  - match: (?=\S)
    pop: true
```

This context is great as an "else" condition for popping out of a context that could loop forever. Unlike `force_POP`, it will work in those contexts that need to iterate a few times before exiting.

Its regex pattern matches nothing followed by non-whitespace - since it's a lookahead assertion, it doesn't consume anything either. My understanding is that it's just a lookahead operation carried out at the current position of the highlighter's buffer. Therefore, if there is some whitespace ahead, this doesn't pop out.

If my understanding is correct, the difference in behavior between this and `force_POP` is that `else_POP` will not pop out while there are still non-whitespace tokens floating around - I couldn't find any details on how the syntax parser actually handles resuming the context loop after a positive match (i.e. whether after a match it starts again from the top of the current context's pattern list, or just carries on to the next pattern in the list), but my impression is that whitespace is silently consumed by the parser (unless a custom pattern consumes it), which means that `else_POP` will not necessarily pop out of the context the first time it's encountered.
### End-of-Line POP

I've come across some situations in which neither `else_POP` nor `force_POP` did the job, and used instead:

```yaml
eol_POP:
  - match: '$'
    pop: true
```

This is useful in situations when you need to pop your way out of the current context/stack when the end of line is reached. Usually it works well with syntaxes where line breaks are not optional and where a number of optional elements might follow, or just to handle incomplete code fragments.

NOTE - I'll update this post to provide more detailed/correct information regarding the regexes and parser internals if I get new info about them. Also, I'll add more stack-popping hacks if I encounter them.
## Syntax Test Files

Syntax test files are a great functionality for automating syntax tests. Unfortunately, the official ST documentation on the topic doesn't cover scope selectors in depth. While the examples are all there, I've found Textmate's documentation on scope selectors quite useful.

Specifically, I've learned more about the syntax for excluding or grouping scope selectors, via documented examples.
### Testing Against Scope Spilling

I've learned how to use syntax test files to check that scopes don't spill over to neighbouring elements and/or whitespace.

For example, in this syntax example I'm defining a class in the Alan language:

```
EVERY cow IsA object.
   -- [some definitions]
END EVERY cow. -- a comment
```

where everything from `EVERY` to `END EVERY [ID][.]` would be scoped as `meta.class`.

While working on it, I had to deal with scopes that were eating up the trailing whitespace. I've learned to test for those spills using scope selector subtraction:

```
EVERY cow IsA object.
--    ^^^ meta.class.alan entity.name.class.alan
--       ^^^^^^^^^^^^ meta.class.alan - entity.name.class.alan
```

The above test-file snippet checks that the class name scope of `cow` doesn't spill over to the trailing whitespace or the following elements. The test checks that `cow` is correctly scoped as both `meta.class.alan` and `entity.name.class.alan`, while what follows it (space included) should be scoped as `meta.class.alan` but NOT `entity.name.class.alan` - which is achieved via subtraction (`-`) of the scope.
Textmate's documentation on the topic states:

> we can subtract scope selectors to get the (asymmetric) difference using the minus operator.
### Testing Meta Scope Exiting

As a further example, I'll show how to test whether the meta scope for the class has been duly exited when expected:

```
END EVERY cow. -- a comment
--^^^^^^^^^^^^ meta.class.alan
--            ^^^^^^^^^^^^^ - meta.class.alan
```

Here we're checking that everything up to the terminator dot (included) is scoped as `meta.class.alan`. If the syntax is working correctly, everything following the terminator dot (excluded) should be scoped only as `source.alan` - therefore, we subtract the meta class scope from the (implicit) base scope: `- meta.class.alan`.

Note that the following two scope selectors are equivalent:

```
END EVERY cow. -- a comment
--            ^^^^^^^^^^^^^ - meta.class.alan
--            ^^^^^^^^^^^^^ source.alan - meta.class.alan
```

... except that `source.alan - meta.class.alan` is more verbose and rather pointless - except in cases where the tested syntax might be included in other syntaxes (see @keith-hall's comments below).
Also note that you can test with carets (`^`) beyond the actual contents of the code line being tested:

```
END EVERY cow.
--^^^^^^^^^^^^ meta.class.alan
--            ^ source.alan - meta.class.alan
```

In the above example the final `^` is testing a non-existing character beyond the dot terminator; nevertheless the scope test is correct (you can change the test scopes and verify it yourself). This means that it is actually testing that the meta scope effectively ends with the `.` terminator.
Note that it's mostly pointless to check the "base" scope unless you're embedding another syntax, so `source.alan - meta.class.alan` can be changed to simply `- meta.class.alan`. Then, one can find that if the meta scopes are tweaked slightly, the negative assertion holds less merit and it becomes harder to identify when it doesn't really prove anything any more - it can be more useful to just check for `- meta.class` or even just `- meta`, depending on the circumstances.

And yes, a `^` assertion that points to a character after the `\n` on the line being tested just asserts against the `\n` position, i.e. EOL.
Thanks for the clarification @keith-hall, I didn't realize you could actually subtract without declaring a base scope. I'll edit the example to clarify this, but at the same time I think it's worth leaving a full `source.alan - meta.class.alan` in the example for learning purposes too, just to show what is really going on. Unfortunately the documentation on scope selectors is really thin, and most existing syntaxes are uncommented. For example, I've noticed more complex scope selector cases in various color schemes and settings, using groups via parentheses, but it's not easy to work out how these groups actually affect scope selection.
## Changing Regex Mode in Variables

When declaring variables, don't use `(?x)` to switch to multiline/extended mode; instead wrap the pattern in `(?x: ... )`, to avoid problems where one may expect not to be in extended mode and would have to check the variable to know whether it changed any options. The same applies to ignore-case mode as well as any other flags which affect how the regex pattern is parsed.
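A sketch of the difference, using a hypothetical identifier variable:

```yaml
variables:
  # Risky: after {{ident_risky}} is substituted into a larger pattern,
  # the bare (?x) keeps extended mode on for the rest of that pattern.
  ident_risky: (?x) [A-Za-z_] [A-Za-z0-9_]*
  # Safe: the flag applies only inside the group, so the embedding
  # pattern's mode is left untouched.
  ident: (?x: [A-Za-z_] [A-Za-z0-9_]* )
```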
## Making Variables Atomic

Related to the above, and I'm not sure if this has already been mentioned (I didn't find it with a quick search, but am on mobile and didn't try too hard), but it is also useful to wrap variable declarations in a non-capturing group, so that when referencing them with behavior like `{{example_var}}{2}`, it is unambiguous and doesn't just repeat the last token declared inside that variable.

I.e. `example_var: (?:\w+[.:])` as opposed to `example_var: \w+[.:]`.
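In other words (variable names hypothetical):

```yaml
variables:
  # {{loose}}{2} expands to \w+[.:]{2}, so the quantifier only
  # repeats the final [.:] character class.
  loose: \w+[.:]
  # {{atomic}}{2} expands to (?:\w+[.:]){2}, repeating the whole
  # variable as intended.
  atomic: (?:\w+[.:])
```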
In addition to the suggestion to use `(?x:...)`:

## Use the Chomping Indicator in Block Scalars

When using block scalars for your regular expressions, make it a habit to always use `|-` (or `>-`) with the hyphen chomping indicator to strip the trailing newline. While whitespace is ignored in extended mode, in variable definitions where you want to ensure the mode applies only to the variable itself, it can be detrimental to have a trailing newline appended to your variable text.
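A sketch (hypothetical variable):

```yaml
variables:
  # The `|-` chomping indicator strips the trailing newline, so nothing
  # outside the (?x:...) group leaks into patterns embedding {{number}}.
  number: |-
    (?x:
      \d+ (?: \. \d+ )?
    )
```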
## Syntax Tests: Use of the `&` Operator in Scopes

While working on syntax tests, I discovered an undocumented feature, i.e. that it's possible to use the `&` operator in the test scopes. E.g.:

```
// "string"
//<- punctuation.definition.comment.begin & comment.line
```

Without the `&` you'd get an unmatched-scope error, because the scopes are listed in inverted order (i.e. `comment.line` comes first).

I've found this rather useful, especially to keep the same scopes aligned with themselves for easy reading of the test sources.

I wonder if there are other operators besides `-` and `&` which can be used in syntax tests.

Syntax tests work with normal scope selectors, and thus all their available operators.
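If that's the case, other selector operators should also be usable in tests; for example, alternation with grouping (an untested sketch, extending the example above):

```
// "string"
//<- punctuation.definition.comment.begin & (comment.line | comment.block)
```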
## Rule loop processing order within a context

As the documentation states, "When a context has multiple patterns (rules), the leftmost one will be found." It's also important to add that the rule loop is reset when a pattern (rule) matches. So with an example context like:

```yaml
line:
  - match: A # 1
    scope: meta.A
  - match: B # 2
    scope: meta.B
```

and a line:

```
AB
```

the parser will try rule #1, which will match; then it will reset the loop and try rule #1 again. Only then will it progress to rule #2. The reason for this is that only rules that consume ZERO characters (either with an explicit empty match like `''`, a lookahead, or for example a lone `$` or `^` pattern) advance the rule loop.

So you can't think of rules as being processed unconditionally from top to bottom and assume, for example, that this would allow for matching tokens in a specific order by arranging the rules in a certain way.
## Rule handling around the beginning/end of the line

Once all characters on the line are consumed and there aren't any more characters to match on that line, the engine will run through the loop once again, matching against a special end-of-line "character". This character can't be explicitly consumed, so the only way to get past it is to let the engine go through all the rules for it to be consumed (the rules can still match the end of line with patterns like `$`, but since those are non-consuming, the rule loop will advance and eventually reach the end).

Example:

```yaml
line:
  - match: '$' # 1
    scope: meta.eol
  - match: 'A' # 2
    scope: meta.A
```

with line:

```
A
```

Engine steps:

1. does not match rule #1 and advances to rule #2
2. matches character `A` with rule #2 and resets the loop
3. (optionally goes through the whole loop matching a newline, if one exists)
4. matches EOL with rule #1 and, since the match didn't consume anything, advances to rule #2
5. does not match rule #2
6. the loop ends and, if there is another line to match, the engine advances to it
> matches EOL with rule #1 and, since the match didn't consume anything, advances to rule #2

I don't think this is necessarily true. My understanding here was that rule 1 matches, consumes no characters, and then ST forcibly moves to the next character to prevent an infinite loop. The notable difference here is that rule 2 isn't tried. You can test that with a different non-consuming pattern like `(?=.)` (or an empty pattern).

Edit: This has turned out to be not true, as evidenced by the following syntax definition marking everything as `invalid`:

```yaml
- match: (?=.)
- match: .
  scope: invalid.test
```
It would make sense to move these tips into a `./CONTRIBUTING.md` file of this repo to keep them up-to-date with the current release on the `dev` channel.

> It would make sense to move these tips into a ./CONTRIBUTING.md file of this repo to keep them up-to-date with the current release on the dev channel.
A good solution would be to create a project Wiki for this repository (see my proposal at #1522), and then move the tips into the Wiki. The advantage of using a Wiki is that it allows organizing the topics into multiple pages, and it can be set to be editable by anyone.

The only downside I can think of is that although repo Wikis are repositories, only collaborators can push changes, so page editing by non-collaborators has to be done via the web UI.
Also worth mentioning, the following repository was recently (Mar. 2020) created to address scope naming guidelines:
https://github.com/SublimeText/ScopeNamingGuidelines
But as of today the project seems stale (three commits only).
Another older (and stale) project along those lines:
https://github.com/gwenzek/sublime-syntax-manifesto
I think that using the Wiki of this project would be preferable, since this is an official ST repository, whereas third party repositories might not receive the same attention.
> But as of today the project seems stale (three commits only).

There was work in a pull request which I have just merged, if you want to check it out. It could definitely move faster, but it is not meant to take on the tips and advice of this issue here.

Instead, I would rather add them to the community docs at https://github.com/sublimetext-io/docs.sublimetext.io in a more orderly fashion, or, as you suggested, as a page on the wiki here. I think that the community docs would be more appropriate, since the tips apply to syntax definition development in general and not to the default packages specifically.
## Use plural names for non-popping contexts and singular for popping contexts

Plural makes it clear that an included context can handle multiple tokens without popping or setting away from the current context (it may push another context onto the stack, though).

Singular makes it clear that the current context will be left as soon as a single token is matched, either by popping or by setting another context onto the stack.

By doing so, context names such as `...-pop` can be avoided. This is useful as some syntaxes consist of nearly only popping contexts, so we would otherwise find `-pop` in nearly every line.
### Example 1

A `strings` context can be included to handle an arbitrary number of quoted string tokens without leaving the context it is included in. The string body context (singular) is popped as soon as the closing quotation mark or an illegal eol is matched.

```yaml
literal-double-quoted-strings:
  - match: \"
    scope: punctuation.definition.string.begin
    push: literal-double-quoted-string-body

literal-double-quoted-string-body:
  - meta_include_prototype: false
  - meta_scope: meta.string string.quoted.double
  - match: \"
    scope: punctuation.definition.string.end
    pop: 1
  - include: illegal-newline
  - include: literal-string-escapes
```
### Example 2

The following context (singular) handles a single statement (a method declaration) by matching each possible term one after another.

```yaml
member-maybe-method:
  - meta_include_prototype: false
  - match: ''
    set:
      - method-block
      - method-attribute
      - method-array-modifier
      - method-signature
      - method-modifier
```
## Use primarily (only) named contexts

Reasons:

- Only named contexts can be extended or overridden by an inheriting syntax definition.
- ST's scope name popup (ctrl+alt+shift+p) displays the name of the context a token is matched by. If a syntax contains only named contexts (and wasn't pushed `with_prototype`), it is much easier to debug highlighting issues, as the causing context can be identified much more easily.

When creating contexts, ask yourself what you'd expect from a syntax if you needed to inherit from it to create your own extended variant:

- Is it easy to override certain rules?
- Which contexts must exclude possible prototypes? (`meta_include_prototype: false`)
- Does your syntax support string interpolation? Can an inheriting syntax easily implement it?

See HTML.sublime-syntax for instance.
Use variables for large lists of fixed tokens (builtin functions etc.), because:

- they can easily be replaced by inheriting syntaxes
- the contexts themselves stay readable, avoiding large blocks of patterns

See: CSS.sublime-syntax for reference.
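A minimal sketch of the idea (the function list and scope are illustrative):

```yaml
variables:
  # An inheriting syntax only needs to override this one variable
  # to change the recognized builtins.
  builtin_functions: (?:abs|ceil|floor|max|min|round)

contexts:
  main:
    - match: \b{{builtin_functions}}\b
      scope: support.function.builtin.example
```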