Auto-formatter?
matthewhuang97 opened this issue · 5 comments
Hi Evan! Was wondering if you have any thoughts on what an auto-formatter (e.g. Prettier) for Skew code might look like. I was thinking it might be fun to work on something like that for maker week or on the side even.
Feel free to close this issue whenever!
That depends on what type of formatter you want to build.
Formatters can get very complicated. This blog post about Dart's formatter is a good example of how complicated they can get. I think most of the difficulty comes from line wrapping.
Personally I prefer formatters that are simple and predictable and don't do line wrapping. The JavaScript formatter built into VSCode is a good example. It just fixes indentation and adjusts whitespace between operators, but never inserts or removes newlines.
If you want to try a formatter like that, one way to do it might be to use the Skew compiler's lexer. Skew converts the whole file into tokens before any parsing starts, so you can just stop there and get a list of tokens for the file. See `tokenize()` in `token.sk`. Then you would essentially just print the tokens back out to a file with the appropriate whitespace in between them.
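To make the "print the tokens back out" idea concrete, here is a minimal sketch in Python rather than Skew. The token pattern, the `needs_space` rules, and every name in it are assumptions made up for illustration; Skew's real lexer is much richer. Indentation is left out on purpose and newlines are passed through untouched.

```python
import re

# Hypothetical toy token pattern; anything it does not match (including
# spaces) is silently skipped. This is NOT Skew's lexer.
TOKEN_RE = re.compile(r"\n|[A-Za-z_]\w*|\d+|==|!=|<=|>=|[-+*/=<>(){},]")
BINARY_OPERATORS = {"+", "-", "*", "/", "=", "==", "!=", "<=", ">=", "<", ">"}


def tokenize(source):
    """Split source into a flat list of token strings, keeping newlines."""
    return TOKEN_RE.findall(source)


def needs_space(left, right):
    """Decide whether a single space goes between two adjacent tokens."""
    if right in (")", ","):
        return False
    if left == "(":
        return False
    if right == "(":
        # Treat `name(` as a call; only separate after an operator or comma.
        return left in BINARY_OPERATORS or left == ","
    return True


def format_tokens(tokens):
    """Print tokens back out with normalized spacing, never touching newlines."""
    out = []
    previous = None
    for token in tokens:
        if token == "\n":
            out.append("\n")
            previous = None
            continue
        if previous is not None and needs_space(previous, token):
            out.append(" ")
        out.append(token)
        previous = token
    return "".join(out)


print(format_tokens(tokenize("var x=foo( 1,2 )+y\nreturn   x")))
```

The sketch treats every `-` as a binary operator and every name before `(` as a call, so unary minus and keywords followed by parentheses would need extra rules.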
Skew's lexer tracks newlines and comments so all of the information should be there. Comments are currently converted from tokens to a property on the next token, but that should be easy to undo if you'd like.
You can track indentation using a stack of the brackets you're currently inside of. Skew's lexer also resolves angle brackets used in generic type parameter lists to `PARAMETER_LIST_START` and `PARAMETER_LIST_END` so you won't have to figure out for yourself how to distinguish those from `LESS_THAN` and `GREATER_THAN` tokens.
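A rough sketch of the bracket-stack idea, again in Python with made-up names. It assumes tokens are plain strings with `"\n"` entries for newlines, which is a simplification: with real token kinds, comparison operators would never be confused with brackets.

```python
# Hypothetical bracket table for this sketch.
OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPENERS.values())


def split_lines(tokens):
    """Group a flat token list into lines, splitting on newline tokens."""
    lines, current = [], []
    for token in tokens:
        if token == "\n":
            lines.append(current)
            current = []
        else:
            current.append(token)
    lines.append(current)
    return lines


def reindent(tokens, indent="  "):
    """Indent each line by the number of brackets still open at its start."""
    stack = []  # closing brackets we are still waiting for
    out = []
    for line in split_lines(tokens):
        depth = len(stack)
        # Closing brackets at the start of a line dedent that same line.
        for token in line:
            if token in CLOSERS and depth > 0:
                depth -= 1
            else:
                break
        out.append(indent * depth + " ".join(line) if line else "")
        # Update the stack with every bracket on this line.
        for token in line:
            if token in OPENERS:
                stack.append(OPENERS[token])
            elif stack and token == stack[-1]:
                stack.pop()
    return "\n".join(out)


tokens = ["def", "f", "{", "\n", "g", "(", "\n", "x", "\n", ")", "\n", "}"]
print(reindent(tokens))
```

Unmatched closing brackets are simply ignored here, which keeps the formatter from crashing on a file that is still being edited.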
Newlines are currently removed by the lexer inside multi-line statements (see the comment starting with "remove newlines"). I assume you'll want to preserve them instead but temporarily increase the indentation for the rest of the statement. You could consider setting a flag on the next token indicating that it's a line continuation.
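One way the continuation flag could look, sketched in Python. The `Token` class, the `is_continuation` and `starts_line` fields, and the rule "a newline after one of these operators continues the statement" are all assumptions for illustration; the actual rule Skew's lexer uses to remove newlines may differ.

```python
from dataclasses import dataclass

# Assumed rule for this sketch only: a newline right after one of these
# tokens does not end the statement.
CONTINUES_AFTER = {"+", "-", "*", "/", "=", ",", "&&", "||"}


@dataclass
class Token:
    text: str
    starts_line: bool = False      # token was first on its source line
    is_continuation: bool = False  # hypothetical flag: line continues a statement


def mark_continuations(texts):
    """Keep newline information as flags on the following token."""
    tokens = []
    previous = None
    after_newline = False
    for text in texts:
        if text == "\n":
            after_newline = True
            continue
        tokens.append(Token(
            text,
            starts_line=after_newline or previous is None,
            is_continuation=after_newline and previous in CONTINUES_AFTER,
        ))
        previous = text
        after_newline = False
    return tokens


def render(tokens, indent="  "):
    """Re-emit tokens, indenting continuation lines one extra level."""
    out = []
    for token in tokens:
        if token.starts_line:
            if out:
                out.append("\n")
            if token.is_continuation:
                out.append(indent)
        else:
            out.append(" ")
        out.append(token.text)
    return "".join(out)


texts = ["var", "x", "=", "\n", "1", "+", "\n", "2", "\n", "return", "x"]
print(render(mark_continuations(texts)))
```

This version collapses blank lines, since consecutive newlines set the same flag; a real formatter would probably want to count them.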
That's a quick sketch of how a simple formatter might work. It's how I've thought of doing this in the past when I've wanted a formatter for Skew.
If you're going to try for something more complicated, then you may need to have some form of AST that has information about higher-level syntax constructs but that doesn't discard comments. Skew's AST isn't really appropriate for this. It preserves most comments but not all of them (it's just a best-effort thing for pretty output). I've usually seen these ASTs be some form of hierarchical token tree. In that case you may only be able to reuse Skew's lexer and you may have to write your own parser or at least heavily modify Skew's parser.
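For a sense of what a hierarchical token tree might look like, here is a small Python sketch. It is only one possible shape, with hypothetical names: tokens are plain strings, comments are ordinary tokens that stay in place, and each bracketed group becomes a nested list.

```python
# Hypothetical bracket table for this sketch.
OPENERS = {"(", "[", "{"}
CLOSERS = {")", "]", "}"}


def build_tree(tokens):
    """Nest tokens by bracket group; comments are kept like any other token."""
    root = []
    stack = [root]
    for token in tokens:
        if token in OPENERS:
            group = [token]
            stack[-1].append(group)
            stack.append(group)
        elif token in CLOSERS and len(stack) > 1:
            stack[-1].append(token)
            stack.pop()
        else:
            stack[-1].append(token)
    return root


tree = build_tree(["f", "(", "a", "// note", "[", "b", "]", ")"])
print(tree)
```

The sketch does not check that a closing bracket matches the opener of its group; a real implementation would need to report or recover from mismatches.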
Good luck! Let me know if you have any more questions. Happy to help.
Ahh thanks, this is super detailed + helpful! I didn't realize that line-wrapping AST formatters were so complicated. I wasn't aware that "print the AST nicely" had so much hidden danger behind the word "nicely", when it came to line breaks :)
It definitely makes sense to start with a no-linewrap approach for what I'm thinking of. Everything you wrote up is going to be super helpful -- I'll keep you updated if I end up making significant progress on this!
I made some progress on this! Your unit tests helper class is amazing, BTW 😄
The lexer doesn't distinguish between colons that are part of a map and colons that are part of a ternary. The lexer also doesn't know if a brace is for a function body or for a map. We could do something like, "if we're in a ternary..." by setting a flag when a "?" is encountered, but then I don't think we have a good way of knowing when the ternary is over (I imagine `true ? { 0: 1 } : null` is a pretty tough nut).
Of course we could just make maps formatted as `{0 : 1}` but that's probably the lazy way 😅 and I wouldn't be too surprised if similar scenarios came up.
This makes me think we might need to get some help from the parser. What do you think? I think I'm coming up on some similar challenges that might require the help of the parser, i.e. how do we process multi-line statements that look like:
if
true { }
So I'm checking in with you to see if you have any quick thoughts off the top of your head before going in too deep! Thanks!!
A simple thing to do for colons to get past this issue for now is just to preserve whether there is whitespace on either side of the token. That will let you move on without messing up source files.
A simple heuristic for statements is to push the statement on the stack if the line starts with a statement-starting keyword (e.g. `if`). Then consider the if statement in progress until you get a `{` at the same level (not wrapped in parentheses) followed by a `}` later on. It's not perfect but it should work most of the time.
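Here is a reduced Python sketch of that heuristic. It only tracks the header part of a statement (from the keyword up to the first `{` outside parentheses) and uses a single flag instead of a stack, so nested statements inside a header are not handled. The keyword set and function name are assumptions for illustration.

```python
# Hypothetical keyword list for this sketch; Skew has more statement keywords.
STATEMENT_KEYWORDS = {"if", "while", "for"}


def header_continuations(lines):
    """For each line (a list of token strings), report whether it continues
    a statement header that started on an earlier line."""
    result = []
    in_header = False
    paren_depth = 0
    for line in lines:
        result.append(in_header)
        if not in_header and line and line[0] in STATEMENT_KEYWORDS:
            in_header = True
            paren_depth = 0
        if in_header:
            for token in line:
                if token == "(":
                    paren_depth += 1
                elif token == ")":
                    paren_depth -= 1
                elif token == "{" and paren_depth == 0:
                    # The body starts here, so the header is finished.
                    in_header = False
                    break
    return result


lines = [["if"], ["true", "{", "}"], ["x"]]
print(header_continuations(lines))
```

Lines flagged as continuations could then get one extra level of indentation. A map literal inside the condition would end the header too early, which is one of the ways the heuristic is imperfect.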
As a longer-term fix, you could consider running the parser and then throwing out the AST but having the parser "annotate" the tokens during parsing. You'd need to decide what to do if the file can't be parsed (say, the user is still in the middle of editing it and the syntax is currently invalid). I remember adding some error recovery to Skew's parser but it may not be robust enough for this use case. You could either just not format it at all when it doesn't parse, or you could still format it using the token heuristics but just not have the token annotation information.
What do you think?
Got it, I think I like the idea of working with the parser more. And yeah, I agree it'll be much easier to work on files that pass parsing.
On the note of AST / the parser, I checked in with my teammates today, and it seems like people would really love something more powerful with line-wrapping -- and someone showed me https://prettier.io/docs/en/plugins.html. I might try to take a stab at getting some sort of AST to pipe into Prettier, which would then do a ton of the work for this!