rdbende/chlorophyll

Use tree-sitter

Moosems opened this issue · 9 comments

[15:00]	Moosems	Actually not :)
[15:00]	Moosems	His chlorophyll package
[15:00]	Arrinao	oh she did
[15:01]	Moosems	https://github.com/rdbende/chlorophyll/pull/23
[15:02]	Akuli	porcupine has one of these too
[15:03]	Moosems	Yep
[15:04]	Akuli	in porcupine, the first step is to turn all changes in the text widget into nice Change objects: https://github.com/Akuli/porcupine/blob/main/porcupine/textutils.py#L22-L60
[15:04]	Akuli	e.g. Change(start=[1, 0], old_end=[1, 5], new_end=[1, 4], old_text='hello', new_text='toot') means replacing 'hello' with 'toot' at the start of the file
[15:04]	Moosems	Yep
[15:04]	Akuli	the code that constructs these is puke :D
[15:05]	Akuli	it basically has to implement everything that the insert, delete and replace manual pages describe
[15:07]	Akuli	it seems to assume that you can just start lexing anywhere you want, at the start of any line?
[15:07]	Moosems	?
[15:07]	Akuli	it basically puts self.get(f"{start_line}.0", f"{end_line}.end") to pygments.lex()
[15:08]	Moosems	Yes, it only lexes what it needs to
[15:08]	Akuli	that won't work when you have a multiline string
[15:08]	Moosems	Already an issue I opened :)
[15:08]	Moosems	Which is what highlight_area() will help solve
[15:09]	Akuli	i have already "solved" it in porcupine about a year ago
[15:09]	Akuli	i say "solved" because it turned out to be a really hard problem
[15:09]	Moosems	By finding out when they start and end?
[15:09]	Akuli	i wanted it to work for all languages that pygments supports, so hard-coding something for python wasn't a solution
[15:09]	Akuli	this also applies to e.g. multiline comments in c
[15:10]	Moosems	That's why I think the user should add multi-line strings and docstrings as a parameter
[15:10]	Moosems	Like in DIP
[15:10]	Akuli	the best you can do (as far as i can tell): figure out when the lexer's internal state is the same as its starting state, and mark those places: you "can" start lexing again from any one of them
[15:10]	Moosems	And if it's an empty string it assumes there's no docstring type
[15:10]	biberao	Akuli: https://www.youtube.com/watch?v=WpAY8TGt2Ks
[15:10]	Akuli	i say "can" because even that doesn't work in all cases
[15:11]	Moosems	It's a complicated issue
[15:11]	Akuli	yes :)
[15:13]	Moosems	I wonder how VS Code does it
[15:13]	Akuli	to me the solution was to switch to a different highlighting library: tree-sitter
[15:13]	Moosems	tree-sitter?
[15:13]	Akuli	with tree-sitter you say "text from 12.34 to 56.78 was previously 'blah blah' but it is now 'blah blah'"
[15:14]	Akuli	it is designed to be used in an editor, unlike pygments
[15:15]	Moosems	How do I use it? Could you help make an MRE?
[15:15]	Akuli	sorry, i'm not going to spoon-feed you something that took me days to figure out
[15:15]	Moosems	XD
[15:16]	Akuli	it's not very straightforward because tree-sitter isn't a library just for syntax highlighting, it gives you a parse tree that contains things like "function definition" instead of things like "the def keyword"
[15:16]	Moosems	I noticed
[15:16]	Akuli	then it's your job to turn that syntax tree into whatever you want
[15:16]	Akuli	in porcupine i set up yaml files that describe how to do this, and there's one for every tree-sitter highlighted language
[15:17]	Akuli	note that porcupine still supports highlighting with pygments, it just isn't the default in e.g. python
[15:17]	Moosems	It doesn't have nearly as many languages
[15:17]	Akuli	yeah
[15:17]	Akuli	in porcupine the idea is to use tree-sitter most of the time, and fall back to pygments when a user has an exotic language
[15:18]	Akuli	this works well because your files are typically small and somewhat simple when you work in an exotic language, for any "real work" you tend to use a popular language instead
[15:19]	Moosems	Care if we steal the idea?
[15:19]	Akuli	go ahead :D
[15:19]	Akuli	you can take all the code if you want, of course :)
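The two ideas from the chat, the Change objects and telling tree-sitter "this range used to be X and is now Y", fit together fairly naturally. Below is a minimal sketch, not Porcupine's actual code. The `Change` field names are copied from the example in the chat. The helper `to_edit_kwargs` is hypothetical. The dictionary keys are chosen to match the keyword arguments that, as far as I know, py-tree-sitter's `Tree.edit` expects (byte offsets plus 0-based `(row, column)` points with the column counted in bytes). Treat that as an assumption and check it against the version you install.

```python
from dataclasses import dataclass


@dataclass
class Change:
    # Field names follow the example in the chat. Positions are [line, column]
    # in Tk style: lines count from 1, columns count from 0 (in characters).
    start: list[int]
    old_end: list[int]
    new_end: list[int]
    old_text: str
    new_text: str


def _end_point(start_point: tuple[int, int], text: str) -> tuple[int, int]:
    """Where `text` ends if it begins at start_point (columns in UTF-8 bytes)."""
    row, col = start_point
    lines = text.split("\n")
    if len(lines) == 1:
        return (row, col + len(text.encode("utf-8")))
    return (row + len(lines) - 1, len(lines[-1].encode("utf-8")))


def to_edit_kwargs(old_source: str, change: Change) -> dict:
    """Translate a Change into byte offsets and 0-based (row, column) points."""
    line, column = change.start
    old_lines = old_source.split("\n")
    line_prefix = old_lines[line - 1][:column]
    # Everything before the change: all earlier lines plus part of this line.
    prefix = "\n".join(old_lines[: line - 1] + [line_prefix])
    start_byte = len(prefix.encode("utf-8"))
    start_point = (line - 1, len(line_prefix.encode("utf-8")))
    return {
        "start_byte": start_byte,
        "old_end_byte": start_byte + len(change.old_text.encode("utf-8")),
        "new_end_byte": start_byte + len(change.new_text.encode("utf-8")),
        "start_point": start_point,
        "old_end_point": _end_point(start_point, change.old_text),
        "new_end_point": _end_point(start_point, change.new_text),
    }


change = Change(start=[1, 0], old_end=[1, 5], new_end=[1, 4],
                old_text="hello", new_text="toot")
kwargs = to_edit_kwargs("hello world\n", change)
```

If the assumption about `Tree.edit` holds, the result would be passed along as `tree.edit(**kwargs)`, and the edited tree handed back to the parser together with the new source so that only the changed part is reparsed.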

Ohh, what did I miss :)

But yeah, I'm aware of tree-sitter, and I even made a couple of these YAML files in Porcupine.
However, I put together tkcode in a couple of hours without the intention to ever maintain it (that's why I'm sometimes a bit ignorant about this project), and after I heard about tree-sitter, I didn't really care.

I actually really like this package, which is why I'm so persistent about fixing all the bugs. I updated the PR and fixed the paste issue. Will you join IRC today? Akuli is on right now.

However if you want to see this feature, go ahead! :))

I believe this feature could be really awesome if done in a Rust backend using PyO3 and the tree-sitter Rust bindings. Will have to figure out how to make it work for those who don't have Rust installed.

What do you mean by "done in a Rust backend"? Why would we need that?

Because the more that's done in Python, the slower this will be. Walking the data that tree-sitter returns takes a fair number of for loops (I believe a few are nested too), and loops like that are well known to be slow in pure Python.

So we need to make a parser that maps the captures in highlights.scm to pygments tokens.
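As a rough illustration of that mapping step, here is a minimal sketch that turns capture names of the kind found in a highlights.scm file (`@keyword`, `@function.builtin`, ...) into pygments token names written as strings. The table `CAPTURE_TO_TOKEN` and the helper `token_for_capture` are hypothetical, the table is far from complete, and the fallback rule (drop the last dotted part and retry) is just one possible design choice, not something either library prescribes.

```python
# Hypothetical, incomplete table: capture name -> pygments token name.
CAPTURE_TO_TOKEN = {
    "keyword": "Token.Keyword",
    "string": "Token.Literal.String",
    "comment": "Token.Comment",
    "number": "Token.Literal.Number",
    "function": "Token.Name.Function",
    "function.builtin": "Token.Name.Builtin",
    "type": "Token.Name.Class",
    "operator": "Token.Operator",
}


def token_for_capture(capture: str, default: str = "Token.Text") -> str:
    """Look up a capture, falling back to less specific names.

    "@function.method" is not in the table, so it is retried as "function".
    """
    name = capture.lstrip("@")
    while name:
        if name in CAPTURE_TO_TOKEN:
            return CAPTURE_TO_TOKEN[name]
        # Drop the last dotted component; a name without dots becomes "".
        name, _, _ = name.rpartition(".")
    return default
```

Keeping the token names as plain strings means the table could live in a data file, much like the per-language YAML files mentioned in the chat.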

With the plan to highlight only what's visible, this is unnecessary.
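For reference, finding what is visible could look roughly like the sketch below. It relies on the Tk text widget's `@x,y` index form, which names the character under a pixel position. The function name `visible_line_range` is hypothetical, and the widget is only duck-typed here (anything with `index()` and `winfo_height()` works), so this is an assumption about how it might be wired up rather than chlorophyll's actual code.

```python
def visible_line_range(text) -> tuple[int, int]:
    """Return (first_line, last_line) currently shown in a Tk text widget.

    `text` is expected to behave like tkinter.Text: index() resolves an
    "@x,y" pixel position to a "line.column" string.
    """
    first = text.index("@0,0")
    last = text.index(f"@0,{text.winfo_height()}")
    # Tk indices look like "12.0"; only the line part is needed.
    return int(first.split(".")[0]), int(last.split(".")[0])
```

Only the lines in that range would then need to be converted into tags, which keeps the amount of Python-side looping proportional to the window size instead of the file size.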