tree-sitter/tree-sitter-c-sharp

Move `src/parser.c` to git LFS

tamasvajk opened this issue · 24 comments

I started receiving the below warning on pushes:

remote: warning: See http://git.io/iEPt8g for more information.
remote: warning: File src/parser.c is 50.90 MB; this is larger than GitHub's recommended maximum file size of 50.00 MB
remote: warning: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.

I'm not sure what would be the impact of moving to LFS. Maybe some documentation would need to be updated.

Sjord commented

We could reconsider whether parser.c should be committed at all. It is auto-generated with tree-sitter generate, so I am not sure it should be in the repository.

Not committing it would be even better. But I was under the impression that this is a de facto requirement of a tree sitter grammar repo, because this way all you have to do is check out the grammar’s git repository into a well-known location1. I'm not sure if this is a hard requirement or if we can deviate from it.

Footnotes

  1. https://dcreager.net/tree-sitter/getting-started/#installing-a-grammar

@damieng I just checked the recent size increases of src/parser.c. It looks like 168390c increased it from 30 to 50MB. It's most probably because of the [$.assignment_expression, $._expression], conflict. If I remove it, this is the message I get from tree-sitter:

❯ tree-sitter generate                 
Unresolved conflict for symbol sequence:

  '*'  _lvalue_expression  •  '='  …

Possible interpretations:

  1:  '*'  (_expression  _lvalue_expression)  •  '='  …
  2:  '*'  (assignment_expression  _lvalue_expression  •  assignment_operator  _expression)  (precedence: 0, associativity: Right)

Possible resolutions:

  1:  Specify a higher precedence in `assignment_expression` than in the other rules.
  2:  Specify a higher precedence in `_expression` than in the other rules.
  3:  Specify a left or right associativity in `_expression`
  4:  Add a conflict for these rules: `assignment_expression`, `_expression`

Why is there a conflict? Why isn't '*' _lvalue_expression reduced to _pointer_indirection_expression which has higher precedence than assignment_expression?

EDIT: For future reference, I asked the same question in tree-sitter/tree-sitter#2024.

I don't know enough about how tree-sitter resolves these precedence rules to be helpful here unfortunately (every time I read the docs I feel like I've got a good handle on it but then once I dig into our non-trivial rule set it escapes me)

Whatever we do here it might be worth adding some kind of file size echo to a text file so that we can see at a glance on PR's how a change affects large generated file sizes.

Re: parser.c and friends being part of the repository

I think nvim-treesitter and Emacs both have logic in them to compile shared libraries for parsers from files that live under src (src/parser.c and friends) fetched from grammar repositories.

In nvim-treesitter's case I think there may be logic to call tree-sitter generate under certain circumstances but that means a user needs to have the tree-sitter cli installed IIUC. Not sure of the specifics but I believe @theHamsta knows.

In Emacs 29+'s case I don't think there is any logic ATM to run the tree-sitter's generate subcommand: https://github.com/emacs-mirror/emacs/blob/f4f30ff4c44dcfdf780f1981aa541af713f2805f/lisp/treesit.el#L2684-L2696

Re: parser.c and friends being part of the repository

I think nvim-treesitter and Emacs both have logic in them to compile shared libraries for parsers from files that live under src/parser.c and friends fetched from grammar repositories.

In nvim-treesitter's case I think there may be logic to call tree-sitter generate under certain circumstances but that means a user needs to have the tree-sitter cli installed IIUC. Not sure of the specifics but I believe @theHamsta knows.

In Emacs 29+'s case I don't think there is any logic ATM to run the tree-sitter's generate subcommand: https://github.com/emacs-mirror/emacs/blob/f4f30ff4c44dcfdf780f1981aa541af713f2805f/lisp/treesit.el#L2684-L2696

We mostly use tar-balls of individual commits and download them via curl. With LFS, I don't know whether src/parser.c will end up in the tar archive (https://github.com/nvim-treesitter/nvim-treesitter/blob/master/lua/nvim-treesitter/shell_command_selectors.lua#L215-L216). If src/parser.c is missing, then we would mark the parser as requires_generate_from_grammar and users then need to have the tree-sitter-cli installed, which is already the case for some parsers.

It looks like difftastic uses tree-sitter-c-sharp and assumes src/parser.c is available.

Not really sure though...but I would be surprised if @Wilfred didn't know :)

I'm not sure how cursorless does its building, but I think it also transitively(?) uses tree-sitter-c-sharp.

There's a custom web-tree-sitter mentioned so may be Emscripten is involved in compiling parser.c and friends to produce .wasm.

May be @pokey knows the details.

pokey commented

Here's where we build the wasm:

https://github.com/cursorless-dev/vscode-parse-tree/blob/main/Makefile#L43

So, as long as parser.c is in the npm package, I think we're ok. Fwiw we do intend to migrate all our tree-sitter language dependencies to use GitHub commit sha's instead of npm versions, because the npm versions tend to be stale. I believe that means we'd run into some issues if you stopped including parser.c in the repo

That being said, I think we could probably add a step to generate parser.c before we compile the wasm, so should be ok.

But I do believe every other tree-sitter grammar repo keeps the generated parser.c in source control, so it may come as a surprise to other downstream consumers if it does not

I think this should be a discussion across tree-sitter grammars in general. Having tree-sitter-c-sharp be different/unusual from other grammars in this respect will not help adoption.

Unfortunately we don't really have a broad tree-sitter discussion channel. Perhaps we should start one.

I've create a Discord one if anyone is interested https://discord.gg/KrRa5ATb

pokey commented

Agreed. A lot of discussion seems to happen on https://github.com/tree-sitter/tree-sitter/discussions, so it's probably worth asking there. Agreed a Discord server could be useful, though. I just asked to see if there already was one (tree-sitter/tree-sitter#2027)

grafik

There is an option to include Git LFS in archives, which would maybe a possibility for everyone using tree-sitter-c-sharp via tarball. When anyone is using Git, the Git LFS requirement would propagate to the consuming repo.

I found this related discussion: tree-sitter/tree-sitter#1243.

I tried tree-sitter generate for tree-sitter-c-sharp and it really takes quite a while doesn't it:

$ time tree-sitter generate

real	4m2.854s
user	3m52.457s
sys	0m10.372s

Memory usage exceeded 50GB here too...

Yes, it has grown quite considerably lately. We're going to have to make some trade-offs between handling every single scenario possible in C# vs performance I suspect.

Somewhat related, I noticed this issue at the tree-sitter-swift repository: alex-pinkus/tree-sitter-swift#149

That's a pretty interesting solution, I'd be up for doing that here. I wonder if we can proactively reach out to dependencies we know will be affected.

The issue of it being non-trivial to contact users of one's grammar seems to be the sort of thing that many grammars will / do face.

I wonder if there is a good approach to this.

We started a sticky issue where we have began to announce potential upcoming changes and asked the users we know about to subscribe to it and let us know if things we're planning on could be a problem: sogaiu/tree-sitter-clojure#33

We tried to ascertain who was using us and then went around to the folks we came up with, but later it turned out there were other folks.

From the perspective of a "user" (e.g. nvim-treesitter, Emacs 29+, difftastic, helix-editor, etc.) though, I would imagine that there would be a preference to not have different methods of being up-to-date about upcoming changes. Some kind of unified way seems like a better deal.

I wonder if there's something that can be done at the tree-sitter repository...like a dedicated thread(?) / discussion per grammar? (I haven't found the Discussions UI / UX to be that great though -- stuff seems to get easily hidden and hard to search -- though possibly that has changed over time.)

I wonder then if we flip what tree-sitter-swift did on its head and instead move development to a new dev branch that PR's need to be made against that has no generated files.

We'd then adopt the tree-sitter-swift approach that when a release is tagged we generate the files and merge them into master (rather than with-generated-files)

That way everyone who is relying on master and the generated artefacts today is unaffected other than lagging behind active development.

Given there's only a small handful of contributors to the repo the work in us changing from developing against master to developing against dev or develop would be quite small.

Why not the other way around? Unless the user knows they can expect generated files to be inside the repository they don't expect it. Also updating a generated file adds major noise to the commit history.

Improving the memory used to generate this grammar would be great. I'm at 26.5GB right now for some architectures I don't even have the memory to generate the grammar (Arm and PPC).

We'd love to be able to improve the memory usage but the only options I'm aware of there are either:

  1. Reduce the amount of C# syntax we support
  2. Make changes to the core tree-sitter engine

We really don't want to do option 1 as it would be a breaking change and make this library less useful. As for option 2 I wouldn't know where to start and I suspect the only people who do are really busy on other things.

I get the point, I wanted to highlight the amount of memory required to generate the current grammar.

If it only would be possible to generate parts of the grammar in a separate process.

It's definitely something that needs to be done but that would happen in tree-sitter itself not the C# grammar.