An open source tree-sitter package manager 🌲

Primary language: Nix. License: Mozilla Public License 2.0 (MPL-2.0).

TSPM 🌲

Status

Currently, TSPM is a medium-size collection of grammars hosted in

It's unstable for now and entirely focused on preparing packages for use in Helix. See the scope docs for more information.

Registry

TSPM publishes artifacts to a CDN-like registry. Each item in the registry has a URL of the following form:

https://pkgs.tspm.io/<language>/<owner>/<revision>-<tree-sitter-version>-<abi-version>-<checksum>-<format>
  • language - the language the grammar parses, for example: typescript
  • owner - the owner of the grammar as declared in grammars.toml
  • revision - the git revision of the grammar, for example: b3d4a7f14537ecb1eedc75d5e273dd3ce2887df5
  • tree-sitter-version - the version of tree-sitter-cli used to generate the artifact, for example: 0.20.6
  • abi-version - the tree-sitter ABI version number used to generate the artifact, for example: 13
  • checksum - a base 16 representation of the sha256 sum of the generated artifact, for example: 956eb868f38544dedcd5ef45f0ecc7cab542c68b89fae631e149008ad5cc72e8
  • format - the format in which the grammar was generated, for example: src.tar.gz
    • currently only src.tar.gz is published, which is a gzip-compressed GNU tar archive of the src/ directory generated with tree-sitter generate
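
To make the field order concrete, here is a minimal sketch that composes a registry URL from its parts and pulls the checksum back out of one. The helper names artifact_url and artifact_checksum are hypothetical and not part of TSPM. The parsing assumes that the revision, tree-sitter version, ABI version, and checksum never contain a "-".

```shell
#!/bin/sh
# Hypothetical helpers for illustration; TSPM does not ship these.

# artifact_url LANGUAGE OWNER REVISION TS_VERSION ABI CHECKSUM FORMAT
# Joins the components in the order described in the list above.
artifact_url() {
  printf 'https://pkgs.tspm.io/%s/%s/%s-%s-%s-%s-%s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# artifact_checksum URL
# Assumption: revision, version, ABI, and checksum contain no "-".
artifact_checksum() {
  name=${1##*/}      # file name: everything after the last "/"
  rest=${name#*-}    # drop the revision
  rest=${rest#*-}    # drop the tree-sitter version
  rest=${rest#*-}    # drop the ABI version
  printf '%s\n' "${rest%%-*}"   # keep the text before the next "-"
}
```

For example, artifact_url elixir elixir-lang a11a686303355a518b0a45dea7c77c5eebb5ec22 0.20.6 13 956eb868f38544dedcd5ef45f0ecc7cab542c68b89fae631e149008ad5cc72e8 src.tar.gz prints the Elixir artifact URL used in the curl example in this section.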

A listing of available artifacts can be found on https://pkgs.tspm.io (this UI will improve).

The package registry is S3 compatible: any HTTP or S3 client is capable of downloading artifacts. For example, let's download a grammar with curl, verify its integrity with sha256sum, and open it up with tar.

$ curl -o elixir.tar.gz -sSL https://pkgs.tspm.io/elixir/elixir-lang/a11a686303355a518b0a45dea7c77c5eebb5ec22-0.20.6-13-956eb868f38544dedcd5ef45f0ecc7cab542c68b89fae631e149008ad5cc72e8-src.tar.gz
$ sha256sum elixir.tar.gz
956eb868f38544dedcd5ef45f0ecc7cab542c68b89fae631e149008ad5cc72e8  elixir.tar.gz
$ mkdir src
$ tar xzf elixir.tar.gz -C src
$ tree src/
src
├── grammar.json
├── LICENSE
├── node-types.json
├── NOTICE
├── parser.c
├── scanner.cc
└── tree_sitter
    └── parser.h

1 directory, 7 files
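
The manual comparison of the sha256sum output against the checksum in the URL can also be scripted. The following is a sketch, not something TSPM provides: verify_artifact is a hypothetical helper, and it assumes GNU coreutils' sha256sum is on the PATH.

```shell
#!/bin/sh
# verify_artifact FILE EXPECTED_SHA256
# Hypothetical helper: returns 0 when the SHA-256 digest of FILE equals
# EXPECTED_SHA256 (lowercase hex), and 1 otherwise.
verify_artifact() {
  # sha256sum prints "<digest>  <file>"; keep only the digest.
  actual=$(sha256sum "$1" | cut -d ' ' -f 1)
  if [ "$actual" = "$2" ]; then
    echo "ok: $1"
  else
    echo "checksum mismatch for $1: expected $2, got $actual" >&2
    return 1
  fi
}
```

With the download above, verify_artifact elixir.tar.gz 956eb868f38544dedcd5ef45f0ecc7cab542c68b89fae631e149008ad5cc72e8 would succeed only if the archive is intact. A missing file produces an empty digest and therefore also fails the check.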

Now in our src directory we have the files generated by tree-sitter generate and any licensing files. We can build the grammar with a C/C++ compiler like so:

$ CFLAGS="-I src/ -g -O2 -fPIC -fno-exceptions"
$ c++ -c src/scanner.cc -o scanner.o $CFLAGS
$ cc -c src/parser.c -o parser.o $CFLAGS
$ c++ -shared -o elixir.so scanner.o parser.o

Now the elixir.so shared object is ready for use!
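
Not every grammar has a C++ scanner like the one above. As a rough sketch, the build steps could be generalized as follows, assuming a grammar's src/ directory contains parser.c plus at most one of scanner.cc or scanner.c. grammar_build_commands is a hypothetical helper that only prints the commands, so they can be reviewed before running. It does not handle paths containing spaces.

```shell
#!/bin/sh
# grammar_build_commands NAME SRC_DIR
# Hypothetical helper: prints (does not run) the compile and link commands
# for a grammar, choosing the compiler based on which scanner file exists.
grammar_build_commands() {
  name=$1
  src=$2
  flags="-I $src -g -O2 -fPIC"
  echo "cc -c $src/parser.c -o parser.o $flags"
  if [ -f "$src/scanner.cc" ]; then
    # C++ scanner: link with the C++ driver so the C++ runtime is pulled in.
    echo "c++ -c $src/scanner.cc -o scanner.o $flags -fno-exceptions"
    echo "c++ -shared -o $name.so parser.o scanner.o"
  elif [ -f "$src/scanner.c" ]; then
    # C scanner: the C driver is enough for compiling and linking.
    echo "cc -c $src/scanner.c -o scanner.o $flags"
    echo "cc -shared -o $name.so parser.o scanner.o"
  else
    # No external scanner: the parser is the only object file.
    echo "cc -shared -o $name.so parser.o"
  fi
}
```

Running grammar_build_commands elixir src | sh -e would execute the printed commands and stop at the first failure.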

Packaging a new grammar

Each new grammar needs an entry in grammars.nix and its versions locked in grammar-lock.json. Say we're packaging elixir-lang/tree-sitter-iex. First, we'll add a section to grammars.nix with the license:

{
  # ..
  iex.elixir-lang = tspm.grammar { meta.license = lib.licenses.asl20; };
}

Then we'll add the package to the lockfile:

$ nix run .#lock -- elixir-lang iex

Use nix flake check to verify that the grammars pass tests.

Motivation

Currently, tree-sitter grammars are distributed as git repositories, which places the burden of producing a well-made package on the grammar authors. This is problematic because:

  • grammar repositories typically contain items like documentation, queries, screenshots, tests, etc. that are unnecessary in packages but are good for the grammars themselves
  • grammar authors do not usually have any reason to update tree-sitter versions, which means generated parser files may fall behind when breaking ABI changes happen in tree-sitter

TSPM focuses on the packaging aspect, reducing the operational burden of maintaining a grammar. If TSPM becomes widely adopted by tree-sitter consumers, there may no longer be a need to commit generated files in grammar repositories at all.

Scope

TSPM's current focus is to optimize grammar packaging for Helix. A minimal goal for TSPM is to act as a package registry for grammars' src/ directories. Hosting compiled parser artifacts (.so and .dll files) is probably also within scope, but brings its own challenges (particularly around sourcing compute for less popular architectures). Packaging queries alongside their grammars is also desired, but there are no concrete implementation plans at the moment. Depending on how tree-sitter consumers intend to use TSPM, a CLI client for the registry (probably called tspm) that downloads, compiles, and cleans grammars may be in scope.

Some goals are out of scope for now:

  • semantic versioning of grammars
    • grammars tend to make breaking changes very often, so this is probably not a good idea
  • security guarantees
    • this would certainly be nice to have, but ultimately it is difficult to ensure any grammar does not execute arbitrary code - grammars could hide such things in external scanner implementations, and manual review is currently the only tool to protect against such abuses
    • the "Native Library, WASM parsers" part of tree-sitter#930 could address this
  • package download counts
    • I'd be open to this if TSPM becomes well adopted and it's not too expensive to track

Why Nix?

Nix is a tool for declarative package management. It is known for its use in large-scale package registries like nixpkgs, but is general enough to be used to write new package registries.

Technically, all packaging currently done in TSPM could be accomplished through Makefiles or shell scripts. There is some variance in how tree-sitter grammars are structured in the wild, though. Some need dependencies from local directories, submodules, or NPM. One in particular had its grammar.js written in TypeScript instead (i.e. grammar.ts) until recently. These variances are in scope for TSPM. Using Nix allows us to more easily write custom builders and plug in custom options with reasonable defaults.

Nix also sets up network and file-system sandboxing during builds, which is necessary when packaging tree-sitter grammars because a grammar.js may contain arbitrary code.

Infrastructure

Currently, TSPM uses the following infrastructure:

  • GitHub Actions - for automations like building grammars and wasm bindings
  • GitHub Pages - for hosting the playground
  • DigitalOcean Spaces - hosts artifacts and provides edge caching

Costs

The current monthly cost of TSPM is estimated at 5 USD (the base cost of a DigitalOcean Space).

The current pricing model for Spaces:

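| Included storage | Included outbound transfer | Additional storage | Additional transfer | Base price (USD/month) |
| ---------------- | -------------------------- | ------------------ | ------------------- | ---------------------- |
| 250 GB           | 1 TB                       | $0.02/GB           | $0.01/GB            | $5.00                  |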

Current space usage: 21MB.
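
As a worked example of the pricing above, the sketch below estimates a monthly bill from stored and transferred gigabytes. monthly_cost is a hypothetical helper. It assumes that 1 TB of included transfer means 1000 GB and that overage is billed linearly per GB, neither of which is stated in this document.

```shell
#!/bin/sh
# monthly_cost STORED_GB TRANSFERRED_GB
# Hypothetical helper: base price plus overage beyond the included
# 250 GB of storage and 1 TB (assumed to be 1000 GB) of outbound transfer.
monthly_cost() {
  awk -v stored="$1" -v transferred="$2" 'BEGIN {
    base = 5.00
    extra_storage = stored - 250
    if (extra_storage < 0) extra_storage = 0
    extra_transfer = transferred - 1000
    if (extra_transfer < 0) extra_transfer = 0
    # $0.02 per additional GB stored, $0.01 per additional GB transferred
    printf "%.2f\n", base + extra_storage * 0.02 + extra_transfer * 0.01
  }'
}
```

At the current usage of 21MB, monthly_cost 0.021 0 prints 5.00. A hypothetical 300 GB stored and 1500 GB transferred would come to 5 + 50 × 0.02 + 500 × 0.01 = 11.00.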

Other docs

Looking for more docs?

TODO

  • Use CI runners with more resources to accommodate more grammars
  • Allow locking NPM dependencies via the grammar-lock.json file
  • Add an app that writes a JSON index of all packaged grammars
    • Write that index to the gh-pages branch and use jq to accumulate packages as they are generated
  • Write a landing page for TSPM which allows one to search through the index and copy pkgs.tspm.io links

License

TSPM is licensed under the MPL-2.0. See the LICENSE for details.