tree-sitter-types-builder

This tool generates the .type of every SyntaxNode that can appear in a tree-sitter grammar, as string literals. Even for small languages, the number of SyntaxNode types can be quite large (well into the hundreds of definitions). Many of the definitions turn out to be redundant once tree-sitter has analyzed the code, but it is much easier to delete unneeded types from a complete list than to discover by hand which types you will need.

Usage/Installation

Global Installation
  1. Install the package globally (using your preferred package manager)

    # npm installation
    npm i -g tree-sitter-types-builder 
    
    # yarn installation
    yarn global add tree-sitter-types-builder
    
    # pnpm installation
    pnpm add --global tree-sitter-types-builder
  2. Use the tree-sitter-types-builder command where needed

    # in some project with a wasm file
    tree-sitter-types-builder --wasm path/to/your.wasm --language your_language --output path/to/your/types.ts 
Local Project Installation

Note: requires web-tree-sitter and tree-sitter-cli.

  1. Install the package inside your project

    pnpm install --save-dev tree-sitter-types-builder
  2. Build a wasm file

    # for example, to build a wasm file for the bash language
    npx tree-sitter build-wasm ./tree-sitter-bash

    This will create a tree-sitter-bash.wasm file in the tree-sitter-bash directory

    # for newer tree-sitter-cli versions
    npx tree-sitter build --wasm ./tree-sitter-bash
  3. Run the command for your language

    npx tree-sitter-types-builder --wasm path/to/your.wasm --language your_language --output path/to/your/types.ts

    Edit the generated types to fit your needs.

Example (TS) | Introduction

The recommended ✔️ example below assumes that you have already compiled a wasm file for your language and have generated the types. It also assumes that you are using web-tree-sitter to parse your code. Once you have completed these steps, you can use the generated types to build any features for your language.

import { SyntaxNode } from 'web-tree-sitter';
import { LangNodeType } from './types' // generated by tree-sitter-types-builder

// 1.) initialize parser for a language
// 2.) parse some code to get the Tree of SyntaxNodes from web-tree-sitter
// 3.) build features, by selecting nodes of interest using the generated LangNodeType

function findChildOfType(rootNode: SyntaxNode, type: LangNodeType): SyntaxNode | null {
  if (rootNode.type === type) return rootNode;
  for (const child of rootNode.children) {
    const found = findChildOfType(child, type);
    if (found) return found;
  }
  return null;
}

// rootNode is assumed to be the tree.rootNode obtained in step 2.
// now you get auto-completion for LangNodeType.FunctionDeclaration
// and avoid passing incorrect strings to the function
findChildOfType(rootNode, LangNodeType.FunctionDeclaration);

This process automates potentially error-prone manual work and makes the code more robust, more readable, and easier to maintain. When a tree-sitter-{lang} maintainer updates their grammar, users who regenerate the types see renamed or removed node types as compile-time errors instead of silent failures at runtime.
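
To try the idea without a parser on hand, here is a minimal self-contained sketch. It rests on two assumptions that are not part of the tool: a hypothetical two-entry stand-in for the generated LangNodeType (a plain const object, not the tool's actual output) and a simplified MiniNode interface in place of web-tree-sitter's SyntaxNode.

```typescript
// Hypothetical stand-in for the generated file (assumption: two node types only).
const LangNodeType = {
  FunctionDeclaration: 'function_declaration',
  Block: 'block',
} as const;
type LangNodeType = (typeof LangNodeType)[keyof typeof LangNodeType];

// Simplified stand-in for web-tree-sitter's SyntaxNode (assumption).
interface MiniNode {
  type: string;
  children: MiniNode[];
}

// Depth-first search for the first node whose type matches.
function findChildOfType(rootNode: MiniNode, type: LangNodeType): MiniNode | null {
  if (rootNode.type === type) return rootNode;
  for (const child of rootNode.children) {
    const found = findChildOfType(child, type);
    if (found) return found;
  }
  return null;
}

// Hand-built tree standing in for a parsed program.
const sampleTree: MiniNode = {
  type: 'program',
  children: [
    { type: 'function_declaration', children: [{ type: 'block', children: [] }] },
  ],
};

const foundFunction = findChildOfType(sampleTree, LangNodeType.FunctionDeclaration);

// A misspelled literal is rejected by the type checker:
// @ts-expect-error 'function_declaratoin' is not a member of LangNodeType
const foundTypo = findChildOfType(sampleTree, 'function_declaratoin');
```

The typed call finds the function node, while the misspelled string fails type-checking before the code ever runs.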

The unrecommended ❌ way of using tree-sitter, without the generated types, is shown below:

A brief outline showing how quickly the tree-sitter API demands exact knowledge of each type's name and context:

import { SyntaxNode } from 'web-tree-sitter';

function findChildOfType(rootNode: SyntaxNode, type: string): SyntaxNode | null {
  if (rootNode.type === type) return rootNode;
  for (const child of rootNode.children) {
    const found = findChildOfType(child, type);
    if (found) return found;
  }
  return null;
}

// now, the user must pass the exact string into the findChildOfType function
// and will not be able to get auto-completion for the type of node they are looking for.
findChildOfType(rootNode, 'function_declaration');

// Furthermore, consider implementing features that require multiple types of
// nodes to be selected. The context of the code will be much harder to understand
// and properly deduce.
function findUnreachableCode(rootNode: SyntaxNode): SyntaxNode | null {
  const functionNode = findChildOfType(rootNode, 'function');
  if (!functionNode) return null;
  const blockNode = findChildOfType(functionNode, 'block');
  if (!blockNode) return null;
  const returnNode = findChildOfType(blockNode, 'return_statement');
  // check whether returnNode has siblings after it, within the current
  // block scope
  return returnNode;
}

Did you catch the potential bug in the above code? Depending on the language, a function node might contain nothing other than the identifier for the function name (common in shell languages). The block node might likewise be just the keyword of the block scope.
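
One way to make such a lookup less fragile is to search per block instead of chaining three first-match lookups. The sketch below assumes a grammar that actually has block and return_statement nodes, and uses a hypothetical LangNodeType stand-in and a simplified MiniNode interface (neither is the tool's real output). MiniNode.children is meant as named children only; in web-tree-sitter, namedChildren would be the closer analogue, since children also includes anonymous tokens such as braces.

```typescript
// Hypothetical stand-in for the generated types (assumption).
const LangNodeType = {
  Block: 'block',
  ReturnStatement: 'return_statement',
} as const;
type LangNodeType = (typeof LangNodeType)[keyof typeof LangNodeType];

// Simplified stand-in for SyntaxNode; children holds named nodes only (assumption).
interface MiniNode {
  type: string;
  children: MiniNode[];
}

// Collect every node of the given type, parents before children.
function findAllOfType(root: MiniNode, type: LangNodeType, out: MiniNode[] = []): MiniNode[] {
  if (root.type === type) out.push(root);
  for (const child of root.children) findAllOfType(child, type, out);
  return out;
}

// For each block, collect the direct children that come after its first return_statement.
function findUnreachableCode(root: MiniNode): MiniNode[] {
  const unreachable: MiniNode[] = [];
  for (const block of findAllOfType(root, LangNodeType.Block)) {
    const returnIndex = block.children.findIndex(
      (child) => child.type === LangNodeType.ReturnStatement,
    );
    if (returnIndex !== -1) unreachable.push(...block.children.slice(returnIndex + 1));
  }
  return unreachable;
}

// Hand-built tree: a function whose block has a statement after the return.
const functionBody: MiniNode = {
  type: 'block',
  children: [
    { type: 'return_statement', children: [] },
    { type: 'expression_statement', children: [] },
  ],
};
const sampleProgram: MiniNode = {
  type: 'program',
  children: [{ type: 'function_declaration', children: [functionBody] }],
};

const deadNodes = findUnreachableCode(sampleProgram);
```

Because both node names come from LangNodeType, a grammar that names these nodes differently would show up as a type error when the types are regenerated.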


How do the generated types help? (ADVANCED COMPARISON)

Auto-completion/Intellisense/GoTo-References

Using this package gives you language features project-wide. This is useful when adding features later, especially if they need implementations or node types similar to those of features you have already completed. You can run a goto-references request on a LangNodeType to see every place where that specific node type has been used.

  • The tree-sitter API uses a wide type definition: SyntaxNode.type is a plain string
  • The generated type definitions provide a string literal for each type of node
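
The difference between the two bullets can be sketched with a type guard that narrows a plain string to the literal union. The LangNodeType object here is a hypothetical three-entry stand-in for the generated file, and isLangNodeType and describeType are illustrative helpers, not functions the tool provides.

```typescript
// Hypothetical stand-in for the generated types (assumption).
const LangNodeType = {
  FunctionDeclaration: 'function_declaration',
  Block: 'block',
  Comment: 'comment',
} as const;
type LangNodeType = (typeof LangNodeType)[keyof typeof LangNodeType];

// Runtime set of every known name, used to narrow a wide string.
const knownTypes = new Set<string>(Object.values(LangNodeType));

// Illustrative type guard: string in, literal union out.
function isLangNodeType(value: string): value is LangNodeType {
  return knownTypes.has(value);
}

function describeType(nodeType: string): string {
  if (isLangNodeType(nodeType)) {
    // nodeType is narrowed to 'function_declaration' | 'block' | 'comment' here
    return `known: ${nodeType}`;
  }
  return `unknown: ${nodeType}`;
}
```

A guard like this gives one place to convert the wide string coming from the parser into the narrow type the rest of the project uses.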

Extensibility & Ambiguity

Context-wise, you can also extend the types generated by the tool to include additional type-narrowing. For example, allowing only a specific set of nodes to be searched for is much clearer when defined as a single new type definition.

✔️ Generated Usage ✔️
export type BlockScopeNode = LangNodeType.Block | LangNodeType.FunctionDeclaration | LangNodeType.IfStatement | LangNodeType.WhileStatement;
❌ Non-Generated Usage ❌
// no auto-completion for the types of nodes that can be used
// no reference to where the type is used (for block_statement, function_declaration, if_statement, while_statement)
export type BlockScopeNode = 'block' | 'function_declaration' | 'if_statement' | 'while_statement'

// if another type-narrowing intends to use an overlapping type, the tree-sitter
// API can easily hide the use of the wrong string for the type
export type StatementScope = 'block_statement' | 'if_statement' | 'while_statement' | 'for_statement'
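
A narrowed type such as BlockScopeNode can also be paired with a runtime check. The following is a sketch under assumptions: LangNodeType is a hypothetical const-object stand-in for the generated file (so the union members are written with typeof), and isBlockScopeType is an illustrative helper, not part of the tool.

```typescript
// Hypothetical stand-in for the generated types (assumption).
const LangNodeType = {
  Block: 'block',
  FunctionDeclaration: 'function_declaration',
  IfStatement: 'if_statement',
  WhileStatement: 'while_statement',
  ReturnStatement: 'return_statement',
} as const;
type LangNodeType = (typeof LangNodeType)[keyof typeof LangNodeType];

// Narrower union built from the generated constants.
type BlockScopeNode =
  | typeof LangNodeType.Block
  | typeof LangNodeType.FunctionDeclaration
  | typeof LangNodeType.IfStatement
  | typeof LangNodeType.WhileStatement;

// The same members as runtime values; the annotation rejects anything outside the union.
const blockScopeTypes: readonly BlockScopeNode[] = [
  LangNodeType.Block,
  LangNodeType.FunctionDeclaration,
  LangNodeType.IfStatement,
  LangNodeType.WhileStatement,
];

// Illustrative guard: true only for node types that open a block scope.
function isBlockScopeType(type: string): type is BlockScopeNode {
  return (blockScopeTypes as readonly string[]).includes(type);
}
```

With this in place, a stray 'block_statement' is simply not a block scope type, and adding it to the array without adding it to the union would not compile.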

Easy Testability & Maintainability

Allows the intended types of nodes to be selected and tested before new maintainers approach the code. Consider the following example, where you are comparing two nodes that might correspond to similar string values (this could be different forms of whitespace, comments, or even something like block vs block-scope).

Example Test File
import { SyntaxNode } from 'web-tree-sitter';
import { LangNodeType } from './types';

function nodeMatchesType(node: SyntaxNode, type: LangNodeType): boolean {
  return node.type === type;
}

// two node types whose names are easy to confuse
const nodeA = LangNodeType.block;
const nodeB = LangNodeType.blockScope;

// collects rootNode and all of its descendants, parents before children
function getInOrderNodes(rootNode: SyntaxNode, collectedNodes: SyntaxNode[] = []): SyntaxNode[] {
  collectedNodes.push(rootNode);
  for (const child of rootNode.children) {
    if (child) getInOrderNodes(child, collectedNodes);
  }
  return collectedNodes;
}

// rootNode is assumed to be the rootNode of a previously parsed tree
for (const node of getInOrderNodes(rootNode)) {
  if (nodeMatchesType(node, nodeA)) {
    // do something with nodeA
  } else if (nodeMatchesType(node, nodeB)) {
    // do something with nodeB
  }
}

// can also use the namespace getKeys() function to iterate over all the types
LangNodeType.getKeys().forEach((key) => {
  // LangNodeType[key] is a type name (a string), not a SyntaxNode
  const type = LangNodeType[key];
  if (type === nodeA) {
    // do something with nodeA
  } else if (type === nodeB) {
    // do something with nodeB
  }
});
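
One concrete check such a test file could include is that no two keys collapse to the same string value. This is only a sketch: the three-entry LangNodeType object is a hypothetical stand-in for the generated file, and findDuplicateValues is an illustrative helper rather than something the tool ships.

```typescript
// Hypothetical stand-in for the generated types (assumption).
const LangNodeType = {
  Block: 'block',
  BlockScope: 'block_scope',
  Comment: 'comment',
} as const;

// Returns the keys whose string value was already used by an earlier key.
function findDuplicateValues(types: Record<string, string>): string[] {
  const firstKeyByValue = new Map<string, string>();
  const duplicates: string[] = [];
  for (const [key, value] of Object.entries(types)) {
    if (firstKeyByValue.has(value)) duplicates.push(key);
    else firstKeyByValue.set(value, key);
  }
  return duplicates;
}

const duplicateKeys = findDuplicateValues(LangNodeType);
```

An empty result means similar-looking names such as Block and BlockScope still map to distinct strings.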

Maintainability is the core reason this tool was created. In a project where I used tree-sitter to parse a language and did not separately define the types of nodes, the failure to separate the tree-sitter-wasm API from the rest of the code was a major issue. Refactoring a large project without statically defined SyntaxNode types becomes increasingly difficult as the project grows.

Consistency

This file can be used to check for equivalent type conversions across different APIs. This is an important feature for projects that might grow very large. Keeping the relevant types in a location that can be easily navigated to is a good practice for any project.
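
As one possible illustration of such a conversion, a Record keyed by the generated type forces every node type to have a counterpart in the other API. Everything named here is an assumption for the sake of the sketch: the three-entry LangNodeType stand-in, the OutlineKind type, and the toOutlineKind table are all hypothetical.

```typescript
// Hypothetical stand-in for the generated types (assumption).
const LangNodeType = {
  FunctionDeclaration: 'function_declaration',
  VariableDeclaration: 'variable_declaration',
  ClassDeclaration: 'class_declaration',
} as const;
type LangNodeType = (typeof LangNodeType)[keyof typeof LangNodeType];

// Hypothetical kinds used by some other API (illustrative names).
type OutlineKind = 'function' | 'variable' | 'class';

// Record<LangNodeType, ...> requires one entry per generated type;
// leaving one out is a compile-time error.
const toOutlineKind: Record<LangNodeType, OutlineKind> = {
  [LangNodeType.FunctionDeclaration]: 'function',
  [LangNodeType.VariableDeclaration]: 'variable',
  [LangNodeType.ClassDeclaration]: 'class',
};

function convertNodeType(nodeType: LangNodeType): OutlineKind {
  return toOutlineKind[nodeType];
}
```

Keeping the table next to the generated types means a grammar update that adds a node type immediately flags the missing conversion.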

Further Reading

The syntax generated by this tool is based on the type definitions in the language server protocol and exploits the type system's ability to extend types with additional properties/functions (through the use of a namespace). This allows the type definitions to be more expressive by allowing them to be iterated over, while keeping their ability to be statically referenced.

The specific type definitions use a string literal to represent the type of SyntaxNode that is being referenced. Not only does this help abstract the tree-sitter API from the user, but it also makes the type definitions more expressive by displaying all of them in a single place.
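
The namespace-plus-type pattern described above can be sketched as follows. This is a hand-written illustration in the style of the language server protocol's type definitions, not the tool's actual output: the three node names are hypothetical, and the real generated file may differ in naming and layout. Only getKeys() is taken from the examples in this README.

```typescript
// String literal union: one member per node type (hypothetical names).
type LangNodeType = 'function_declaration' | 'block' | 'return_statement';

// A namespace with the same name adds constants and helpers to the type.
namespace LangNodeType {
  export const FunctionDeclaration = 'function_declaration';
  export const Block = 'block';
  export const ReturnStatement = 'return_statement';

  export type Key = 'FunctionDeclaration' | 'Block' | 'ReturnStatement';

  // Lets callers iterate over every defined type.
  export function getKeys(): Key[] {
    return ['FunctionDeclaration', 'Block', 'ReturnStatement'];
  }
}

// Static reference and iteration both work against the same name.
const oneType: LangNodeType = LangNodeType.Block;
const allTypes: LangNodeType[] = LangNodeType.getKeys().map((key) => LangNodeType[key]);
```

The type alias and the namespace share one name, so LangNodeType works both in type positions and as a value with members.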

This would be especially useful for developers who are just beginning a project that uses the tree-sitter API. They can now easily see all the types that are available to them, and can easily determine which types they need to use. Properly defining the set of SyntaxNode types relevant to the features of the project is a much clearer method than relying on the very wide type definition it corresponds to in a tree-sitter parser.

Conclusion

This project aims to provide a clear and testable method for building a rich set of language features from a tree-sitter grammar. It can also be helpful to keep this tool on hand to check for name changes across releases of a language's grammar.
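
Checking for name changes across releases can be as simple as comparing two lists of node type names. The sketch below assumes you have the names from two generated files available as string arrays; diffNodeTypes and the sample names are hypothetical and not part of the tool.

```typescript
interface NodeTypeDiff {
  added: string[];
  removed: string[];
}

// Compare the node type names of two grammar releases.
function diffNodeTypes(previous: readonly string[], current: readonly string[]): NodeTypeDiff {
  const previousSet = new Set(previous);
  const currentSet = new Set(current);
  return {
    added: current.filter((t) => !previousSet.has(t)),
    removed: previous.filter((t) => !currentSet.has(t)),
  };
}

// Hypothetical node names from two releases of a grammar.
const releaseOne = ['function_definition', 'block', 'comment'];
const releaseTwo = ['function_declaration', 'block', 'comment'];

const changes = diffNodeTypes(releaseOne, releaseTwo);
// changes.added   → ['function_declaration']
// changes.removed → ['function_definition']
```

A name that appears in removed alongside a similar one in added is a likely rename and a good place to start updating code.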

License

MIT