lezer-parser/lezer

Question: How to query "syntax tags" for a given Node/NodeType?

joshgoebel opened this issue · 19 comments

I'm trying to use the library at a fairly low (or mid) level... I don't think lezer/highlightTree is what I want...

I'm just trying to iterate over a parsed tree (I'm using https://github.com/lezer-parser/javascript)... and what I want is the content as well as the highlight specific tag type for each node... it seems one can walk the tree carefully with cursors (and keep track of indexes into the content), but I'm still left with how to get the highlight tag of a given Node... for example when walking the tree a template string in JS is the following type:

NodeType {
  name: 'TemplateString',
  props: {
    '2': [ 'Expression' ],
    '7': Rule { tags: [Array], mode: 2, context: null, next: undefined }
  },
  id: 41,
  flags: 0
}

But from the lezer/javascript source I find that TemplateString maps to t.special(t.string):

export const jsHighlight = styleTags({
  TemplateString: t.special(t.string),

And jsHighlight is registered with the parser as one of its propSources...

So, what I want to know is: given a Node or NodeType, how can I get access to t.special(t.string) so I can do comparisons at the lezer/highlight meta tag layer rather than dealing with the idiosyncrasies of every grammar's own internal names?

From reading further (highlightRange) I think I minimally need:

// get the correct prop with tag info
let rule = type.prop(ruleNodeProp)
// loop over rules via `rule.next`, checking each rule's context
cursor.matchContext(rule.context)
// use highlightTags for its convenient mapping functionality

And ruleNodeProp is where the magic 7 is hidden, but that constant isn't exported... so I'm not sure how I'm supposed to access this data - or if I'm simply not supposed to.

Rules and rule-matching is currently internal to @lezer/highlight. What are you looking to do with the tags?

I need to know the tags/classification for a given section of code so I know what to output.

I'm attempting to use Lezer grammars as the parser/tokenizer component and then plug that analysis back into our highlighting output pipeline.

highlightjs/highlight.js#3623

There is not really a simple function from a node to a tag, since tagging rules support opacity and inheritance to child nodes, determining the styling for a given node involves iterating over the tree, keeping around inherited tags and stopping iteration on opaque nodes. What are you trying to output?
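To make the inheritance and opacity behaviour described above concrete, here is a simplified model of that traversal in plain JavaScript. It is only a sketch: the node shape (`from`, `to`, `rule`, `children`) and the name `collectSpans` are hypothetical stand-ins, not the @lezer/highlight internals, and real code would walk a Lezer tree with a cursor instead of nested objects.

```javascript
// Simplified model: each node may carry a rule {tags, opaque, inherit}.
// Returns flat [from, to, tags] spans, roughly as a highlighter would emit them.
function collectSpans(node, inherited = [], out = []) {
  const own = node.rule ? node.rule.tags : []
  const active = inherited.concat(own)
  // Tags from an `inherit` rule keep applying to descendants, even styled ones.
  const passDown = node.rule && node.rule.inherit ? active : inherited
  // An opaque rule stops descent: the whole node is styled as one unit.
  const children = node.rule && node.rule.opaque ? [] : node.children || []
  let pos = node.from
  const emit = (to) => {
    if (to > pos && active.length) out.push([pos, to, active])
    pos = Math.max(pos, to)
  }
  for (const child of children) {
    emit(child.from)                 // this node's own text before the child
    collectSpans(child, passDown, out)
    pos = child.to                   // the child handled its own range
  }
  emit(node.to)                      // this node's trailing text
  return out
}
```

In this model a node with no rule and nothing inherited produces no span at all, which is why a per-node lookup cannot reproduce the final styling without also tracking what the ancestors passed down.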

support opacity and inheritance ... styling for a given node ... inherited tags ... opaque nodes.

Our engine will handle as much (or as little of that) as seems necessary. (or more likely CSS will)

What are you trying to output?

Eventually HTML, but first a series of calls that conform to the Highlight.js emitter protocol. https://github.com/highlightjs/highlight.js/blob/aa3d329b4fac3c7c04fb7ed51300a1c795295d9a/docs/mode-reference.rst#__emittokens

There is not really a simple function

Actually to me it seems a bit more a matter of privacy unless I'm missing something. Here is what seems to work really well (inside of the javascript package): a combo of tagHighlighter and styleTags (jsHighlight):

import {tags, tagHighlighter} from "@lezer/highlight"

const hljsScoper = tagHighlighter([
  {tag: tags.string, class: "string"},
  {tag: tags.variableName, class: "variable"},
])

const getScope = (type) => {
  let high = jsHighlight(type)
  if (high) {
    let rule = high[1]
    let res = hljsScoper.style(rule.tags)
    return res
  }
}

My problem is that once I'm on the "outside" that jsHighlight isn't exported... could it be? Or is there some way to reach into the configured deserialized parser and pull it back out?

I'm also happy to write the code by hand (now that I think I understand the structure of things), but I'd need access to const ruleNodeProp so I know which keys to access.

Ideas:

  • Export all styleTags from grammars (high-level)
  • Export ruleNodeProp from @lezer/highlight (low-level)
  • Add some type of registry to the prop system: NodeProp.find_instance("Rule") (low-level)

All I'm wanting (basically) is useful programmatic access to the jsHighlight (type name => tag) tables... because the "raw" type names are just too hard to work with - I want the next level of abstraction - but still with high fidelity. (the CSS highlighting stuff is far too low fidelity).

Eventually HTML, but first a series of calls that conform to the Highlight.js emitter protocol.

That seems entirely doable from a highlightTree callback. You could use a simple highlighter that assigns predictable classes, and use those as scopes to pass to the emitter. Or am I missing something?

putStyle sounds like a pretty strange API (for my use case) and like I'd need to do lots of extra work for nesting, etc... The API our emitter supports is:

  • text
  • scope
  • endscope

I.e., the calls might look something like:

`hello ${name}`
scope "string.template"
  text "`hello "
  scope "interpolate"
    text "${"
    scope "variable"
      text "name"
    endscope
    text "}"
  endscope
  text "`"
endscope
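For readers unfamiliar with this kind of protocol, a minimal receiver for those three calls could look like the sketch below. `HtmlEmitter` and `escapeHtml` are hypothetical names invented for illustration; this is not the Highlight.js emitter, only a small model of the scope/text/endscope calls listed above, with the scope name copied verbatim into the class attribute.

```javascript
// Hypothetical emitter: turns scope/text/endscope calls into nested <span>s.
class HtmlEmitter {
  constructor() {
    this.out = ""
    this.depth = 0          // number of currently open scopes
  }
  scope(name) {
    this.out += `<span class="${escapeHtml(name)}">`
    this.depth++
  }
  text(str) {
    this.out += escapeHtml(str)
  }
  endscope() {
    if (this.depth === 0) throw new Error("endscope without matching scope")
    this.out += "</span>"
    this.depth--
  }
  toHTML() {
    if (this.depth !== 0) throw new Error("unclosed scope")
    return this.out
  }
}

// Escape the characters that are unsafe in HTML text and attribute values.
function escapeHtml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;")
    .replace(/>/g, "&gt;").replace(/"/g, "&quot;")
}
```

The point of the design is that nesting lives entirely in the receiver: the caller only reports when a scope opens and closes, and never needs text offsets.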

I.e., a tree walker firing open (and close) events would be perfect... The emitter and output engine want to deal with scoping/nesting of scopes in their own way... we don't want to delegate that. highlightTree doesn't sound like that at all at first glance... And I worry it's already mixing classes in who knows what ways since it says "space separated string of classes"... all I really want is to walk the tree and get the associated tags for each node (at a low-level).

Am I missing something?

On my end I'm not working with a string buffer so positions in the text make no sense... I just need real-time events as the tree is walked.

I personally feel it's a mistake to think of the styleTags as being purely for styling rather than for "classification"... The docs seem to acknowledge this by calling them "syntax tags":

CodeMirror uses a mostly closed vocabulary of syntax tags (as opposed to traditional open string-based systems, which make it hard for highlighting themes to cover all the tokens produced by the various languages).

To me it feels like tagging syntax is NOT purely related to highlighting per se... one could do all sorts of things with proper syntax tags (code analysis, etc)...

Highlighting is just a single use case of classification. Feels like there should be some way to get at this information a bit more directly. And to be clear, classification is what we'd love to consider using Lezer for, because its hybrid parsing engine seems pretty incredible at that (far better than our rough regex matching). Yes, ultimately for highlighting, but the classification comes first.

Right now it seems I could:

Now that I understand (most?) of the moving pieces that's not a ton of effort, but the hoops I have to jump through just to be able to query/match the tag information feel really rough. I feel like this would be nice:

node.type.syntax_tags
// or a helper
syntax_tags_for(node.type)

On my end I'm not working with a string buffer so positions in the text make no sense...

That's odd. Positions are all you have in a Lezer tree, and if you don't have access to the document, I don't see how you could emit tokens.

Styled spans are emitted in order, so you can just keep a 'last position' and create unstyled tokens if a token comes in that starts beyond that position.
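The 'last position' bookkeeping described above can be sketched in a few lines. This assumes the styled spans have already been collected, in order, as `[from, to, classes]` triples (for instance by a `putStyle`-style callback); the function name `fillGaps` and the token shape are hypothetical.

```javascript
// Turn flat, ordered styled spans into a gap-free token list.
// `spans` is [[from, to, classes], ...], sorted by position and non-overlapping.
function fillGaps(text, spans) {
  const tokens = []
  let last = 0                         // end of the previous token
  for (const [from, to, classes] of spans) {
    // Anything between the previous span and this one is unstyled text.
    if (from > last) tokens.push({text: text.slice(last, from), classes: null})
    tokens.push({text: text.slice(from, to), classes})
    last = to
  }
  // Trailing unstyled text after the final span.
  if (last < text.length) tokens.push({text: text.slice(last), classes: null})
  return tokens
}
```

Concatenating the `text` fields of the result gives back the original document, so every character ends up in exactly one token.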

one could do all sorts of things with proper syntax tags (code analysis, etc)...

Unfortunately, no, building a tagging system that could do that kind of thing is way out of scope for @lezer/highlight. Lezer does provide a generic way to associate values with node types, and you could build something like that on top of that, but highlighting tags are not remotely expressive enough to do much more than highlighting.

Are you sure you've looked deeply enough into what @lezer/highlight actually is, rather than seeing what you want it to be? I'm not super familiar with highlightjs scopes, but I don't think they map directly to highlight tags.

If you are sure that an accessor from a SyntaxNodeRef to a matched rule (an array of tags, an 'opaque' flag for rules that stop further styling inside the node, and an 'inherit' flag for rules that should affect the styling of child nodes, even when those are styled) is what you need, please experiment with such a thing, and, if it turns out that really works for you, I can add it as an export.

Styled spans are emitted in order, so you can just keep a 'last position' and create unstyled tokens if a token comes in that starts beyond that position.

Ok, you talked me into giving highlightTree a shot. :-)

It seems all the "depth" information is lost... for example let's take the prior simple example:

`hello ${name}`

With a very simple debugging putStyle:

const putStyle = (from, to, classes) => {
  console.log(from,to,classes)
}

const hljsScoper = tagHighlighter([
  {tag: tags.string, class: "string"},
  {tag: tags.variableName, class: "variable"},
  {tag: tags.special(tags.brace), class: "brace.special"},
  {tag: tags.meta, class: "subst"},
])

Result:

0 1 variable
4 13 string
13 15 brace.special
15 19 variable
19 20 brace.special
20 21 string

It's entirely flattened... I need to know that the string wraps position 4-21... those other things happen inside a string... also, despite adding a mapping for "Interpolation" => t.meta (yes, not quite right) and then a style mapping, that seems to be dropped entirely. I assume because there is no direct (non-descendant) content?

Perhaps if "modes" are configured differently the behavior can be changed, but this is what I got with just a simple test.

Is it not possible to inject extra styles/props at runtime? I'm trying:

export const extraStyles = styleTags({
  "Interpolation": tags.meta,
})

const parser = coreparser.configure({props: [jsHighlight, extraStyles]})
const data = "a = `this is ${test}`  "
let tree = parser.parse(data)

But the extra props don't seem to be used during subsequent parsing.

It's entirely flattened...

It is. This is a highlighting algorithm emitting a flat list of tokens, not a nested hierarchy. Again, you may be seeing something in highlight tags that they are not. They just mark a given node's content as needing to be styled in a given way, excluding child nodes that have their own styling information.

Is it not possible to inject extra styles/props at runtime?

It is possible, and I don't see anything wrong about the code you pasted (assuming you used the parser produced by the call to configure to parse the document).

It is possible, and I don't see anything wrong about the code you pasted (assuming you used the parser produced by the call to configure to parse the document).

I am (added above to clarify). Should the props added in configure replace existing props or supplement them?

update: Ok, figured that out. My scope calc function was still using a single one of the getStyles directly vs a more generic solution.

Ok, here is what I think I'm going to recommend:

import {Tag, tags, styleTags, tagHighlighter} from "@lezer/highlight"
import {parser as JSparser} from "@lezer/javascript"

// we create some new custom tags
const interpolationTag = Tag.define()

// supplement the tagging and reconfigure the parser
export const extraStyles = styleTags({
  "Interpolation": interpolationTag,
})
let parser = JSparser.configure({props: [extraStyles]})

// use a tagHighlighter for final level of mappings
const hljsScoper = tagHighlighter([
  {tag: tags.string, class: "string"},
  {tag: tags.variableName, class: "variable"},
  {tag: interpolationTag, class: "subst"},
])

// sniff for the Rule prop manually
// (of course this could be optimized)
const getScope = (type) => {
  // each entry is a [propId, value] pair; keep the one whose value is a Rule
  const entry = Object.entries(type.props)
    .find(([_, v]) => v.constructor?.name === "Rule")
  if (entry) {
    const rule = entry[1]
    return hljsScoper.style(rule.tags)
  }
}

// and of course a custom render function that walks the tree
// using a cursor and then talks to our API

So I guess my only request would be if we could export RULE_PROP or whatever you wanted to name it from @lezer/highlight... that's the only really huge hack I see in the above, I think.

Well, maybe also [in terms of hacks] that we're extending the tag space, but as long as we were the only one using those new tags I'm not sure how they would interfere with normal operations?

Exporting ruleNodeProp would also require exporting the Rule type. Would a function with a signature like this (which also does context matching) work for you?

export function matchRule(node: SyntaxNodeRef): {tags: readonly Tag[], opaque: boolean, inherit: boolean} | null

So I'd call:

let tagData = matchRule(node)
if (tagData)
  scope = hljsScoper.style(tagData.tags)

Sure, I think that sounds reasonable!

Done in attached patch, except that I went with the name getStyleTags instead.
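Putting the pieces of this thread together, a walker that fires scope/text/endscope events might be sketched as follows. The function name `emitTree` and the emitter object are hypothetical; `getStyleTags` and the highlighter are passed in as parameters so the sketch has no imports, and it deliberately ignores the `opaque` and `inherit` flags, leaving nesting decisions to the receiver. It relies only on the tree's `iterate({enter, leave})` method and on the highlighter's `style(tags)` method.

```javascript
// Walk a tree and fire scope/text/endscope events on `emitter`.
// Assumed wiring with the real packages (not executed here):
//   import {getStyleTags, tagHighlighter, tags} from "@lezer/highlight"
//   import {parser} from "@lezer/javascript"
//   emitTree(parser.parse(code), code, getStyleTags, hljsScoper, emitter)
function emitTree(tree, code, getStyleTags, highlighter, emitter) {
  let pos = 0                        // how much of `code` has been emitted
  const open = []                    // per entered node: its scope name or null
  const flush = (upTo) => {
    if (upTo > pos) {
      emitter.text(code.slice(pos, upTo))
      pos = upTo
    }
  }
  tree.iterate({
    enter(node) {
      flush(node.from)               // text that precedes this node
      const rule = getStyleTags(node)
      const scope = rule ? highlighter.style(rule.tags) : null
      if (scope) emitter.scope(scope)
      open.push(scope)
    },
    leave(node) {
      flush(node.to)                 // text inside this node not covered by children
      if (open.pop()) emitter.endscope()
    },
  })
  flush(code.length)                 // anything after the last node
}
```

Nodes without a matching rule, or whose tags the highlighter does not map, open no scope, so their text is attributed to the nearest enclosing scope.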