sweet-js/sweet-core

Syntax and Terms - this isn't even my final form!

Opened this issue · 17 comments

Sweet internals geekery.

In #602 I tried to unify our syntax nesting structures by adding RawSyntax and RawDelimiter to the Term subclasses and changing Syntax to never wrap a delimiter.

This was good because we didn't really need two trees hanging around but the current state of things is still messy and led to bugs like #625.

I think the right solution is to just completely remove Syntax. The only real job it's doing anymore is holding on to the lexical context (scopesets) which can just as easily be done on Terms.

So the proposal is to:

  • remove Syntax
  • add context on all Term objects
  • add addScope and removeScope on Terms
  • add from* methods correctly on Terms
  • change Syntax::resolve to be a standalone function that works over IdentifierExpression and BindingIdentifier terms
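The shape of that proposal could look roughly like the sketch below. All names here are hypothetical toys (the real Term classes live in sweet-core and carry much more); it only illustrates scope sets living on Term and resolve as a standalone function over identifier-like terms.

```javascript
// Hypothetical sketch: lexical context (a scope set) lives on Term
// itself, and resolve is a free function rather than a Syntax method.
class Term {
  constructor(type, fields = {}, context = { scopeset: new Set() }) {
    this.type = type;
    this.fields = fields;
    this.context = context; // formerly held by Syntax
  }
  addScope(scope) {
    // non-destructive: returns a new Term with the scope added
    return new Term(this.type, this.fields,
      { scopeset: new Set([...this.context.scopeset, scope]) });
  }
  removeScope(scope) {
    const scopeset = new Set(this.context.scopeset);
    scopeset.delete(scope);
    return new Term(this.type, this.fields, { scopeset });
  }
}

// Standalone resolve: only meaningful for identifier-like terms.
// Toy binding lookup: first binding whose scopes all appear on the term.
function resolve(term, bindings) {
  if (term.type !== 'IdentifierExpression' && term.type !== 'BindingIdentifier') {
    throw new Error(`cannot resolve a ${term.type}`);
  }
  for (const { name, scopes, binding } of bindings) {
    if (name === term.fields.name &&
        [...scopes].every(s => term.context.scopeset.has(s))) {
      return binding;
    }
  }
  return term.fields.name; // unbound: fall back to the raw name
}
```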

Seem reasonable?

Now you're speaking my language! I actually started to ask you about this on Gitter a couple of weeks ago but was too tired to get my thoughts straight.

It would simplify implementation of the default readtable in some places (though I'll have to make some changes to https://github.com/sweet-js/sweet.js/blob/master/src/reader/utils.js#L234-L241) and maybe a few others. But I'm all for it!

Thanks for the invite btw! 😄

(though I'll have to make some changes to https://github.com/sweet-js/sweet.js/blob/master/src/reader/utils.js#L234-L241)

Funny story, I've already done that 😄 I was yak shaving on it earlier this week; a little more cleanup and I'll push it up for your review.

I actually think it might make sense to extract the reader+readtable to its own repo/package since it's generally useful beyond just Sweet.

I actually think it might make sense to extract the reader+readtable to its own repo/package since it's generally useful beyond just Sweet.

I was holding off suggesting this. It's the reason why I implemented Reader separately from TokenReader in the first place.

Reader, Readtable and CharStream should all be extracted together.

Something that's still nagging me is how TokenReader::locationInfo is separate from CharStream::sourceInfo. I did this on purpose as line and column numbers don't have any "real" meaning in a character stream but it still feels icky.

I would also like to answer the questions I brought up in #626 (comment).

Thinking about it some more, getting rid of Syntax would allow me to:

  1. Pull TokenReader::locationInfo into CharStream
  2. Merge TokenReader::readToken and TokenReader::readUntil into read
  3. Remove TokenReader::context (it's only being used when wrapping the tokens in Syntax)

This effectively removes the need for TokenReader entirely. All that's left are the error functions and they can live in reader/utils.js.

Instead of making those calls to this.readToken in the actions we could just call read recursively, passing in stream (which holds all of the context we need for errors). Getting rid of the context means the signature can be read(source: string | CharStream, ...rest: Array<any>). This is the same signature as Reader::read which implies we could get rid of that too! Then reader.js would just have function exports.
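The single-function shape described above might look something like this toy sketch (not the actual sweet.js reader; the CharStream here is a stand-in, and there's no error handling for stray closing delimiters):

```javascript
// Toy sketch: one recursive read over a CharStream, no TokenReader class.
// Nested delimiters recurse into read, passing the same stream along,
// so error reporting can use the stream's position directly.
class CharStream {
  constructor(source) {
    this.source = source;
    this.index = 0;
  }
  peek() { return this.source[this.index]; }
  next() { return this.source[this.index++]; }
  atEnd() { return this.index >= this.source.length; }
}

function read(source, ...rest) {
  const stream = typeof source === 'string' ? new CharStream(source) : source;
  const results = [];
  while (!stream.atEnd()) {
    const ch = stream.peek();
    if (ch === ' ') { stream.next(); continue; }
    if (ch === '(') {
      stream.next();
      results.push(read(stream, ')')); // recurse for the delimited body
      continue;
    }
    if (ch === rest[0]) { stream.next(); break; } // hit our closing delimiter
    // default action: accumulate an identifier-ish token
    let tok = '';
    while (!stream.atEnd() && !' ()'.includes(stream.peek())) tok += stream.next();
    results.push(tok);
  }
  return results;
}
```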

I don't want to make these changes now as it would hold up 3.0 release. But I think I will make a new repo for the readtable/reader/charstream stuff under my account if that's alright with you.

I forgot one thing (so far). Sweet's algorithm for regex literal detection relies on looking at the previous results. This wouldn't necessarily be the case for other projects. So each project that used the readtables would have to implement its own reader function that used Reader.
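For context, the ambiguity is the classic JavaScript one: `/` can start a regex literal or be the divide operator, and telling them apart requires looking back at what was just read. A toy version of that look-behind (greatly simplified; the real detection in sweet.js handles many more cases, e.g. parens closing an `if` condition):

```javascript
// Toy sketch of prefix-based regex detection: a slash starts a regex
// literal only when the previously read token cannot end an expression.
// This is why the reader needs access to previous results at all.
function slashStartsRegex(prefix) {
  if (prefix.length === 0) return true; // start of input: /x/ is a regex
  const prev = prefix[prefix.length - 1];
  // after an identifier or number, `/` is division
  if (prev.type === 'Identifier' || prev.type === 'Numeric') return false;
  // after `)` assume division (wrong for `if (x) /y/`; real code checks more)
  if (prev.type === 'Punctuator' && prev.value === ')') return false;
  // after operators like `=` or keywords like `return`, it's a regex
  return true;
}
```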

But I think I will make a new repo for the readtable/reader/charstream stuff under my account if that's alright with you.

Sounds good!

regex literal detection relies on looking at the previous results.

That's what I was yak shaving on in particular. I pushed up the changes I was working on to the refactor/readtable-no-syntax branch. It's still in a half-broken state but I think the regex prefix stuff is working. Those changes are here.

Could we make the reader put Token | List<Token> into the prefix and actually return a SyntaxTerm or whatever type the subclass wants? That way the regex stuff wouldn't need to change.

Thinking about Term/Syntax unification more I've become less happy about having methods like is* and from* on those objects. I think it causes more breakage for not much value. The methods were necessary previously because we didn't have a good way to get the functionality into a macro. Now that modules work better maybe we should remove all those methods and move to just module imports.

A strawman API example that is not well thought out:

import matchSyntax from 'sweet.js/match-syntax' for syntax;
import makeSyntax from 'sweet.js/make-syntax' for syntax;

syntax m = ctx => {
  let param = ctx.next().value;
  if (matchSyntax('NumericLiteral', param)) {
    return makeSyntax('identifier', 'foo', param);
  }
  return makeSyntax('null', param);
}
m 1

I'm tempted to hold off on releasing 3.0 to include this merging plus the API change; otherwise we'd just have a 4.0 right on its heels that breaks everything again.

I'm a fan. This would play really well w/ #516 (comment).

I'm tempted to hold off on releasing 3.0 to include this merging plus the API change; otherwise we'd just have a 4.0 right on its heels that breaks everything again.

Agreed. I think we should revisit the list at #485 (comment) to see what might have to break before we get there. A lot of those items could be minor version releases if we're careful.

Could we make the reader put Token | List into the prefix and actually return a SyntaxTerm or whatever type the subclass wants? That way the regex stuff wouldn't need to change.

I was thinking that readToken would function just as it currently does except that it doesn't wrap results from Reader::read. Then we could do a later pass to wrap Token | List<Token> (wrapInTerms in macro-context).

The downside of that approach is you need the additional wrapping pass. But if there's not a clean way to separate the tokenization from the wrapping we can go with that.

I think we might be talking past each other. What regex stuff are you thinking would not need changing if Token | List<Token> were put into the prefix?

I can easily have readUntil not wrap for the prefix and then wrap the returned results. I was looking to move it to reader/utils anyway b/c it's used so infrequently. I think having it as part of the public API may have been a mistake.

What regex stuff are you thinking would not need changing if Token | List<Token> were put into the prefix?

Oh, all the regex prefix stuff would need to change. Currently the prefix is a List<Syntax>, but if we go with dropping Syntax in favor of Term then it needs to change anyway. That's what I did in the branch I pushed up the other day; it now works over the recursive type TokenTree = Token | List<TokenTree>.

What I was trying to say is that regardless of the type that the reader ultimately returns from read (currently Syntax, soon Term, eventually maybe something else), the internals (like the regex prefix) should always work only on a TokenTree.

A separate question is how a TokenTree is wrapped into a Term.

One approach is a readtable that holds two lists, one a list of Term that is pushed to every time a token is read (wrapping the token immediately into a term) and the other a list of TokenTree for the prefix.

The other approach is a readtable that only produces a List<TokenTree>, which is then passed to wrapInTerms after reading has completed.

The second approach is simplest but requires two passes (one to read, one to wrap).
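A minimal sketch of that second, two-pass approach: a recursive wrap over the TokenTree shape. Names and term shapes here are hypothetical (the real wrapInTerms lives in macro-context), and plain arrays stand in for Immutable.List to keep the sketch self-contained.

```javascript
// Toy sketch of the two-pass approach: reading produces a TokenTree
// (Token | List<TokenTree>), then a separate pass wraps it into terms.
function wrapInTerms(tree) {
  if (Array.isArray(tree)) {
    // a nested list becomes a delimiter term wrapping its wrapped children
    return { type: 'RawDelimiter', inner: tree.map(wrapInTerms) };
  }
  return { type: 'RawSyntax', value: tree }; // a single token
}
```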

I might not be understanding something though so let me know if this makes sense.

I don't want that second pass either. I tried a couple of weeks ago to use Immutable.Seq until I realized that it doesn't autocache like Clojure's lazy-seq.

How about I check out the no syntax branch and try to get it working using your first option?

Sounds good!

The first option isn't going to work unless we let 'tokens.js' know about terms. The prefix is often going to have lists of terms in it (recursive descent). isParens and friends will break.

Actually, we have to do this anyway if those predicates are going to be useful outside of the reader.

That's overstating things a bit. isParens et al. just have to know that the list contains objects w/ a value property that is a Token.
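That duck-typing constraint might look something like this (hypothetical shapes; the real predicates live in tokens.js and check more than this sketch does):

```javascript
// Toy sketch: isParens only cares that the list's head is an object
// whose `value` property is a paren token, so it works whether the list
// holds raw token wrappers or terms wrapping tokens in `value`.
function isParens(list) {
  if (!Array.isArray(list) || list.length === 0) return false;
  const token = list[0] && list[0].value;
  return Boolean(token) && token.type === 'Punctuator' && token.value === '(';
}
```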

There's a PR in #634