keean/zenscript

Compiler Implementation

Opened this issue · 492 comments

keean commented

Discussion about the ongoing implementation of the compiler.

keean commented

The parser now has an implementation of the indentation aware parser described in this paper: https://pdfs.semanticscholar.org/cd8e/5faaa60dfa946dd8a79a5917fe52b4bd0346.pdf

Here's the implementation of the indentation parser:

    // Assumes Parsimmon is in scope, e.g. var Parsimmon = require('parsimmon').
    // Tracks the current indentation level so parsers can be indentation-relative.
    function IndentationParser(init) {
        this.indent = init
    }
    IndentationParser.prototype.get = function() {
        return this.indent
    }
    IndentationParser.prototype.set = function(i) {
        this.indent = i
    }
    // Succeeds if the indentation at the current position satisfies `relation`
    // (eq/ge/gt/any, defined below) against the stored indent, then updates it.
    IndentationParser.prototype.relative = function(relation) {
        var self = this
        return Parsimmon.custom((success, failure) => {
            return (stream, i) => {
                var j = 0
                while (stream.charAt(i + j) == ' ') {
                    j = j + 1
                }
                if (relation.op(j, self.indent)) {
                    self.indent = j
                    return success(i + j, j)
                } else {
                    return failure(i, 'indentation error: ' + j + relation.err + self.indent)
                }
            }
        })
    }
    // Succeeds only if the indentation at the current position equals `target`.
    IndentationParser.prototype.absolute = function(target) {
        var self = this
        return Parsimmon.custom((success, failure) => {
            return (stream, i) => {
                var j = 0
                while (stream.charAt(i + j) == ' ') {
                    j = j + 1
                }
                if (j == target) {
                    self.indent = j
                    return success(i + j, target)
                } else {
                    return failure(i, 'indentation error: ' + j + ' does not equal ' + target)
                }
            }
        })
    }
    IndentationParser.prototype.eq  = {op: (x, y) => {return x == y}, err: ' does not equal '}
    IndentationParser.prototype.ge  = {op: (x, y) => {return x >= y}, err: ' is not equal or greater than '}
    IndentationParser.prototype.gt  = {op: (x, y) => {return x > y}, err: ' is not greater than '}
    IndentationParser.prototype.any = {op: (x, y) => {return true}, err: ' cannot fail '}

This is what a parser using these new parser combinators looks like:

    block = Parsimmon.succeed({}).chain(() => {
        var indent = Indent.get()
        return Parsimmon.seqMap(
            Indent.relative(Indent.gt).then(statement),
            (cr.then(Indent.relative(Indent.eq)).then(statement)).many(),
            (first, blk) => {
                blk.unshift(first)
                Indent.set(indent)
                return {'blk' : blk}
            }
        )
    })

This parses a block of statements: the first line of the block must be indented more than the previous line, and the remaining lines must be indented by the same amount as the first line.
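For example (a hedged illustration, the statement syntax here is made up), a block like this would be accepted:

    foo()
       bar()
       baz()

Indent.relative(Indent.gt) requires the line with bar() to be indented more than the enclosing foo() line, and Indent.relative(Indent.eq) requires baz() to match bar()'s indentation exactly; a line at any other indentation fails the eq parser, so many() stops and the block ends there.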

@keean I will catch up with you later on the parser combinator implementation. I haven't employed them ever, so I will need to dedicate some time to that. My first priority is to write the grammar into an EBNF file and check that it is conflict-free, LL(k), and hopefully also context-free. I read that parser combinators can't check those attributes.

Also I will want to understand whether using a monadic parser combinator library forces our AST into a monadic structure and whether that is the ideal way for us to implement. Anyway, you are rolling on implementation, so I don't want to discourage you at all. I will try to rally around one way and help code. I will need to study. My focus so far has been on nailing down the syntax and early design decisions. Btw, congrats on getting rolling so quickly on the implementation!

Btw, I hate semicolons. Any particular reason you feel you need to litter the code with them? There are only a very few ASI gotchas in JavaScript when not including semicolons (and these I think can be checked with jslint), and these are easy to memorize, such as not having the rest of the line blank after a return, as this will return undefined.
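For example, a hedged sketch of the classic return gotcha:

function answer() {
   return        // ASI inserts a semicolon here, so answer() returns undefined
   {
      value: 42  // now parsed as an unreachable block, not an object literal
   }
}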

Also I prefer the style of this latest code compared to what I saw before, because I don't like trying to cram too many operations into one LOC. It makes it difficult to read the code IMO.

Also, I think I would prefer to employ arrow functions as follows (we'll be porting to self-hosted later so we'll have arrow functions as standard to any ES version), and to compromise at 3 spaces indentation (even though I prefer 2 spaces lately):

block = Parsimmon.succeed({}).chain(() => {
   var indent = Indent.get()
   return Parsimmon.seqMap(
      Indent.relative(Indent.gt).then(statement),
      (cr.then(Indent.relative(Indent.eq)).then(statement)).many(),
      (first, blk) => {
         blk.unshift(first)
         Indent.set(indent)
         return {'blk' : blk}
      } 
  )
})

Also I would horizontally align as follows because I love pretty code, which is easier to read:

IndentationParser.prototype.eq  = {op: eq(x, y) => {return x == y}, err: ' does not equal '              }
IndentationParser.prototype.ge  = {op: ge(x, y) => {return x >= y}, err: ' is not equal or greater than '}
IndentationParser.prototype.gt  = {op: gt(x, y) => {return x >  y}, err: ' is not greater than '         }
IndentationParser.prototype.any = {op: gt(x, y) => {return true  }, err: ' cannot fail '                 }

I may prefer:

IndentationParser.prototype.eq  = { op: eq(x, y) => {return x == y},
                                   err: ' does not equal '              }
IndentationParser.prototype.ge  = { op: ge(x, y) => {return x >= y},
                                   err: ' is not equal or greater than '}
IndentationParser.prototype.gt  = { op: gt(x, y) => {return x >  y},
                                   err: ' is not greater than '         }
IndentationParser.prototype.any = { op: gt(x, y) => {return true  },
                                   err: ' cannot fail '                 }

Above you are implicitly making the argument again that we should have the ability to name inline functions (without let) in our programming language. Note this would be an alternative solution to the ugly syntax for the case where we need to specify the return type, but afaics we can't unify around (x, y) => x == y without the prefixed name unless we don't use parentheses for anonymous product (tuple) types, if we are to remain LL(k). Any idea how ES6 is parsing their arrow functions? LR grammar? Alternatively you would be writing that in our language:

let eq(x, y) => x == y
let ge(x, y) => x >= y
let gt(x, y) => x >  y
let gt(x, y) => true
IndentationParser.prototype.eq  = {op: eq, err: ' does not equal '              }
IndentationParser.prototype.ge  = {op: ge, err: ' is not equal or greater than '}
IndentationParser.prototype.gt  = {op: gt, err: ' is not greater than '         }
IndentationParser.prototype.any = {op: gt, err: ' cannot fail '                 }

Which would have helped you catch the error of duplicating the gt name in a copy+paste typo. The only reason you are adding the redundant naming above is for browser debugging stack traces, correct?

Or (unless we change the syntax):

IndentationParser.prototype.eq  = { op: x y => x == y,
                                   err: ' does not equal '              }
IndentationParser.prototype.ge  = { op: x y => x >= y,
                                   err: ' is not equal or greater than '}
IndentationParser.prototype.gt  = { op: x y => x >  y,
                                   err: ' is not greater than '         }
IndentationParser.prototype.any = { op: x y => true,
                                   err: ' cannot fail '                 }
keean commented

The main reason to use function is backwards compatibility, not all browsers support => yet.

With regards to our syntax, function definition should be an expression, so you should be able to include it inline in the object declaration. I think we would end up with something like this:

data Relation = Relation { op : (A, A) : Bool, err : String }

let eq = Relation { op: eq(x, y) => x == y, err: ' does not equal ' }

@keean wrote:

The main reason to use function is backwards compatibility, not all browsers support => yet.

I know. That is why I wrote:

Also, I think I would prefer to employ arrow functions as follows (we'll be porting to self-hosted later so we'll have arrow functions as standard to any ES version

I had already explained we will get backwards compatibility for free, and by not putting function we are more compatible with the way it will be written in our language when we port over.

Who can't run our compiler in a modern browser in the meantime? This is only alpha.

Please re-read my prior comment, as I added much to the end of it.

keean commented

Regarding semi-colons, Douglas Crockford in "JavaScript: The Good Parts" recommends always using semi-colons explicitly, because JavaScript's semi-colon insertion can result in the code not doing what you intended.

keean commented

I think you are right about '=>' for functions, as it is running in Node which supports them; however, I don't think porting will be that straightforward, as we won't directly support prototypes etc.

@keean wrote:

because JavaScript's semi-colon insertion can result in the code not doing what you intended.

Did you not read what I wrote?

There are only a very few ASI gotchas in JavaScript when not including semicolons (and these I think can be checked with jslint), and these are easy to memorize, such as not having the rest of the line blank after a return, as this will return undefined.

http://benalman.com/news/2013/01/advice-javascript-semicolon-haters/

keean commented

Regarding semi-colons:

... the specification is clear about this. JavaScript’s syntactic grammar specifies that semi-colons are required. Omitting semi-colons results in invalid JavaScript code. That code won’t throw (thanks to ASI), but it’s still invalid.

Semicolons won't help you here:

return
   some long shit;

You have to know the rules, whether you use semicolons or not. That is why I am happy we are going to use Python-style indenting.

Semicolons are training wheels that don't protect against every failure.

keean commented

Also jshint wants you to put them in, and I am using jshint as part of the build process.

jshint catches the above error :-)

JSHint can be configured to allow ASI. And I think it will still warn you about ambiguous implicit cases, if I am not mistaken (it should).

keean commented

without semi-colons JSHint cannot recognise the above error because you might mean:

return;
some long stuff

or

return some long stuff;

Bottom line: if you have something at the start of a line which could possibly be a line continuation, then check to make sure you have made it unambiguous.
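For example (a hedged sketch), a line starting with ( is a common ambiguous case:

var a = b
(c || d).forEach(f)   // without a semicolon, this parses as: var a = b(c || d).forEach(f)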

That is the simple golden rule and it applies whether using semicolons or not. That is not complicated. One simple rule.

keean commented

JavaScript was never designed to be used without semi-colons... let's design our new language not to require them, but I don't see any point in fighting JavaScript... We will emit the semi-colons into JS :-)

@keean wrote:

without semi-colons JSHint cannot recognise the above error because you might mean:

It should be warning that the case is ambiguous. I can't be faulted for the JSHint programmers being derelict (if they are; I did not confirm).

@keean wrote:

JavaScript was never designed to be used without semi-colons...

The intent isn't relevant. What is, is what is relevant. We need to know the rules whether we use semicolons or not. We are not stupid programmers who need to make ourselves feel we are more secure by not knowing the rules. I lost my training wheels 30 years ago.

Otherwise we need to find a linter that warns of all ambiguous cases with or without semicolons.

Bottom line: if you have something at the start of a line which could possibly be a line continuation, then check to make sure you have made it unambiguous.

That is the simple golden rule and it applies whether using semicolons or not. That is not complicated. One simple rule.

If JSHint isn't doing that checking, then it is derelict. Need to find a better linter.

Wonder if Douglas Crockford ever considered that. If some influential people decide that semicolons everywhere is the prescribed way, then why the heck did JS offer ASI anyway?

Perhaps he could have realized that the only sure way, is to have a linter which properly warns of every ambiguous case, whether using semicolons or not. Instead perhaps these talking heads influenced the JSHint people to not add proper checking for the ASI case? Sigh.

keean commented

So here's what the guy that created JS thinks: https://brendaneich.com/2012/04/the-infernal-semicolon/

It doesn't matter. It is just logic.

There you go. Crockford doesn't agree to support ASI in his tool and thus promulgates that ASI is an error:

Some argue that JSMin has a bug. Doug Crockford does not want to change JSMin, and that’s his choice.

That's right:

And (my point here), neither is newline.

Know the rules. Newline is neither a statement nor an expression terminator in JavaScript. Simple as that. Resolve all ambiguous cases.

It is analogous superfluous redundancy: one wouldn't write ;;;;;;; at the end of every statement or expression to make sure they got it right. Likewise, one doesn't need to write ; to make sure they got it right, if they are using a linter which can warn them when the preceding expression on the prior line could be joined to the next line, and thus that a semicolon or other syntax needs to be inserted to resolve the ambiguity.

keean commented

The moral of this story: ASI is (formally speaking) a syntactic error correction procedure. If you start to code as if it were a universal significant-newline rule, you will get into trouble. A classic example from ECMA-262:

keean commented

So I don't write code with syntactic errors... I write Python without them, I write C++ with them... it doesn't bother me, I go with what the language standard says...

The moral of this story: ASI is (formally speaking) a syntactic error correction procedure. If you start to code as if it were a universal significant-newline rule, you will get into trouble. A classic example from ECMA-262:

Then why did he put it in JavaScript? Linters should do their job correctly.

There is absolutely no reason you ever need a ; after a block { }. The } terminates the statement or expression.

keean commented

I wish I had made newlines more significant in JS back in those ten days in May, 1995. Then instead of ASI, we would be cursing the need to use infix operators at the ends of continued lines, or perhaps \ or brute-force parentheses, to force continuation onto a successive line. But that ship sailed almost 17 years ago.

I wish I had made newlines more significant in JS back in those ten days in May, 1995. Then instead of ASI, we would be cursing the need to use infix operators at the ends of continued lines, or perhaps \ or brute-force parentheses, to force continuation onto a successive line. But that ship sailed almost 17 years ago.

There you go. JS can't require semicolons. So why do you? Probably because we can't use a proper (non-derelict) linter, since JSHint probably doesn't warn of all ambiguities with 'ASI' enabled (but I didn't confirm that).

We are moving to block indenting to avoid this entire mess.

keean commented

Okay, so conclusion, I will use '=>' for anonymous functions, but leave the ';' in for now...

Our language won't require semi-colons, just like Python does not...

I go with what the language standard says...

The language standard says ASI is a supported feature. You have to know the rules whether you use semicolons or not. I will not repeat this again.

Let's try to find a linter which isn't brain-dead.

keean commented

My two cents: be careful not to use ASI as if it gave JS significant newlines. And please don’t abuse && and || where the mighty if statement serves better.

The standard says it is a syntax error to omit the semi-colon.

The standard says it is a syntax error to omit the semi-colon.

Then why does it compile? Brendan told you that JS cannot require semicolons everywhere because it breaks other things.

keean commented

You are right. If you really can't work with the semi-colons, I will get rid of them for this project.

keean commented

Can I delete the semi-colon discussion, as it's cluttering the implementation thread... I am going to remove them.

I discovered this does not work inside => defined functions... that is a bit weird.

I don't understand why deleting my disagreement helps. You have your freedom to do it your way and I have my freedom to speak my logical and engineering disagreement.

ASI is a feature of JavaScript that can't be removed for the reason Brendan (its creator) explained. Ostensibly Douglas Crockford and other (corrupt?) people in the standards process get together and decide that they are too lazy to write tools that fully support the language (including apparently JSHint's derelict failure to offer even a flag to report ambiguous cases of ASI?), so they decide to decree that ASI is a syntax error when in fact it is required by the language and does not in fact generate a syntax error. A programmer could accidentally insert an ASI case, and the language will not generate a syntax error. It is entirely derelict and a bastardization of the language. Brendan even agrees with me if you read carefully what he is saying. He has no choice but to go along with the community of derelictness, because it is a political game. Look what happened to Brendan recently because of his personal politics. Corrupt world we live in, and I will not be a supporting member of the corruption by following illogical decrees and bastardization of what is. Note I can't accuse any person or organization of corruption, because I don't know that to be the case. It can just be human nature and design-by-committee outcomes. ASI is part of JavaScript, regardless of some meaningless words inserted after the fact by some standards committee. And tools that don't support the language fully are derelict.

I am a rebel. And I will remain one. But of course, I will do that is reasonable after registering my disagreement.

Look how successful the drive to not support ASI has been. The world's most popular language apparently doesn't even have an open source linter that is compliant with the language? I am an engineer, not a politician.

I register my disagreement with kowtowing to the derelict refusal to adhere to the features of the language. My point to them is: remove ASI from the language if they don't want tools to fully support it, instead of using deception and politics to influence a lack of support from tools.

Proofs by appeal to authority are less convincing to me than proofs of engineering facts.

keean commented

Doesn't JSHint work? Seems to work for me with 'ASI' enabled.

@keean wrote:

Doesn't JSHint work? Seems to work for me with 'ASI' enabled.

I don't know. Is it warning of all ambiguous cases with 'ASI' enabled? I haven't checked. I assumed not, because you were so resistant.

I just feel we are expert programmers and if our linter informs us of all ambiguous cases, then I see no logical reasons to be forced to insert semicolons. Perhaps one could make an argument that without a linter, then semicolons are safer and in a large organization maybe they make that decision. We are not a large organization. By the time we get to having many people contributing, hopefully we will have ported to self-hosted to get entirely away from this ASI issue.

I sympathize with Brendan, he has to politically navigate a language design with some flaws.

Now if JSHint is not working correctly with 'ASI' enabled in terms of warning of all gotchas, one can make a case that we should just acquiesce. Seems many others have. But what bothers me is the antagonists (protagonists for deprecating ASI) don't argue it that way ("hey we made derelict tools, so you have no choice" would be more honest). They use an incorrect logic which fools the readers.

Sorry I didn't have the energy to explain that earlier. I needed to go cook and eat. So now I have the energy to explain my side. Maybe I will acquiesce if there is no good linter, but I will still register my frustration with bastardization by influential people. (Note I am not knowledgeable about how it played out over the years, I am just guessing.)

Of course I can work with semicolons if we must, I am just unhappy with decrees which are not correct from an engineering perspective.

@keean wrote:

So I don't write code with syntactic errors

ASI is not a syntax error in JavaScript, regardless of some decree. If it were a syntax error, it would refuse to compile and run. I know the distinction between 'error' and 'warning', regardless of some "deception" (for lack of a better word) in an official document.

I discovered this does not work inside => defined functions... that is a bit weird.

=> is not the same as function. => doesn't create a lexical scope, this works differently because of it, I could elaborate if you want, but googling fat arrow syntax for JavaScript should explain it.
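A minimal sketch of the difference (just an illustration, not code from the compiler):

var obj = {
  n: 1,
  arrow: () => this,                    // `this` is taken from the enclosing scope, not obj
  classic: function () { return this }  // `this` is obj when called as obj.classic()
}
obj.classic() === obj   // true
obj.arrow() === obj     // false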

I hope that is more fair. What I am saying is that if JSHint is not derelict, then we could (if we choose to) get rid of superfluous semicolons. But if we have no good linter that can check ASI ambiguities for us, then I guess we have to acquiesce and use superfluous semicolons per tools support. I am just a teeny bit miffed about appeals to authority, as the rationalization.

@SimonMeskens wrote:

I discovered this does not work inside => defined functions... that is a bit weird.

=> is not the same as function. => doesn't create a lexical scope, this works differently because of it, I could elaborate if you want, but googling fat arrow syntax for JavaScript should explain it.

Question is do we need also the distinction in our language?

I presume not having a lexical scope means all references constructed are accessible in the outer lexical scope. When do we need this? It seems to be the antithesis of what a function is supposed to be.

Thus maybe @keean should be using function and not fat arrow, if our language will always create a lexical scope for all functions.

Good point. Btw, I knew about the lack of this, but had not paid attention to the lack of a lexical scope.

Edit June 4, 2017: arrow functions have closures over lexical scope. The only difference is that this is scoped lexically, which basically means they should not be used for methods that will be called with the . syntax.

If we keep it, we shouldn't keep the concept of this as it is in JavaScript today, that's for sure. JavaScript combines multiple concepts and gives them a confusing name (this in JavaScript is NOT self, like in about every other language featuring the this keyword, but it sorta acts like self in common cases, which is highly confusing).

With type classes, you don't need this at all I think, I vote we pull it out completely and prefer a system of type classes, with maybe an extension method (through partial application) operator. For example:

function x(a: A, b: B) {
  return a + b
}

All functions can be partially applied through some syntax (or maybe all functions get curried by default?). Extension method syntax would then simply be syntactic sugar for partial application.

Something like this:

x(a)(b)

is equivalent to:

a::x(b)

is equivalent to:

a::b::x()

Now, the first syntax is just regular old partial application through currying, the second syntax is just shorthand for partial application but looks conveniently like classic OO, and the third example uses that syntax in a more exotic fashion; it is mainly a curiosity, but might be useful since it allows you to write in some sort of postfix notation.

In reality, the third example would probably look more like this:

let x1 = a::x // Meaning, apply a to x as the first parameter

... some more code

let x2 = b::x1 // Apply b to x1 as the second parameter of x

... some more code

x2() // execute the pre-bound function

This system completely foregoes any need for function context or self and thus conveniently scraps the confusing this from the language.

@SimonMeskens wrote:

...we shouldn't keep the concept of this as it is in JavaScript today, that's for sure.

We are not.

With type classes, you don't need this at all I think

If we discard this, we can't call methods of the typeclass with familiar dot syntax. I proposed static for methods which are supposed to be independent of any instance selecting the implementation.

prefer a system of type classes, with maybe an extension method (through partial application) operator

I don't comprehend how you think extension methods apply to typeclasses. In my mind, extension methods are a different concept and not related.

@keean I don't think we need functions whose bodies are not lexically scoped. If you agree, I suggest we never emit ES6 fat arrow functions. See edit. Thus you should be using function as you were, to support compatible semantics.

keean commented

@shelby3 I agree, we emit function.

I think we can use . notation for scoping and modules; we don't need it for type class functions and constructors.

When you have multiple dispatch on typeclasses, having one object in front of the function does not seem necessary.

I think I prefer:

let x3 = add(x2, x1)

To

let x3 = x2.add(x1)

Because we would select the overloaded type-class function to run based on the type of both arguments, not just the first.

And we would still have modules so you might have:

let x3 = math.add(x2, x1)

because add is defined in the math module. However we should have flexible importing that lets you rename and bring a module's functions into scope without a prefix if wanted.
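To make the both-arguments point concrete, here is a hedged sketch in plain JavaScript of how dispatch on both argument types might be simulated (the typeName helper and the implementation table are assumptions for illustration, not anything the compiler emits today):

var add_impls = {
    'Int,Int': function (x, y) { return x + y },
    'Str,Str': function (x, y) { return x.concat(y) }
}
function add(x, y) {
    // typeName is an assumed helper returning a runtime type tag for a value
    return add_impls[typeName(x) + ',' + typeName(y)](x, y)
}

Dot dispatch x2.add(x1) would key only on the type of x2, which is why the prefix form reads better here.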

@shelby3 I showed you how partial application IS extension methods, how a prefix partial application syntax corresponds exactly to dot notation, and how that allows us to completely remove this altogether. Which part isn't clear, so I can elaborate? The only reason I picked the arbitrary :: as the partial application syntax is because I don't think it's useful to overload dot notation even more, but it's completely equivalent.

@keean I don't think the most frequent case will be implementing typeclasses by pairs of data types. I think single data type implementations will be much more common, and Eq and Add will not work with pairs of types, but rather the same type for both arguments. We want to avoid implicit conversions, as that appears to be a design error in many languages such as C and JavaScript. Please correct me if you know or think otherwise.

As I had argued, it doesn't seem to make any sense to a programmer coming from OOP that they've specified an interface bound on their function argument, and yet they are not supposed to call the methods using dot syntax. I think it is breaking the way mainstream (Java, C++, JavaScript, PHP, etc) people naturally think about it. I explained that static makes it explicit when this doesn't apply. I much prefer explicitness in this case, as it is not any more verbose.

I prefer to keep our typeclass concept as familiar as possible to the way people already think, especially given there are no downsides to doing so.

@SimonMeskens wrote:

I showed you how partial application IS extension methods, how a prefix partial application syntax corresponds exactly to dot notation, and how that allows us to completely remove this altogether. Which part isn't clear, so I can elaborate?

Oh yeah, Kotlin does this, and I too realized 5 years ago that extension methods are equivalent to partial application over a function passing the this, so of course I agree.

But I don't see how that is applicable to typeclasses?

This system completely foregoes any need for function context or self and thus conveniently scraps the confusing this from the language.

Afaik, there is no confusion in Java about this. Afaics, JavaScript's problems with this (being a prototype language, not strictly OOP) are not related to my proposal for using this for typeclasses.

If we discard this, we can't call methods of the typeclass with familiar dot syntax.

I was replying to that comment. Basically, this is how this (function context) works in JavaScript:

function x(this: A, b: B) {
  return this + b
}

a.x(b);

TypeScript lets you make this explicit, but this is just an extra parameter to the function containing the call-site variable.

What I'm saying is that this is equivalent to:

function x(a: A, b: B) {
  return a + b
}

a::x(b);

When we assume :: is partial application, right? So what if we make it so dot notation is partial application, like in C#? We replace :: with . and the code looks like this:

function x(a: A, b: B) {
  return a + b
}

a.x(b);

This is exactly the familiar dot notation. The only difference is that I say: if the function has a call-site, partially apply that call-site as the first parameter. What you say is: if the function has a call-site, store it in an implicit this variable. I'm just saying that the second example is, in my opinion, more clear, because it avoids using the word this for the call-site. If you disagree with me, I highly recommend you change the name to context instead, so there's no confusion with this in Java or C#, where it means closure of self, a completely different concept that, confusingly, works similarly in common cases. This trips up lots of programmers today in JavaScript.

Also, here's an example of the same code in C#:

static x(this A a, B b) {
  return a + b;
}

a.x(b);

Basically, they add the this keyword in front of the first parameter to make it explicit to the programmer that the first parameter might be the call-site, and they only allow you to call functions marked like this with dot notation.
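In today's JavaScript the same equivalence can be sketched like this (a hedged illustration of the two views, not a proposal for the emitted code):

function x(a, b) {   // explicit first parameter, playing the role of the call-site
  return a.val + b
}

var obj = { val: 1 }
x(obj, 2)                      // the call-site passed explicitly: 3
var x1 = x.bind(null, obj)     // partial application: roughly what a::x (or obj.x) would mean
x1(2)                          // 3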

@shelby3 wrote:

@keean I will catch up with you later on the parser combinator implementation. I haven't employed them ever, so I will need to dedicate some time to that. My first priority is to write the grammar into an EBNF file and check that it is conflict-free, LL(k), and hopefully also context-free. I read that parser combinators can't check those attributes.

I am working on the LL(k) EBNF grammar now, and I think you will have difficulty getting this correct with parser combinators, at least without the guidance of a checked EBNF grammar. I already found conflicts (ambiguities in the grammar) that would not have occurred to me without the SLK check. For example, we can't put a function definition declaration inside an if nor else expression. I will publish asap (hope today), so you can see what I am referring to and so you may offer your reaction and/or corrections.

refresh prior comment

keean commented

@shelby3 That's why I used the formal indentation parsing PEG from the paper I linked to. Ad-hoc approaches are likely to be error prone. With these combinators you can introduce a parser with an absolute indent, or a relative indent. A relative indent can be equal, greater-equal, or greater than the previous line's indent. This easily lets us define a correct parser for a statement block.

keean commented

@shelby3 So if we decide to keep dot notation, that stops us having type-class methods that do not take the 'base type' as an argument. For example, let's say we want a type-class to declare the + operator. I would do something like:

typeclass Add<A>:
    (+): (x : A, y : A) : A

implement Add<Int>:
    (+) : (x, y) => primitive_add_int(x, y)
keean commented

I now have a compiler front-end that links the parser and generator. Here's an example of it in action:

Input file "t1.zs"

let id = id(x) => x
id(42)

Run command:

node src/compiler.js t1.zs

Output file "t1.js"

id=function id(x){return x;};id(42);

It is still using the earlier function syntax though.

It also dumps the AST into a file "t1.ast" for debugging:

{
  "status": true,
  "value": {
    "blk": [
      {
        "ass": "id",
        "exp": {
          "fn": "id",
          "args": [
            "x"
          ],
          "body": {
            "rtn": {
              "var": "x"
            }
          }
        }
      },
      {
        "app": "id",
        "args": [
          {
            "lit": 42
          }
        ]
      },
      {
        "eof": ""
      }
    ]
  }
}
keean commented

Usage instructions for prototype compiler here: #20

I think better than using parser combinators would be to simply create data structures and parsers based on the grammar I have written. These data structures would then comprise the AST. We can then write any code we want to process the AST for any reason, such as transpiling, compiling, macros, etc..

https://en.wikipedia.org/wiki/LL_parser#Concrete_example

I believe the parser combinators are less maintainable, can't do the separation of concerns and special interactions with an orthogonal lexer, have serious performance deficiencies, and are highly obtuse as compared to the alternative above, which would have a 1-to-1 mapping with the coherent grammar file.
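To make the comparison concrete, here is a hedged sketch of the kind of hand-written, grammar-mirroring code I have in mind, reusing the ass/exp node shape from the AST dump above (the tokens interface, the example production, and parseExpression are hypothetical, for illustration only):

// production: assignment: IDENT '=' expression
function parseAssignment(tokens) {
   var name = tokens.expect('IDENT')
   tokens.expect('=')
   var exp = parseExpression(tokens)   // the function for the `expression` production, defined likewise
   return { ass: name, exp: exp }      // the AST node mirrors the production 1-to-1
}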

keean commented

@shelby3 wrote:

I believe the parser generators are less maintainable

This is not my experience, having worked on several reasonably complex parsers commercially, including HTML5 and JavaScript parsers in the past. Parser generators never work because you end up having to hand modify the code they output, and the output code is normally of poor quality. Writing your own recursive descent parser quickly becomes unmaintainable and very difficult to make changes to. On a couple of projects in the past I have rewritten the parsers from hand-written recursive descent to parser combinators, resulting in shorter and more readable code that you can actually make changes to without messing up the whole grammar (modularity is a very important feature, and yes, modularity does require backtracking; in my experience the loss in performance is minor compared to the gain in the ability to change and add new parsers).

These data structures would then comprise the AST. We can then write any code we want to process the AST for any reason, such as transpiling, compiling, macros, etc..

Again, not my experience: once you introduce object hierarchies and inheritance into the AST it adds a large amount of boilerplate to the code. For example take a look at my code generator for JavaScript:

// emit a JavaScript function expression for a function-definition node
function gen_fn(ast) {
    return 'function ' + ast.fn + '(' + ast.args.join(',') + '){' + gen_blk(ast.body) + '}'
}

// emit an expression, dispatching on the node's tag
function gen_exp(ast) {
    if (ast.lit) {
        return ast.lit.toString()
    } else if (ast.var) {
        return ast.var
    } else if (ast.app) {
        return ast.app + '(' + ast.args.map(gen_exp).join(',') + ')'
    } else if (ast.fn !== undefined) {
        return gen_fn(ast)
    } else {
        return ''
    }
}

// emit a statement, appending the trailing semicolon where needed
function gen_stmt(ast) {
    if (ast.fn) {
        return gen_fn(ast)
    } else if (ast.decl) {
        return 'var ' + ast.decl + '=' + gen_exp(ast.exp) + ';'
    } else if (ast.ass) {
        return ast.ass + '=' + gen_exp(ast.exp) + ';'
    } else if (ast.rtn) {
       return 'return ' + gen_exp(ast.rtn) + ';'
    } else if (ast.app) {
        return ast.app + '(' + ast.args.map(gen_exp).join(',') + ');'
    } else {
        var exp = gen_exp(ast)
        if (exp !== '') {
            exp += ';'
        }
        return exp
    }
}

// emit a block of statements, or a single statement
function gen_blk(ast) {
    if (ast.blk) {
        return ast.blk.map(gen_stmt).join('')
    } else {
        return gen_stmt(ast)
    }
}

// entry point: generate JavaScript source from an AST
var exports = function generate(ast) {
    return gen_blk(ast)
}

Have you any idea of the amount of boilerplate the object-oriented approach would generate? And it gets worse when you want to do subtree-matching and tree-rewriting for the intermediate pass stages of the compiler. I have treated the JS objects as a tagged union, as we would write data Ast = Blk {...} | ... and we can then simply match on the tag.
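For instance, a hedged sketch of another pass over the same tagged-union AST, matching on the same tags (only an illustration, not code from the repository):

// count the statements in a block by matching on the node tags used above
function count_stmts(ast) {
    if (ast.blk) {
        return ast.blk.reduce(function (n, stmt) { return n + count_stmts(stmt) }, 0)
    } else if (ast.eof !== undefined) {
        return 0
    } else {
        return 1
    }
}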

separation of concerns and special interactions with an orthogonal lexer

In my experience this rapidly turns into unmaintainable spaghetti where changes have unintended consequences to the parsing, which makes the whole thing ossify and creates huge amounts of boilerplate.

serious performance deficiencies

This is not backed up by fact. Parsimmon is the second fastest JavaScript parser in the tests I posted a link to. None of the EBNF parser generators produced faster code, and the fastest library lacked any of the simplicity of parser composition.

Parser combinators are not obtuse, they are based on parsers that can be composed. Composition is one of the most desirable properties for code to be modular, readable, and to allow you to reason locally. Unfamiliarity is not the same as obtuseness :-)

keean commented

On the other hand, I think it is very useful to have the grammar written down in a way we can discuss it and improve it. You are right that it would be hard to do that just looking at the code.

@keean: Can we rewrite the code with a DSL (fluent interface) in such a way that it looks closer to a grammar, so we can reason better about the code?

We could also potentially make the compiler two-step:

  • Generate a CST from a grammar definition using combinators
  • Turn CST into AST

I think this might have the same simplicity and good performance characteristics. The advantage is that the CST producing combinators will look very close to grammar and be easier to reason about.

keean commented

@SimonMeskens I am open to any idea that makes the code easier to work with. I have written this sort of stuff before and I know what works for me, but I am open to anything that may be better than the way I am currently doing it.

The problem with the EBNF is that it has no information about how to assemble the information into the AST tree, consider:

type-variables: '<' VID { ',' VID } [ ',' ] '>'       // VID (type variable identifier)

What AST node does this information go into? which parts of the syntax are relevant and which not. We probably want a single list of VID as output but this would output a single VID and then optionally a list of VIDs. This is in fact not a parser at all, but a recogniser. Something has to take the recognised strings and turn them into a data-structure, and this is completely missing. So the actual parser will by necessity look a bit different. Parser combinators allow us to combine recognition with production. This would look something like this using parser-combinators:

var comma = Parsimmon.string(',')   // the comma separator parser used below
var vid = Parsimmon.regex(/[A-Z]+/)
var typeVariables = Parsimmon.string('<').then(Parsimmon.seqMap(vid, (comma.then(vid)).many(), (first, rest) => {
        rest.unshift(first) // merge single vid with list
        return new TypeVariables(rest) // create AST node
    })).skip(Parsimmon.string(',').atMost(1)).skip(Parsimmon.string('>'))

To help read it: for the composition to succeed, all the combinators must succeed; then ignores the output of the parser to its left; skip ignores the output of the parser to its right; many expects zero or more of the parser it is applied to and turns its output into a list; atMost is like many but has a limit (so optional means the same as atMost(1), we could define an alias if we like). Finally seqMap is a convenience method that combines a seq followed by a map. seq runs one parser after another, and map is what turns a recogniser into a parser, allowing you to convert the matched strings and lists of strings into abstract output (in this case an instance of the TypeVariables class). So this parser takes the EBNF syntax and outputs the AST object directly. It can be combined with other parsers compositionally to build up the complete parser.
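For instance (a hedged usage sketch, assuming the TypeVariables class is defined):

typeVariables.parse('<A,B,C>')
// { status: true, value: <a TypeVariables node holding ['A', 'B', 'C']> }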

So I am completely open to other solutions if they are clearer, more maintainable and don't introduce lots of boilerplate.

@keean wrote:

Parser generators never work because you end up having to hand modify the code they output, and the output code is normally of poor quality.

I'm thinking we write our own code, just using the grammar as a guide. I doubt it is worth writing our own generator, but maybe. We could start by writing our own code for the data types that represent the rules of the grammar.

Writing your own recursive descent parser quickly becomes unmaintainable and very difficult to make changes to

I don't understand what makes parser combinators superior to recursive descent?

you can actually make changes to without messing up the whole grammar (modularity is a very important feature

How does changing the code of a grammar production mess up the whole grammar? I think it is possible to model the grammar in a way that is modular for each production.

Perhaps I need to start coding to show what I mean and/or to discover that you are correct?

and yes, modularity does require backtracking. In my experience the loss in performance is minor,

Minor when you are backtracking for every OUTDENT?

compared to the gain in the ability to change and add new parsers

What do you mean? You mean change the grammar?

Have you any idea of the amount of boilerplate the object-oriented approach would generate?

No. Perhaps I should write some code so I can find out.

I don't know why you are assuming that I proposed subclassing.

I have treated the JS objects as a tagged union, as we would write data Ast = Blk {...} | ... and we can then simply match on the tag.

Good to see that you see why unions are so important.

In my experience this rapidly turns into unmaintainable spaghetti where changes have unintended consequences to the parsing, which makes the whole thing ossify and creates huge amounts of boilerplate.

Perhaps because there wasn't a coherent grammar and coherent specification of the interaction.

I can't quickly visualize how anything works with your parser combinator code, at least not yet (maybe it will make more sense to me later). It is all greek to me at this time. Even if I come up to speed on the library you are using, I don't know how abrogating lexing is going to make more comprehensible that which apparently can't be properly conceptualized without lexing.

This is not backed up by fact. Parsimmon is the second fastest JavaScript parser in the tests I posted a link to. None of the EBNF parser generators produced faster code, and the fastest library lacked any of the simplicity of parser composition.

I don't assume (don't know if) those stated tests are necessarily representative. I say you make a parser combinator version and I will make another version. And then let's compare performance. But I have no idea how we will exhaustively test that your parser combinator version is compliant to the grammar. That flaw is for me enough by itself to make it not the "correct" way to do it.

Parser combinators are not obtuse,

Afaik so far, they seem to obfuscate the 1-to-1 mapping to the LL(k) grammar afaics thus far.

they are based on parsers that can be composed.

I am not aware (yet) of what is non-composable about what I have proposed.

Composition is one of the most desirable properties for code to be modular, readable, and to allow you to reason locally.

I have not disagreed with this.

Unfamiliarity is not the same as obtuseness :-)

I am referring to lack of obvious mapping between the grammar and encoding.

I am replying in order of reading your comments, so to the extent that the following addresses any of what I wrote in the prior comment, then it has been noted.

@keean wrote:

var vid = Parsimmon.regex(/[A-Z]+/)

A handcoded lexer would be faster than regular expression matching. I could agree this is premature optimization if it were the only factor, but we also have to factor in whether our design has the necessary separation of concerns so that we could even support a handcoded lexer, or whether we lock ourselves into the inertia of an inefficient design.

The problem with the EBNF is that it has no information about how to assemble the information into the AST tree, consider:

type-variables: '<' VID { ',' VID } [ ',' ] '>' // VID (type variable identifier)

What AST node does this information go into?

The TypeVariables type, which is a list of VID.

which parts of the syntax are relevant and which not. We probably want a single list of VID as output

Agreed.

but this would output a single VID and then optionally a list of VIDs.

The TypeVariables could be organized as a product type (tuple) of a single VID and an optional list. If the optional list were a separate production, then afaics it would require it to be organized as a tuple.

This is in fact not a parser at all, but a recogniser. Something has to take the recognised strings and turn them into a data-structure, and this is completely missing.

It is a parser into a data structure that is the rules of the grammar. If you want to transform that data structure into another one, then you run a transformation on it after initial parsing.

Parser combinators allow us to combine recognition with production.

I know you know that combining and/or conflating is the antithesis of modularity.

This would look something like this using parser-combinators:

Conceptually what I am thinking of is very similar to that, except the lexer is unconflated.

I should write some code asap, so I can compare and really dig into the details and try to understand how parser combinators have solved some problem I may encounter in that learning process.

keean commented

@shelby3 wrote:

Perhaps I need to start coding to show what I mean and/or to discover that you are correct?

I don't want to be inflexible, and I am happy to be proved wrong (because being wrong means I have learned something, which makes it all worthwhile), but I have fought some battles here and don't particularly feel like fighting them again. If we put the code side by side then we can choose the most beautiful. This would also need some comparison of changeability, like introducing a new operator, or changing the expression grammar.

What do you mean? You mean change the grammar?

Yes. Language grammars never remain fixed. We will want to add new features or fix annoyances that come from actually trying to use the language. Maintainability is more important than the time it takes to write, because the software will be written in finite time, but will effectively be in maintenance for the rest of its life.

Minor when you are backtracking for every OUTDENT?

In my own performance testing it made little difference. The reason is the whole file is in memory anyway, and backtracking is just moving a pointer. In a garbage collected language we don't even need to clean up the AST allocations we just drop them for the GC to clean up later. What causes a problem is a recursively ambiguous grammar, where you have a choice point, and then recursively have multiple choice points before the first choice point is resolved. With the OUTDENT you cannot have one outdent backtrack inside another because it is trailing. In other words we have always decided whether the OUTDENT ends the block before we move onto the next line.

Perhaps because there wasn't a coherent grammar and coherent specification of the interaction.

This is part of it, but you don't want to update a formal grammar and re-implement the whole parser just to try out an experimental feature. With test-driven development the idea is that the unit tests cover all the grammar cases, and you modify the code until it passes the unit tests. Yes, it's great to have a formal specification, but if we want to encourage new contributors it has to be easy to hack on.

I don't know how abrogating lexing is going to make more comprehensible that which apparently can't be properly conceptualised without lexing.

Lexing is not necessary, and makes the whole thing harder to work with. By the time we have added multi-line comments and strings, which require different tokenisation, it all becomes a mess. Just think of parser-combinators doing lazy tokenisation. The terminal parsers like var identifier = Parsimmon.regexp(/[a-z][a-zA-Z_0-9]*/) are the equivalent of tokens.

It's a very functional-programming approach - imagine you take the lexer and, instead of having it return simple static tokens, you replace it with a bunch of functions that test if the current token is A, B or C. Now depending on the grammar the parser will have certain expectations of what the next valid token is, say either A or B. Now you can simply ask the lexer whether the next token is an 'A' or a 'B'. If you now make the lexer lazy, so that it only tests the stream when asked to match a particular token, and only advances the stream if the token matches, you have the basic idea of how the combinators achieve the same thing as a lexer. It is effectively a transformation of the program in function space.
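A hedged sketch of that idea with Parsimmon (the identifier/number terminals and the tiny let rule below are just an illustration, not our grammar):

var Parsimmon = require('parsimmon')

// terminal parsers play the role of tokens; they only inspect the stream when asked to
var identifier = Parsimmon.regexp(/[a-z][a-zA-Z_0-9]*/)
var number     = Parsimmon.regexp(/[0-9]+/).map(Number)

// the grammar decides which "token" to ask for next
var letStmt = Parsimmon.string('let ').then(
    Parsimmon.seq(identifier, Parsimmon.string(' = ').then(number)))

letStmt.parse('let x = 42')   // { status: true, value: [ 'x', 42 ] }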

Afaik so far, they seem to obfuscate the 1-to-1 mapping to the LL(k) grammar afaics thus far.

That is only because you have not dealt with the production of the AST from the recognisers. Again I may be wrong, and generating a big tree of strings directly from the grammar and then reworking the string tree into the AST may be neater. Maybe I have been doing it wrong all this time.

I know you know that combining and/or conflating is the antithesis of modularity.

There are different kinds of modularity. Keeping the parser together is one, keeping all the functions that affect a particular AST node from all different phases of compiling is another.

I am referring to lack of obvious mapping between the grammar and encoding.

I find it easy to read the EBNF and check the parser combinator is doing the same thing, and I am sure this would be the case for anyone that spent a bit of time working with the combinators, so I am not worried about this. I think correctness can be observed by reading the code and the EBNF. Of course there is no guarantee the EBNF is correct, only that it is LL(k)...

I think it would be great to find a new easier way to write parsers, so in some way I hope you are right. On the other hand I definitely find the combinators easier to work with than other recursive descent code I have worked on, but maybe you have a better way.

Is there a reason for writing the compiler in Javascript? It's a really bad choice unless constraints such as having to run the compiler inside a browser force it.

If you can use a command line tool as a parser I can give you one that accepts most EBNF, with user actions in Scheme, and outputs S-expressions. (Almost) the whole of the Felix language is defined this way, and the grammar is actually in user space in the library (so the parser automaton is actually generated dynamically on the fly, with caching for performance).

Then you can get on with the important bits of the language and worry about writing a parser when the language is strong enough to bootstrap itself.

@skaller I'm not sure why involving a third language would be useful? I doubt Scheme is that much more performant than JavaScript; it shouldn't matter much.

I think you missed my point. I already have a parser tool that will accept most EBNF specifications, has user actions written in Scheme, and produces S-expressions. I am suggesting that you could use this tool as it is to translate files consisting of (a) EBNF specs for your grammar, and (b) programs in your language to produce a file containing an AST as text in S-expression format which a couple of lines of Javascript could easily translate into JSON or whatever you need. A bit of fiddling to adapt the code to your purposes may be necessary (eg adding INDENT and DEDENT).

Then, once your language is strong enough to bootstrap itself you can throw out this machinery and define the parser in your language.

In other words, you can skip all the parsing and grammar specification pain initially without this preventing you replacing it later. Yes, there's a cost, you will have to commit to a binary program generated by Ocaml (the parser and Scheme interpreter are also Ocaml code).

The reason I suggest considering this is that you will find you change the grammar a LOT when experimenting with language features and its really good to be able to do that without a lot of pain so you can focus on the experiment, rather than modifying the parser. I can do that in Felix, indeed most of the extensions I experiment with require no or only small modifications to the actual compiler.

I am actually doing that at the moment, 20 years down the track, I am writing a parser in Felix to start the process of replacing Ocaml. Good parsers are hard to write. Have a look at the source to Dypgen to see what it takes. You can use a simpler parser, provided you're willing to hand craft the grammar to meet various constraints (such as avoiding left recursion for example).

I am happy to see others contributing.

Well there are other comments I could make regarding certain things but this issue isn't appropriate for them. I'm not sure where is, perhaps the right place can be suggested?

Briefly, I would suggest initial simplifications to get something working. For example, just define integer, float, string, function type, function definition, variable definition, an output procedure, and basic expressions (+, -, *, /) and function application and get it running. No type inference or deduction.
As you develop the language, it's easier to make decisions from a base point.

I personally feel type inference is a bad idea. Another discussion.

I would also suggest things like: don't try to do the desired late binding yet. Keep it in mind. But implement the easier and more constrained early binding first. Late binding is more powerful, and so typing it is harder, handling failures is harder. I think it is hard for people knowing a lot of higher level theory to make decisions in the abstract because there are many ways to do things and the consequences are usually not understood well enough to choose in conjunction with other choices.

If I may give an example: in Felix, I decided row polymorphism might be useful. There are two ways to do it: with constraints that ensure record fields are unique, or without constraints, which allows duplicate fields. I implemented the second kind, because, the existing unification engine would "just work" with natural extensions, whereas to add constraints looked like a complete rewrite. So the choice was the one which was natural in the existing infrastructure, not one decided by arguing about theoretical merits.

keean commented

@shelby3 So I have spent the past couple of days implementing constructors for the AST objects. This has increased the code size by 1/3 for no extra functionality, and I now cannot use JSON.stringify to dump the AST to a file, so I have to write my own pretty-printer for the tree.

I am not convinced it is any better, I need to work with it a bit.

Some opinion from the author of the D programming language...

Walter Bright wrote:

Implementation

How best to implement it? I hope I can at least set you off in the right direction. The first tool that beginning compiler writers often reach for is regex. Regex is just the wrong tool for lexing and parsing. Rob Pike explains why reasonably well. I'll close that with the famous quote from Jamie Zawinski:

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

Somewhat more controversial, I wouldn't bother wasting time with lexer or parser generators and other so-called "compiler compilers." They're a waste of time. Writing a lexer and parser is a tiny percentage of the job of writing a compiler. Using a generator will take up about as much time as writing one by hand, and it will marry you to the generator (which matters when porting the compiler to a new platform). And generators also have the unfortunate reputation of emitting lousy error messages.

@skaller wrote:

certain things but this issue isn't appropriate for them. I'm not sure where is, perhaps the right place can be suggested?
...
I personally feel type inference is a bad idea. Another discussion.

I would suggest contributing to the Principle Typings type inference Issue #21.

I would also suggest things like: don't try to do the desired late binding yet. Keep it in mind. But implement the easier and more constrained early binding first. Late binding is more powerful, and so typing it is harder, handling failures is harder. I think it is hard for people knowing a lot of higher level theory to make decisions in the abstract because there are many ways to do things and the consequences are usually not understood well enough to choose in conjunction with other choices.

We are planning to first simulate it with early binding in TypeScript to test out the "look & feel". The typeclass concept is central to our design. Without, we wouldn't even be creating this language.

If I may give an example: in Felix, I decided row polymorphism might be useful. There are two ways to do it: with constraints that ensure record fields are unique, or without constraints, which allows duplicate fields. I implemented the second kind, because, the existing unification engine would "just work" with natural extensions, whereas to add constraints looked like a complete rewrite. So the choice was the one which was natural in the existing infrastructure, not one decided by arguing about theoretical merits.

For sure many aspects are interdependent. We've sort of gone down that analysis in the Subtype Issue #8 and recently with #21. Perhaps you can place any comments there if you find them helpful and if you feel you can understand our long discussions there. Frankly afaics we have a ways to go yet on unification and especially with the unions #2 concept we are contemplating.

@keean wrote:

So I have spent the past couple of days implementing constructors for the AST objects. This has increased the code size by 1/3 for no extra functionality, and I now cannot use JSON.stringify to dump the AST to a file, so I now have to write my own pretty-printer for the tree.

I am not convinced it is any better, I need to work with it a bit.

I am jealous you get to do coding whilst I expend all my time the past couple of days trying to think, debate, and explain (convince you or better understand your side) about getting the grammar and syntax right.

Okay good let's see what is best. I haven't checked to see if you've posted any of this sample code on the project repository, so I can see what you've got.

Well .. Pike produced Go, which has a really clever process handling core based on CSP, but the rest of the language is a weak mess. So I'm not sure I'd take any notice of Pike. Bright is in fact very smart and his advice is worth noting, however his comments are out of date. Also, it is not true that parsing is only a small part of a language, it tends to be quite a significant part, perhaps unfortunately.

There is a reason for this: hand written parsers are almost impossible to get working when the language gets complex. Lookup is also very hard to get right. But once we have bound terms the reductions to reach the back end language follow the mathematics and are typically quite simple and modular. Also error handling is hardly required because all the checking is done in the front end.

In my view languages with fixed syntax are in fact out of date. It is not enough to introduce new types and new subroutines, a decent language must allow them to be encapsulated in a sub-grammar or the amount of housekeeping becomes overwhelming.

In Felix I call these Domain Specific Sub Languages (DSSLs). It has several obvious ones: there is a facility for making regular definitions using combinators, there is syntax resembling Java objects and interfaces, syntax for defining things called chips to construct specific kinds of coroutines built out of lower level primitives, and of course there are operators for various commonly used constructors and functional operators.

Most of these features are implemented in the parser, the parser plus a support library, or perhaps a small amount of support in the compiler is useful. And what makes this possible is the ability to just write the desired EBNF and then use Scheme action code to translate it to known AST terms.

As mentioned before in Felix I have taken this to the extreme, it is not just some nice extensions that are implemented by adding some grammar to the compiler. The compiler has only a tiny bootstrap grammar, which can only parse one thing: EBNF grammars! The rest of the grammar is in the library, that is, the grammar is in USER SPACE. Its NOT part of the compiler.

So the best way, IMHO, to reduce the problem of implementing a parser, is to eventually bootstrap the parser as I have done because this means a tiny core hard coded parser is all that is required to read the actual language grammar, and the parser can be constructed on the fly, which means the compiler does not itself need to be re-compiled.

I suggest you actually look at the Go parser. If you can understand it you're doing better than me, yet the grammar it implements is trivial. This isn't just because its hand written, its because it is written in Go which, having no variant type, is wholly unsuited for writing a parser (or, indeed, a compiler).

@skaller wrote:

which has a really clever process handling core based on CSP, but the rest of the language is a weak mess.

That was roughly the same appraisal I had. In discussion with @keean in May at the Rust forum, I became aware of Go's concurrency model having a use case, but the rest of the language looks like a mishmash of features, at least from my current perspective and level of insight.

In my view languages with fixed syntax are in fact out of date. It is not enough to introduce new types and new subroutines, a decent language must allow them to be encapsulated in a sub-grammar or the amount of housekeeping becomes overwhelming.

LISP is very powerful this way. @keean and I did discuss this some last May (or perhaps it was my private or public discussions with @Smooth at Bitcointalk.org). I don't think we've put any thought yet into this for the language we are designing.

As mentioned before in Felix I have taken this to the extreme, it is not just some nice extensions that are implemented by adding some grammar to the compiler. The compiler has only a tiny bootstrap grammar, which can only parse one thing: EBNF grammars! The rest of the grammar is in the library, that is, the grammar is in USER SPACE. Its NOT part of the compiler.

Yeah but the problem is readability. How can novices ever read the code, when the grammar and syntax is infinitely variable. That is great for power users who don't need anyone to ever read their code, but it is a tradeoff for open source and mainstream programming.

So are you implying we could implement our language in Felix?

I suggest you actually look at the Go parser. If you can understand it you're doing better than me, yet the grammar it implements is trivial. This isn't just because its hand written, its because it is written in Go which, having no variant type, is wholly unsuited for writing a parser (or, indeed, a compiler).

Lol. 😆

"Yeah but the problem is readability. How can novices ever read the code, when the grammar and syntax is infinitely variable."

The answer is: provided the added grammar is well designed, MUCH more easily!
This is particularly true if the added syntax happens to model existing practice in the target domain.

Examples? OK: regular expression DSSL:

regdef identifier = letter (letter | digit | "_") *;
regdef texid = slosh letter+;

This is just housekeeping; it actually means something like:

var identifier = Seq (letter, Seq ( Alt ( Alt (letter, digit), String "_")));
...

Another example:

interface X {
get: 1 -> int;
set: int -> 0;
}

That's just StupidJavaObjectSyntax DSSL, it means

typedef X = (get: 1->int, set:int -> 0);

where the RHS is a just a record type. And

object f (var x:int) implements X = {
method fun get () => x;
method proc set (y:int) { x = y; }
}

is just syntax for

fun f (var x:int) {
fun get () => x;
proc set (y:int) { x = y; }
return (get=get, set=set);
}

which is an ordinary function returning a record whose fields are nested functions closed over the scope of the function. This is an old trick from CLOS. The syntax helps get the return statement correct by marking the methods "method". But the point is that, the notation "object" makes it appear like you're in Java land dealing with objects and interfaces.

I implemented that purely for fun! And then I found it was in fact really useful! I used it for defining the interfaces and bodies of plugins. It is just syntax but you know you have an object with perfect encapsulation courtesy of functional abstraction, and the methods are closures arranged in a record. Which makes them amenable to row polymorphism.

Here's one I added recently, quite simple but very very useful:

old syntax (1,2,3).list, new syntax [(1,2,3)]. Both are mappings from arrays to lists, but the new syntax seems more pleasing.

Now the thing is, learning syntax is just as easy, or just as hard, as learning the names of functions in a library. But once you learn it the syntax is MUCH easier to read and deploy. Its domain specific, yes, but so is a library. The perfect example of why syntax matters is Scheme. Its unreadable and pretty much impossible to write, BECAUSE it doesn't have enough syntax.

People recognize patterns. They're often called idioms. Common idioms in C such as while (*p) *q++ = *p++ are so well recognized they're used instead of strcpy(). So syntax is very very important to comprehension. Arbitrary extensions .. just as bad as arbitrary library functions.

I have some current syntax for coroutines and channels using a "chip" definition and a "circuit" statement to connect chips. It's mainly housekeeping for something you could already do with a lot more code, but the fact it stylises the generated code according to a pattern mechanically makes it possible to reason about in a way arbitrary long-hand code could not be reasoned about.

BTW: No, I am not implying you could implement your language in Felix. I mean of course you could, but I'm not suggesting it. (You don't want even more dependencies!)

You could use the parser as a stand alone tool (but as mentioned elsewhere you would need Ocaml to compile the tool). Felix may, however, be useful as a test bed for some ideas, its probably easier to implement them in Felix which is already a full scale language, than in your own, which as yet barely exists.

For example if you define your grammars in EBNF, in particular with Felix EBNF, the Felix compiler can just parse any code using that grammar right now. It will barf if GLR+ can't handle it. You can just make the Scheme action codes empty, so it will just test the grammar.

FYI: Felix has no keywords. Identifiers in context only. Scannerless parser.

keean commented

@shelby3 I have refactored the compiler to use object classes for the AST, so the AST now looks like this:

module.exports = (() => {
    "use strict"

    var Expression = {}

    function Literal_Int(v) {this.value = v}
    Literal_Int.__proto__ = Expression
    Literal_Int.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Literal_Int ' + this.value
    }

    function Variable(n) {this.name = n}
    Variable.__proto__ = Expression
    Variable.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Variable ' + this.name
    }

    function Application(n, a) {this.name = n; this.args = a}
    Application.__proto__ = Expression
    Application.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Application ' + this.name + '\n' + this.args.map((x) => x.toString(indent+2)).join('\n')
    }

    function Fn(n, a, e) {this.name = n; this.args = a; this.body = e}
    Fn.__proto__ = Expression
    Fn.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Function "' + this.name + '" (' + this.args.map((x) => x.toString(indent+2)).join(', ') + ')\n' + this.body.toString(indent+2)
    }


    var Statement = {}

    function Declaration(n, e) {this.name = n; this.expression = e}
    Declaration.__proto__ = Statement
    Declaration.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Declaration ' + this.name + '\n' + this.expression.toString(indent+2)
    }

    function Assignment(n, e) {this.name = n; this.expression = e}
    Assignment.__proto__ = Statement
    Assignment.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Assignment ' + this.name + '\n' + this.expression.toString(indent+2)
    }

    function Return(e) {this.expression = e}
    Return.__proto__ = Statement
    Return.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Return\n' + this.expression.toString(indent+2)
    }

    function Block(b) {this.statements = b}
    Block.__proto__ = Expression
    Block.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Block\n' + this.statements.map((x) => x.toString(indent+2)).join('\n')
    }


    return {
        Expression : Expression,
        Statement : Statement,
        Literal_Int : Literal_Int,
        Variable : Variable,
        Application : Application,
        Fn : Fn,
        Declaration : Declaration,
        Assignment : Assignment,
        Return : Return,
        Block : Block
    }

})()

The generator now looks like this:

function gen_exp(ast) {
    if (ast instanceof AST.Literal_Int) {
        return ast.value.toString()
    } else if (ast instanceof AST.Variable) {
        return ast.name
    } else if (ast instanceof AST.Application) {
        return ast.name + '(' + ast.args.map(gen_exp).join(',') + ')'
    } else if (ast instanceof AST.Fn) {
        return gen_fn(ast)
    } else {
        return ''
    }
}

And the parser"

    block = Parsimmon.succeed({}).chain(() => {
        var indent = Indent.get()
        return Parsimmon.seqMap(
            newline.many().then(Indent.relative(Indent.gt).map((i) => Indent.set(i))).then(statement),
            (newline.many().then(Indent.relative(Indent.eq).map((i) => Indent.set(i))).then(statement)).many(),
            (first, blk) => {
                blk.unshift(first)
                Indent.set(indent)
                return new AST.Block(blk)
            }   
        )
    })

However the disadvantage is the AST is no longer just JSON, because the classes of the objects matter.

I haven't committed this version as I am not convinced it is better.

The next steps would be to refactor the generator into AST methods, so we would lose the if ... else structure of the generator, and instead distribute it over the object methods.

So the question is, what is the better structure for a compiler: by function (parser, optimiser, generator), or by AST tree node, where all the methods relating to a node are defined there?

I don't like it. Editing __proto__ will make your compiler considerably slower, for no good reason.

What's the target for your compiler? ES5? ES6? Certain browser versions? I assume you don't plan on supporting Internet Explorer anyway, since you use Function.name (an ES6 feature), which is not supported in any version of IE.

keean commented

@SimonMeskens wrote:

I don't like it. Editing __proto__ will make your compiler considerably slower, for no good reason.

I am using Node v4.4.6, running in the browser is not a priority.

keean commented

@SimonMeskens However I don't really agree with the addition of classes and classical inheritance to JavaScript; it's a prototype-inheritance-based language. As such I want the prototypes for the AST nodes to fall into two categories, Expression and Statement, rather than being Function (which really makes no sense at all).

keean commented

@SimonMeskens I will just get rid of setting the __proto__; it's not used anywhere yet, although I thought it would be useful to be able to use instanceof on AST node groupings as well as the node types themselves.

In general I am finding the inability to easily serialise and deserialise the AST to JSON is a big disadvantage of trying to use object-classes, and I think I was better off using anonymous objects and selecting based on the properties. It just seems to add a lot of boilerplate. Although I will probably complete the refactoring and check it in as a branch for people to compare.

keean commented

Without the 'proto' stuff:

module.exports = (() => {
    "use strict"

    function Literal_Int(v) {this.value = v}
    Literal_Int.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Literal_Int ' + this.value
    }

    function Variable(n) {this.name = n}
    Variable.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Variable ' + this.name
    }

    function Application(n, a) {this.name = n; this.args = a}
    Application.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Application ' + this.name + '\n'
            + this.args.map((x) => x.toString(indent+2)).join('\n')
    }

    function Fn(n, a, e) {this.name = n; this.args = a; this.body = e}
    Fn.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Function "' + this.name + '" ('
            + this.args.map((x) => x.toString(indent+2)).join(', ') + ')\n' + this.body.toString(indent+2)
    }

    function Declaration(n, e) {this.name = n; this.expression = e}
    Declaration.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Declaration ' + this.name + '\n' + this.expression.toString(indent+2)
    }

    function Assignment(n, e) {this.name = n; this.expression = e}
    Assignment.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Assignment ' + this.name + '\n' + this.expression.toString(indent+2)
    }

    function Return(e) {this.expression = e}
    Return.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Return\n' + this.expression.toString(indent+2)
    }

    function Block(b) {this.statements = b}
    Block.prototype.toString = function(indent) {
        return ' '.repeat(indent) + 'Block\n' + this.statements.map((x) => x.toString(indent+2)).join('\n')
    }

    return {
        Literal_Int : Literal_Int,
        Variable : Variable,
        Application : Application,
        Fn : Fn,
        Declaration : Declaration,
        Assignment : Assignment,
        Return : Return,
        Block : Block
    }

})()

Mutating __proto__ is very slow in Node too, since it's based on V8. I was just wondering, because there are newer features of JavaScript that DO let you use instanceof without mutating __proto__. Btw, I really enjoy code that does it; it's just a pity it makes the code orders of magnitude slower.

Do you know of the existence of Symbol.hasInstance?
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Symbol/hasInstance

Unfortunately, it requires a newer version of Node, as you can see here:
nodejs/node#7918
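For illustration (this sketch is an addition, not part of the original exchange, and assumes a Node release new enough to support Symbol.hasInstance), the Expression/Statement groupings could keep working with instanceof without mutating __proto__ like this:

    // Hypothetical sketch: group AST constructors under Expression without touching __proto__.
    // Needs a Node version that implements Symbol.hasInstance (newer than v4.4.6).
    const Expression = {
        members: new Set(),
        [Symbol.hasInstance](x) {
            // `x instanceof Expression` delegates here: true when x was built by a registered constructor
            return x != null && this.members.has(x.constructor)
        }
    }

    function Literal_Int(v) { this.value = v }
    Expression.members.add(Literal_Int)

    const node = new Literal_Int(1)
    console.log(node instanceof Literal_Int)   // true, as usual
    console.log(node instanceof Expression)    // true, via Symbol.hasInstance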

@keean have you shown all the code for this parser mockup? I don't see how the AST is being built by that parser. Where is the parser consuming tokens and constructing instances of the AST?

Also perhaps toString should be a typeclass, not bound to the data type. Thinking about how we would bootstrap the compiler may provide insights for the design of our language.

@keean JSHint is warning me "bad line breaking" for:

        return ' '.repeat(indent) + 'Application ' + this.name + '\n'
            + this.args.map((x) => x.toString(indent+2)).join('\n')

And:

        return ' '.repeat(indent) + 'Function "' + this.name + '" ('
            + this.args.map((x) => x.toString(indent+2)).join(', ') + ')\n' + this.body.toString(indent+2)

Put the + on the prior line, because otherwise it is not clear if it is unary or infix (although the parser should have been able to figure it out in the case of dead code following return).

So apparently JSHint is supporting ASI correctly contrary to Douglas Crockford's crock-of-sheep.
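For reference (an added illustration, not part of the original exchange): a leading + does not actually trigger ASI here; the return expression continues across the line break, so JSHint's warning is about readability rather than a change in behaviour:

    // Both functions return 'ab'; the leading '+' continues the expression across the break.
    function infixStyle() {
        return 'a' +
            'b'
    }
    function leadingOperatorStyle() {
        return 'a'
            + 'b'   // JSHint flags this line break as potentially misleading
    }
    console.log(infixStyle(), leadingOperatorStyle())   // 'ab' 'ab'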

@keean use + Variable.name + ' ' instead of + 'Variable ', so copy+paste doesn't bite our a$$.

I will teach/suggest some new tricks. More to come...

Can we please compromise with 3 spaces indent as I requested upthread? I normally use 2 and @keean normally 4, so let's compromise in the middle. I changed my editor to use 3 by default now.

Btw, I am happy we are getting on with coding now! I want to code! So much talking.

Okay here is how I would change it to use a typeclass structure instead of OOP. I haven't worked out yet how to integrate this with Parsimmon or how I might otherwise choose to integrate with a lexer. I haven't tested this code, just wrote it and copied it here. Note this is employing my module system and loader, which we can discuss/change in future discussion. I believe code should be formatted to be beautiful. Sloppy programmers produce sloppy quality IMO. Readability is paramount. Notice the trailing comma on comma-delimited lists, which is intentional (and is also supported in the grammar I created).

I think this separation-of-concerns is much easier to read and maintain, especially later on bootstrap and our compiler is checking to make sure we didn't fail to provide an implementation of a typeclass.

I am perhaps interested to port this to TypeScript or N4JS, so we can have some type checking in our implementation.

AbstractSyntaxTree.js

/* Configure JSHint warnings, http://jshint.com/docs/options/, http://stackoverflow.com/questions/17535130/where-can-i-find-a-list-of-jshint-numeric-error-codes#17541721 */// jshint asi:true, devel:true, eqeqeq:true, esnext:true, strict:true, undef:true, -W100
(function() {
   'use strict'

   return Object.freeze({
      LiteralInt (v)       {this.value      = v                                    },
      Variable   (n)       {this.name       = n                                    },
      Application(n, a)    {this.name       = n; this.args       = a               },
      Fn         (n, a, e) {this.name       = n; this.args       = a; this.body = e},
      Declaration(n, e)    {this.name       = n; this.expression = e               },
      Assignment (n, e)    {this.name       = n; this.expression = e               },
      Return     (e)       {this.expression = e                                    },
      Block      (b)       {this.statements = b                                    },
   })
})()

I am not sure the above syntax produces named functions suitable for use with new?
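For reference (an added note, not part of the original exchange): ES2015 shorthand methods do get a name, but they have no [[Construct]], so new throws; named function expressions inside the literal do work with new:

    // Shorthand method: named, but not a constructor.
    const Shorthand = { LiteralInt(v) { this.value = v } }
    // new Shorthand.LiteralInt(1)   // TypeError: Shorthand.LiteralInt is not a constructor

    // Named function expressions in the literal are constructible and keep the name.
    const Constructible = Object.freeze({
        LiteralInt: function LiteralInt(v) { this.value = v },
    })
    const node = new Constructible.LiteralInt(1)
    console.log(node.constructor.name, node.value)   // 'LiteralInt' 1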

ToStringDict.js

/* Configure JSHint warnings, http://jshint.com/docs/options/, http://stackoverflow.com/questions/17535130/where-can-i-find-a-list-of-jshint-numeric-error-codes#17541721 *//* globals asyncify, include */// jshint asi:true, devel:true, eqeqeq:true, esnext:true, strict:true, undef:true, -W100
(asyncify(function*() {
   'use strict'

   const dict = Object.freeze({
      LiteralInt  : LiteralInt,
      Variable    : Variable,
      Application : Application,
      Fn          : Fn,
      Declaration : Declaration,
      Assignment  : Assignment,
      Return      : Return,
      Block       : Block,
   })

   const toString = yield include(':type-fu/ToStringBind')

   function LiteralInt (indent, x) {
      return ' '.repeat(indent) + LiteralInt .name +  ' ' + x.value
   }

   function Variable   (indent, x) {
      return ' '.repeat(indent) + Variable   .name +  ' ' + x.name
   }

   function Application(indent, x) {
      return ' '.repeat(indent) + Application.name +  ' ' + x.name + '\n' +
         x.args.map((x) => toString(dict, x)(indent+2, x)).join('\n')
   }

   function Fn         (indent, x) {
      return ' '.repeat(indent) + Fn         .name + ' "' + x.name + '" (' +
         x.args.map((x) => toString(dict, x)(indent+2, x)).join(', ') + ')\n' + toString(dict, x.body)(indent+2, x.body)
   }

   function Declaration(indent, x) {
      return ' '.repeat(indent) + Declaration.name +  ' ' + x.name + '\n' + toString(dict, x.expression)(indent+2, x.expression)
   }

   function Assignment (indent, x) {
      return ' '.repeat(indent) + Assignment .name +  ' ' + x.name + '\n' + toString(dict, x.expression)(indent+2, x.expression)
   }

   function Return     (indent, x) {
      return ' '.repeat(indent) + Return     .name + '\n' + toString(dict, x.expression)(indent+2, x.expression)
   }

   function Block      (indent, x) {
      return ' '.repeat(indent) + Block      .name + '\n' + x.statements.map((x) => toString(dict, x)(indent+2, x)).join('\n')
   }

   return dict
}))()

ToStringBind.js

/* Configure JSHint warnings, http://jshint.com/docs/options/, http://stackoverflow.com/questions/17535130/where-can-i-find-a-list-of-jshint-numeric-error-codes#17541721 *//* exported ToStringBind */// jshint asi:true, devel:true, eqeqeq:true, esnext:true, strict:true, undef:true, -W100
function ToStringBind(dict, x) {
   'use strict'

   switch(x.constructor.name) {   // http://stackoverflow.com/questions/1249531/how-to-get-a-javascript-objects-class
      case dict.LiteralInt .name: return dict.LiteralInt
      case dict.Variable   .name: return dict.Variable
      case dict.Application.name: return dict.Application
      case dict.Fn         .name: return dict.Fn
      case dict.Declaration.name: return dict.Declaration
      case dict.Assignment .name: return dict.Assignment
      case dict.Return     .name: return dict.Return
      case dict.Block      .name: return dict.Block
   }
}

Note in our future bootstrap, the transpiler can replace dict.LiteralInt.name with the literal strings to make it faster, or even we can devise some integer tagging (even perhaps a 64-bit cryptographic hash) scheme if we want more performance. I have it the slower way, because it has less potential for typos and copy+paste errors. We could actually match on the function references if we wanted to instead, but then it would not work across DOM boundaries correctly (nor with multiple copies of modules loaded, although our module system should prevent that if all code in our universe uses the same module loader).

Note I also could have done instead:

return dict[x.constructor.name]

But that assumes the implementation instance name is the same as the data type instance name, which generally won't be true (unless @keean figures out how to enforce universal coherence in open systems).

keean commented

@keean

use + Variable.name + ' ' instead of + 'Variable ', so copy+paste doesn't bite our a$$.

I just folded the lines when pasting into the comment. I normally have a wide screen so they are not actually folded in the source.

keean commented

@shelby3 wrote:

return Object.freeze({

I had not planned on freezing it as intermediate steps involve tree rewriting. It will create a lot more work if we have to copy the tree each time.

You can re-use the freeze to check the tree after each set of transformations is applied, in compiler debug mode, to catch mistakes in the tree rewriting.

@keean wrote:

I just folded the lines when pasting into the comment. I normally have a wide screen so they are not actually folded in the source.

Okay, but fyi only, you quoted me incorrectly.

I had not planned on freezing it as intermediate steps involve tree rewriting.

Object.freeze isn't deep. I only froze the dictionary of data types that can be members of a tree.
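A small illustration of that point (added here for reference): the frozen dictionary still hands out mutable AST nodes, so tree rewriting is unaffected:

    'use strict'
    // Object.freeze is shallow: it freezes the dictionary of constructors, not the nodes they build.
    const AST = Object.freeze({
        Variable: function Variable(n) { this.name = n },
    })

    const v = new AST.Variable('x')
    v.name = 'y'                 // fine: the node itself is not frozen, so tree rewriting still works
    console.log(v.name)          // 'y'

    try { AST.Variable = null } catch (e) {
        console.log('the dictionary itself is frozen:', e instanceof TypeError)   // true in strict mode
    }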

keean commented

@shelby3 okay I have checked in the new version. After playing with explicitly printing the AST, I think dumping to JSON is better; you can read and process it in other tools (to plot the AST graphically etc). So I have removed the pretty-printing, and changed the objects slightly so the AST could be reconstructed from the JSON.

None of the unit tests currently pass as the JSON needs to be updated, but it will compile from .zs to .js the same limited subset (with out of date grammar) that it did before.

@keean am I correct to presume that you did not maintain the typeclass structure (instead used the prototype) because it would be incompatible with Parsimmon?

keean commented

@shelby3 wrote:

am I correct to presume that you did not maintain the typeclass structure (instead used the prototype) because it would be incompatible with Parsimmon?

There is no compatibility problem as far as I know. I wanted to use new X because new is the fastest way to build objects. Keeping the methods on the prototype avoids having to copy the methods to the new object every time you construct a new AST node.

@keean the code I provided should work with new, which sets x.constructor, which you can see my code was relying on. My code sample didn't need to attach the functions to the instance via prototype.

Keeping the methods on the prototype avoids having to copy the methods to the new object every time you construct a new AST node.

So what you are saying then is Parsimmon is incompatible with the typeclass way as per my code sample. You shouldn't need to copy anything if Parsimmon was congruent.

keean commented

So what you are saying then is Parsimmon is incompatible with the typeclass way as per my code sample. You shouldn't need to copy anything if Parsimmon was congruent.

No, that's not what I am saying.

Can we take this slowly because I am unfamiliar with what you are trying to achieve. Why would I want to look up the constructor by name, when I can just lookup an appropriately named property on an anonymous object?

In this code:

module.exports = (() => {
    "use strict"

    function Literal_Int(v) {
        this.type = 'literal_int'
        this.value = v
    }

    function Variable(n) {
        this.type = 'variable'
        this.name = n
    }

    function Application(n, a) {
        this.type = 'application'
        this.name = n
        this.args = a
    }

    function Fn(n, a, e) {
        this.type = 'function'
        this.name = n
        this.args = a
        this.body = e
    }

    function Declaration(n, e) {
        this.type = 'declaration'
        this.name = n
        this.expression = e
    }

    function Assignment(n, e) {
        this.type = 'assignment'
        this.name = n
        this.expression = e
    }

    function Return(e) {
        this.type = 'return'
        this.expression = e
    }

    function Block(b) {
        this.type = 'block'
        this.statements = b
    }

    return {
        Literal_Int : Literal_Int,
        Variable : Variable,
        Application : Application,
        Fn : Fn,
        Declaration : Declaration,
        Assignment : Assignment,
        Return : Return,
        Block : Block
    }

})()

The thing returned is the dictionary. I can import this module and then directly call the constructor I want with something like new AST.Block([]).

@keean yes you can. And my code also expected you to do that. I am referring to the changes you made from my code for your generator.js file, specifically using the prototype.

keean commented

@shelby3 You mean this code:

module.exports = (() => {
    "use strict"

var AST = require('../src/ast.js')

AST.Literal_Int.prototype.generate = function() {
    return this.value.toString()
}

AST.Variable.prototype.generate = function() {
    return this.name
}

AST.Application.prototype.generate = function() {
    return this.name + '(' + this.args.map((x) => x.generate()).join(',') + ')'
}

AST.Fn.prototype.generate = function() {
    return 'function ' + this.name + '(' + this.args.join(',') + '){' + this.body.generate() + '}'
}

AST.Declaration.prototype.generate = function() {
    return 'var ' + this.name + '=' + this.expression.generate() + ';'
}

AST.Assignment.prototype.generate = function() {
    return this.name + '=' + this.expression.generate() + ';'
}

AST.Return.prototype.generate = function() {
    return 'return ' + this.expression.generate() + ';'
}

AST.Block.prototype.generate = function() {
    return this.statements.map((x) => x.generate()).join('')
}

})()

It's pretty concise; in fact it's shorter than the if/then/else style one it replaced.

I am not sure which code you are referring to that would replace this? If we have the AST node, and that node is an object, why not dispatch based on the prototype? It's going to be efficient in JavaScript.

keean commented

@shelby3 In the new language, there would be a Generate typeclass, which we would implement for each AST node type. One way to implement this would be to attach the type class instance functions to the prototypes of each of the types. This of course only works for single dispatch.

@keean wrote:

One way to implement this would be to attach the type class instance functions to the prototypes of each of the types. This of course only works for single dispatch.

I presume that is only the way we are implementing it for our hack to transpile to TypeScript (which was my suggestion). Once we have a type checker, we can implement dictionaries per my code sample. By not explicitly calling through ToStringBind as I did in my code, you have nothing in your code corresponding to a typeclass bound, and thus you may make mistakes relying on OOP which will be difficult to refactor out later when we bootstrap.

Not Generate, rather ToJSON. Please name things descriptively.

keean commented

@shelby3 I got rid of toJSON and am using the default JSON.stringify; each AST type has a type property that can be used for reconstruction. This seemed the neatest way of doing it in the end.
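A hedged sketch (not the repo's actual code) of what reconstruction from that type property could look like, using a JSON.parse reviver:

    // Rebuild type-tagged AST nodes from JSON by restoring the prototype named by `type`.
    function Variable(n) { this.type = 'variable'; this.name = n }

    const constructorsByTag = {
        variable: Variable,
        // ... one entry per AST node type
    }

    function revive(key, value) {
        const ctor = value && typeof value === 'object' && constructorsByTag[value.type]
        return ctor ? Object.assign(Object.create(ctor.prototype), value) : value
    }

    const roundTripped = JSON.parse(JSON.stringify(new Variable('x')), revive)
    console.log(roundTripped instanceof Variable)   // true: prototype restored from the type tag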

generate is the code-generator that outputs the final JavaScript / TypeScript. It is simple because we will have completed the transpiling by this point, and the AST tree will represent the JavaScript/TypeScript program directly, so we just need to trivially construct the syntax.

keean commented

@shelby3 And I am suggesting that single parameter type classes get transpiled to prototype functions, as it will result in the fastest target code, short of monomorphisation (and we won't be able to fully monomorphise where we have union types).

@keean afaics attaching implementations to instances of data types breaks the case where different modules want to use different implementations of typeclasses, yet interoperate on data. Regardless, how we transpile it is irrelevant to the point I made about how we will port it to Type-fu/ZenScript code:

By not explicitly calling through ToStringBind as I did in my code, you have nothing in your code corresponding to a typeclass bound, and thus you may make mistakes relying on OOP which will be difficult to refactor out later when we bootstrap.

Any way, I am learning Parsimmon now (but having trouble reading at the moment and will probably need to sleep early), so I can decide which direction I am going on the code. I'll probably soon find it more efficient to code in my own repository than talk and argue. However, I am not sure if I can carry it to fruition because of my health condition. But I am working as hard as a human being possibly could to try to rectify this disability. I will continue to contribute here, and simply make suggestions once and leave it at that without further wranglings.

Basically, if I understand it correctly thus far, parser combinators consume input and build the output, both of which form the state in the monad. And they are composable, so we can build larger parsers out of smaller ones. But in essence these are still recursive descent parsers and thus inherit the characteristics of that approach, such as needing backtracking instead of token lookahead with a table lookup.

I think it might be better to have a table lookup based on the actual LL(2) grammar specification file. The table can be automatically generated from the FIRST and FOLLOW sets. And then the table lookup can select which mini-parser to invoke for each production selected from the table. I will need to think about how to construct this in a composable way that is competitive with or better than Parsimmon.

The great benefit is when we change the grammar, the changes will be automated into code. And we won't lose any flexibility in terms of functionality because we have our own generator employing our own coding preferences for the AST.
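A minimal sketch of that idea (hypothetical code, not the project's parser): the lookahead token indexes a prediction table that selects which mini-parser to invoke; the table itself would be generated mechanically from the grammar's FIRST/FOLLOW sets rather than written by hand as it is here:

    // Tiny table-driven dispatch for an assumed grammar fragment:
    //   statement -> 'let' name | 'return' name
    function tokenStream(tokens) {
        let i = 0
        return { peek: () => tokens[i], next: () => tokens[i++] }
    }

    // Hand-written "mini-parsers", one per production.
    const parseDeclaration = (ts) => { ts.next(); return { type: 'declaration', name: ts.next() } }
    const parseReturn      = (ts) => { ts.next(); return { type: 'return', name: ts.next() } }

    // Prediction table: lookahead token -> mini-parser (derivable from FIRST sets).
    const statementTable = { let: parseDeclaration, return: parseReturn }

    function parseStatement(ts) {
        const parse = statementTable[ts.peek()]
        if (!parse) throw new Error('unexpected token: ' + ts.peek())
        return parse(ts)
    }

    console.log(parseStatement(tokenStream(['let', 'x'])))      // { type: 'declaration', name: 'x' }
    console.log(parseStatement(tokenStream(['return', 'y'])))   // { type: 'return', name: 'y' }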

I'll need to get some sleep and think about this more refreshed.

keean commented

@shelby3 Parser Combinators are related to Parsing Expression Grammars: https://en.wikipedia.org/wiki/Parsing_expression_grammar which are more powerful than LL(k) or LR(k) grammars, but can be parsed in linear time by a packrat parser.

If we agreed to translate our grammar from EBNF to PEG we could use the PEG.js: http://pegjs.org/ parser generator.

Unfortunately we cannot use SLK because it is not open-source and only provides a free license for testing purposes.

  • Maybe we can find a nice open-source parser-generator for JavaScript that will generate a lexer and parser from the EBNF; this is one option.
  • Alternatively we can translate the grammar into PEG and use PEG.js (see the sketch after this list).
  • Finally we can use Parser Combinators.
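To make the PEG.js option concrete, here is a minimal sketch (the grammar fragment is purely illustrative, not our actual grammar, and it assumes a PEG.js release that exposes peg.generate; older releases used PEG.buildParser):

    // Generate a parser from a PEG grammar at runtime with PEG.js.
    const peg = require('pegjs')

    const parser = peg.generate(`
        start       = declaration
        declaration = "let" _ name:identifier _ "=" _ value:integer
                      { return { type: 'declaration', name: name, value: value } }
        identifier  = chars:[a-zA-Z_]+ { return chars.join('') }
        integer     = digits:[0-9]+    { return parseInt(digits.join(''), 10) }
        _           = " "*
    `)

    console.log(parser.parse('let x = 42'))   // { type: 'declaration', name: 'x', value: 42 }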

I think we should avoid hand-writing our own lexer and recursive descent parser, as this will be a lot of work and difficult to maintain, and you want fast results :-)