h0tk3y/better-parse

Discussion About Separation Between Tokens and Parsers

BenjaminHolland opened this issue · 2 comments

I've tried to use parser combinator libraries across multiple languages, and I've never seen the kind of hard distinction between tokens and parsers that this library has. Perhaps I wasn't paying attention (this library is actually the one I've been least frustrated with), but it's an interesting choice that comes with a set of advantages and disadvantages. I'm interested in why this decision was made. I'd also like to get feedback on my own understanding of the concepts. This might help you write good docs, or I'd be willing to write them and open a PR if my understanding is good enough. Feel free to close and ignore if neither of these discussions interests you.

The advantage is that it's very, very clear (after a bit of conceptual learning) what each piece of a grammar is for. Tokens are specifically about character sequence recognition, while parsers are about token sequence recognition and mapping. Once you get the distinction, it's easy to write grammars.
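To make the distinction concrete, here's a minimal sketch in the style of the README (names like `Sum` are mine, and I'm going from memory on the exact imports, so treat it as illustrative):

    import com.github.h0tk3y.betterParse.combinators.separatedTerms
    import com.github.h0tk3y.betterParse.combinators.use
    import com.github.h0tk3y.betterParse.grammar.Grammar
    import com.github.h0tk3y.betterParse.grammar.parseToEnd
    import com.github.h0tk3y.betterParse.lexer.literalToken
    import com.github.h0tk3y.betterParse.lexer.regexToken
    import com.github.h0tk3y.betterParse.parser.Parser

    object Sum : Grammar<Int>() {
        // Token layer: character-sequence recognition only.
        val num by regexToken("\\d+")
        val plus by literalToken("+")
        val ws by regexToken("\\s+", ignore = true)

        // Parser layer: token-sequence recognition and mapping.
        val term by num use { text.toInt() }
        override val rootParser: Parser<Int> by
            separatedTerms(term, plus) use { sum() }
    }

    fun main() {
        println(Sum.parseToEnd("1 + 2 + 3")) // 6
    }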

I see two main disadvantages.

  1. Being forced to declare tokens separately from parsers feels redundant. Consider `val id by regexToken("\\w+") use { text }`. This creates both a token and a parser, but only registers the parser, so the token never reaches the grammar's tokenizer. The workaround of `val idToken by regexToken(...)` followed by `val idParser by idToken use { text }` works, but feels very clunky (see the first sketch after this list).

  2. It's not easy to combine grammars, or to reuse grammars as parsers inside a parent grammar, precisely because tokens are separate entities. Consider two grammars A and B. If I want a third grammar C that expresses "A or B", simply doing the obvious thing and setting C's `rootParser` to that expression is insufficient, because C doesn't have the tokens defined in A and B, and in fact has no tokens at all. This problem gets worse with more grammars and deeper nesting, and it's not clear from your docs how such a merge operation should work (see the second sketch below).
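To make (1) concrete, here's a quick sketch; the commented-out line is the tempting one-liner that silently breaks tokenization:

    import com.github.h0tk3y.betterParse.combinators.use
    import com.github.h0tk3y.betterParse.grammar.Grammar
    import com.github.h0tk3y.betterParse.lexer.regexToken
    import com.github.h0tk3y.betterParse.parser.Parser

    object Ids : Grammar<String>() {
        // One-liner: the delegated property holds the *parser*, so the
        // underlying token never gets registered with the grammar.
        // val id by regexToken("\\w+") use { text }

        // The working, but clunky, two-step declaration:
        val idToken by regexToken("\\w+")
        val idParser by idToken use { text }

        override val rootParser: Parser<String> by idParser
    }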
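And for (2), here's what the naive composition looks like, plus the only workaround I can see: merging the token lists by hand. This assumes Grammar's `tokens` list is public and its `tokenizer` is overridable, which is my reading of the sources:

    import com.github.h0tk3y.betterParse.combinators.or
    import com.github.h0tk3y.betterParse.combinators.use
    import com.github.h0tk3y.betterParse.grammar.Grammar
    import com.github.h0tk3y.betterParse.lexer.DefaultTokenizer
    import com.github.h0tk3y.betterParse.lexer.regexToken

    object A : Grammar<String>() {
        val num by regexToken("\\d+")
        override val rootParser by num use { "number: $text" }
    }

    object B : Grammar<String>() {
        val id by regexToken("[a-zA-Z]+")
        override val rootParser by id use { "id: $text" }
    }

    object C : Grammar<String>() {
        // The obvious thing: C declares no tokens of its own, so its
        // default tokenizer is empty and parsing fails before this
        // parser ever runs.
        override val rootParser by A.rootParser or B.rootParser

        // Hand-merged tokenizer as a workaround:
        override val tokenizer by lazy { DefaultTokenizer(A.tokens + B.tokens) }
    }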

Are these assessments fair? Am I missing something?
Thanks.

Thanks for bringing this up. I was stuck on this for a good half hour because my token declarations were anonymous, and this issue revealed my problem. So, while I agree with Benjamin's points, I'd also like to raise the related issue that it wasn't clear to me from the documentation that this is required.

I've found this frequently vexing in my attempts to structure my code better.

At present I find myself resorting to this kind of thing:

    // The tokens have to live somewhere so each one is created only once:
    val map = mutableMapOf<Pair<String, Boolean>, Token>()

    fun cacheLiteral(text: String, name: String) =
        map.computeIfAbsent(name to false) { literalToken(name, text) }
    fun cacheRegex(regex: Regex, name: String) =
        map.computeIfAbsent(name to true) { regexToken(name, regex) }

This feels weird, but... well, I had to store the tokens in something to pass them to DefaultTokenizer anyway, so...
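For completeness, the last wiring step is then just (assuming the same `map` as above):

    // Build the tokenizer from whatever tokens have been cached.
    val tokenizer = DefaultTokenizer(map.values.toList())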