typelevel/cats-parse

`p1.? ~ p2` != `(p1 ~ p2) | p2` for error reporting purposes

carueda opened this issue · 9 comments

Hello there!

Assume these base parsers:

val p1 = P.char('a')
val p2 = P.char('z')

Consider now these:

val p = p1.? ~ p2
val q = (p1 ~ p2) | p2

(that is, both accept the same input space: { "az", "z" }).

With an empty input string, p and q generate these errors:

p.parse("") ==> Left(
          Error(
            0,
            NonEmptyList(
              InRange(0, 'z', 'z'), 
              List()
            )
          )
        )

q.parse("") ==> Left(
          Error(
            0,
            NonEmptyList(
              InRange(0, 'a', 'a'),
              List(InRange(0, 'z', 'z'))
            )
          )

For error reporting purposes I prefer the latter response as it indicates that not only 'z' but also 'a' can appear at the point of failure.

So, for alternative parser cases like the above, I could just write them à la q to accomplish this "enriched" error reporting.

However, do I have to? It seems unnecessary and I would prefer to just write them in the p style example. Am I missing something? Thanks!

Thanks for the issue. So, you are raising a really good point.

your first parser p is really the same as (p1.map(Some(_)) | pure(None)) ~ p2. So you want some distributive law: (a | b) ~ c == (a ~ c) | (b ~ c)

We have a version of that law here:

property("b.orElse(c) ~ a == (b ~ a).orElse((!b) *> (c ~ a))") {

Which is a bit more complex, but it is only testing the results, not that the error messages are identical.

Maybe the issue in this case is because p1.? can always succeed: so from the parser's point of view, it successfully parsed p1.? by returning None. I can imagine something like an optimization rule such that if you have a union that ends with something that always succeeds, then push the product down:

def product[A, B](left: Product0[A], right: Product0[B]): Product0[(A, B)] =
  (left, right) match {
    case OneOf0(items), right if alwaysSucceeds(items.last) =>
      OneOf0(items.map { li => product(li, right) })
    case _ =>
      // do the regular thing.
  }

but this isn't quite right for the same reason we don't have the law (a | b) ~ c == (a ~ c) | (b ~ c) in general.

I think there is even a deeper question here: what does the error reporting tell you? Does it tell you all next inputs that would have succeeded, or does it tell you some next inputs that would have succeeded. Currently it is definitely the latter and not the former.

It would certainly be nice to be able to prove that we could report all next inputs that would have made some progress. I really haven't thought about it before this issue, so I don't know if that turns out to be a hard problem, or if it is tractable. I think it is actually deeply related to Parser0 here. The idea that p1.? returns success without advancing is the issue. Maybe Parser0 needs to return a List of possible errors in addition to a result in the case that it does not advance, so a subsequent failure could regenerate the full list? Maybe if we did that, the property of producing all the next inputs (all next strings that would have allowed making some progress from the current offset), would be tractable...

I wish I had a short answer for you. I guess the only general answer I have (which I'm not even sure is right) is to really avoid Parser0 unless you really need it since it can currently hide errors as you have discovered.

Thanks for the useful reply.

Well, "precise errors" is central to my specific use case (a mission scripting language for commanding autonomous underwater vehicles), where hints for autocompletion play a key role via some LSP-enabled editor like VSCode. In other words, a typical use case where interactive editing and good feedback is very important.

That code is not yet open source, unfortunately, but I have a few cases where the situation described here is a bit compounded, that is, with more optional pieces in order (p1.? ~ p2.? ~ p3.? ~ p4.? ~ p5.? ~ pz).

I'm very happy with cats-parse in general (thanks for all your efforts there!) and will definitely continue my migration from fastparse (along with other updates toward Scala 3), and am willing to incur the "cost" of the rewrites to avoid Parser0, at least for the time being. I think I have gotten a good handle on using cats-parse , but not in terms of its internals to try to provide any specific insights about the potential impact of providing similar error reporting for the preferable form of such kind of parsers. My quick intuition is that it should be tractable. (In fact, the non-Parse0 style is already handled with good errors, so, it's tractable already, right?) I feel like the Parser0-to-non-Parser0 rewrites could actually somehow be performed by cats-parse itself, perhaps only when instructed to do so via some flag or some variation to the operators. Just some thoughts.

Thanks again!

The following extension is proving useful for me:

extension [A](p: P[A]) {
  def ?:[B](that: P[B]): P[(Option[A], B)] = {
    val withThis = (p ~ that) map { (a, b) => (Some(a), b) }
    val onlyThat = that map { (None, _) }

    withThis | onlyThat
  }
}

so I can write

p1 ?: p2 ?: p3 ?: p4 ?: p5 ?: pz

That seems pretty useful... I wonder if we should add it do Parser[A].

Yes, pretty useful indeed!

Though the external solution (like the extended operator above) may be good enough for some users, probably worth adding it to the library itself, especially to make sure such enriched error reporting is explicitly captured in relevant tests, and well, assuming the current Parser0-based behavior in terms of "limited" error reporting will stay (at least for while).

I actually wonder if we could possibly remove Parser0 entirely if we added a new type like StateReader[A] which can read Caret, Index have a Monad and Defer. Then you could compose a StateReader with a Parser to get a Parser but always consuming.

Maybe at the end of the day it isn't very different from the current design (with Parser0 basically being StateReader | Parser) but my proposal above puts a lot of migration pain on folks.

Not sure I can provide any technical insight now, but I'd agree, alway good to avoid/minimize breaking changes (though we know the library is still evolving, not yet v1, ...)

On the other hand, re use cases again, I understand cats-parse wants to be competitive among its class in terms of parsing performance. I guess the broad use case category here is the parsing of input that is, in general, expected to be well-formed already (e.g., RPC, serialization, machine-to-machine communication, and the like). Rich error reporting, the other broad use case (which had me entering this issue) may compromise that performance goal at least to some extent. Back to the scenario in this issue, I can imagine p1.? ~ p2.? ~ p3.? ~ p4.? ~ p5.? ~ pz still being the way to express "ordered sequence of optional parsers ending in a required one," and with the most efficient (perhaps, not-so-rich error reporting) parsing as the default behavior; but with some opt-in mechanism to instruct cats-parse to enable rich error reporting, say, via some import cats.parse.richerrors.* (or something like that), but not having to change the parsers themselves.

Consider this case:

val p1 = Parser.length(1)
val p2 = Parser.length(2)

val p3 = p1.?.soft ~ p2
val p4 = (p1.soft ~ p2) | p2

p3.parse("ab") // this should fail because p1.? parses 1 character, then p2 cannot succeed
p4.parse("ab") // this succeeds because p1.soft ~ p2 fails but backtracks, then p2 succeeds

so, this kind of operation definitely isn't lawful with soft products.

closed by #389