gallais/idris-tparsec

Error reporting

clayrat opened this issue Β· 33 comments

This currently seems to be the biggest feature missing from TParsec. Do you have any suggestions on how to implement this better? Maybe by changing the Success structure to be a variant type Result = Error | Success?

I gave some ideas in gallais/agdarsec#8 but never got around to trying them out.

I've started the port of parameters branch.

With hindsight, I would tackle the problem in two steps:

  • Move to Parser : Parameters -> Type first (with just the usual tok, toks, m)
  • Then extend the Parameters record type to add instrumentation

The introduction of Parameters already raises some issues: I ended up using
Subset Char P.Tok in Text.Parser.Char for instance because that made instance
resolution easier.

Yes, that's the conclusion I've also reached after I started porting the examples :) In Idris the problem is not only with widening combinators and introducing Subsets, but also with types and implicits for running and constructing concrete parsers, e.g., for now I could only make everything work with code like this:

parseMaybe : (str : String) -> 
             (MonadRun mn, Instrumented p mn, Tokenizer (Tok p), 
              Subset (SizedList (Tok p) (length $ tokenize {tok = Tok p} str)) (Toks p n)) =>
             (All (Parser p mn a)) -> Maybe a
parseMaybe @{str} @{mr} @{is} @{to} @{su} par =
  let 
    input = MkSizedList $ tokenize @{to} str 
    result = runParser par lteRefl (into @{su} input)
    valid  = \s => toMaybe (Size s == Z) (Value s)
    in
  traverse valid (runMonad result) >>= head'
PAR' : All (Parser' ())
PAR' = fix (Parser' ()) $ \rec => 
        alts [ cmap () ((exact {p=Params} {mn=Maybe} LPAR `andm` rec) `land`
                        (exact {p=Params} {mn=Maybe} RPAR `andm` rec))
             , cmap () ((exact {p=Params} {mn=Maybe} LCUR `andm` rec) `land`
                        (exact {p=Params} {mn=Maybe} RCUR `andm` rec))
             , cmap () ((exact {p=Params} {mn=Maybe} LSQU `andm` rec) `land`
                        (exact {p=Params} {mn=Maybe} RSQU `andm` rec))
             ]

I'm actually inclined to try moving tok/toks back to the Parser type and leaving only the annotation in Parameters, this might help a lot with the resolution.

This is strange: in PAR' all the ps and mn should be solved by unification! :/

As for the runner, I think you could legitimately demand that the user provides a
function of type: (ts : List Tok) -> Toks (lenght ts).

Or you could introduce an ISubset:

record ISubset (S T : i -> Type) where
  iinto : {j : i} -> S j -> T j

and make the constraint more general: ISubset (SizedList (Tok p)) (Toks p).
This way you don't have to tokenize the string in the type which is... scary.

This is strange: in PAR' all the ps and mn should be solved by unification! :/

Ok, turns out mn is redundant indeed, but you still need to match on p. The problem is, AFAIU, that Idris doesn't expand the definition of records when running unification, thus Tok p doesn't evaluate to PAR because it doesn't want to look at what p is.

Unification not valid modulo beta-conversion? This sounds like a bug.

I have to go run some errands but I'll try to have a look later tonight.

The funny thing is that if you do something like this

PAR' : All (Parser' ())
PAR' = fix (Parser' ()) $ \rec => cmap () $ exact ?wat

inspecting the hole gives you everything you want:

  j : Nat
  j1 : Nat
  rec : Box (Parser (MkParameters PAR (SizedType (List PAR)) () Void) Maybe ()) j1
  mn : Type -> Type
  p : Parameters
  constraint : Functor Maybe
  constraint1 : Instrumented (MkParameters PAR (SizedType (List PAR)) () Void) Maybe
  j2 : Nat
  constraint2 : Alternative Maybe
  constraint3 : Monad Maybe
  constraint4 : Instrumented (MkParameters PAR (SizedType (List PAR)) () Void) Maybe
  constraint5 : Inspect (SizedType (List PAR)) PAR
  constraint6 : Eq PAR
--------------------------------------
wat : PAR

but as soon as you plug it with a constructor, you get

        Type mismatch between
                PAR
        and
                Tok p

I've pushed the current state of affairs to parameters branch if you'd like to take a look yourself.

Ok, I think I may understand what the issue is. You can get a similar error in
the REPL if you call exact LPAR with :t:

*Examples/Parentheses> :t exact LPAR
Can't find implementation for Eq (Tok p)

Possible cause:
        (input):1:4-13:When checking an application of function TParsec.Combinators.exact:
                Type mismatch between
                        PAR (Type of LPAR)
                and
                        Tok p (Expected type)

I think the problem is that Idris tries to infer the type of exact LPAR and faces a
unification constraint of the form PAR ~ Tok p. Instead of delaying the resolution
of the constraint to a later point in time it fails immediately. But if it had waited it
could have solved other unification problems which would have provided the extra
information needed for Tok p to become Tok Params which can compute to PAR.

The solution I've pushed is to use a variable instead of a concrete value. In that
case Idris can simply assign the variables the type Tok p which later gets refined
to Tok Params thanks to other successful unifications. So by introducing a let-binding
we can get Idris to assign the expression a type of the form PAR -> PAR -> ....

lRrR = \ p, q => cmap () ((exact p `andm` rec) `land` (exact q `andm` rec))

You can recreate the problem by replacing p in the body of lRrR with the PAR p.
We once again force Idris to consider an problem of the form PAR ~ Tok p and it
doesn't like it at all!

*Examples/Parentheses> :e
Type checking ./Examples/Parentheses.idr
./Examples/Parentheses.idr:41:52-57:
   |
41 |   let lRrR = \ p, q => cmap () ((exact (the PAR p) `andm` rec) `land` (exact q `andm` rec))
   |                                                    ~~~~~~
When checking right hand side of PAR' with expected type
        Parser (MkParameters PAR (SizedType (List PAR)) () Void) Maybe () j

When checking an application of function Induction.Nat.fix:
        Type mismatch between
                PAR
        and
                Tok p1

Holes: Examples.Parentheses.PAR'

Nice, thanks! I should add this to my personal collection of Idris hacks, which already includes such gem as "replacing foo x y with let t = foo x y in t" - this one, sadly, doesn't work in this case :)

I'm intrigued. When is it useful to replace foo x y with let t = foo x y in t?

For example, here: https://github.com/clayrat/idris-datadata/blob/master/src/DataData/Lec2.idr#L76
There were a few other instances, but this is the most recent one.

I realize this particular case is very similar to what we have above (let-binding a function), but there definitely were instances where t was a plain value that refused to typecheck otherwise, I just can't find them now somehow :(

Ok, seems the unification bug is happening only when you introduce custom tokens, and a mechanical workaround is to provide "specialized" versions of token combinators, like these:

exactTOK : TOK -> All (Parser' TOK)   
exactTOK = exact

maybeTOK : (TOK -> Maybe Char) -> All (Parser' Char)   
maybeTOK = maybeTok

So looks like I'm done with the port (at least of the current state in agdarsec), do you have any suggestions?

I haven't tested it but I think that the trick now is to define a monad transformer that
looks something like:

ParserT m = StateT (Position * List Annotation) (EitherT (Position * List Annotation) m)

and to make sure that the Alternative (ParserT m) instance implements empty by
grabbing the current state and returning that as the failure.

To find the motivation to write a position-aware example I've started a new project
(https://github.com/gallais/agdarky) yesterday evening. I now have the scopechecker & typechecker
ready and the next step is probably the parser. So pretty soon agdarsec should
be moving forward some more with ParserT.

I now have something I am pretty happy with in agdarky:
https://github.com/gallais/agdarky/blob/master/src/stlc/Parse.agda#L73

It is a more informative version of the STLC example. It is also perhaps a bit
confusing because I am encoding the parse result using my generic-syntax library
(this allows me to get scopechecking for free and then typechecking, renaming,
substitution, printing on the terms for free or relatively cheap).

Note that to make use of Instrumented's getPosition to annotate my parsed
term with source position information I had to add new combinators to agdarsec:
gallais/agdarsec@a84181e

iI've tried applying these ideas to the STLC example in 623aaf9, not sure if I got it right?

UPD: Seems like it's not, the returned position never goes beyond 0:1 :/

UPD2: Ah, I should've probably stored the positions in the parsed AST, not threaded them through. Still I wonder why it's not increasing. I suspect it's picking up a wrong implementation for the state monad...

Ah, I should've probably stored the positions in the parsed AST

Yeah, I think that would make the example more convincing.

Still I wonder why it's not increasing.

It's picking up the position when going through lam's mand getPosition (withSpaces var)
and ignores the rest because of the recursive call in language which drops the
Position with map snd . val.

So you're getting the position right after the '\' i.e. (0, 1).

Examples/STLCins> parse' "\\ x . (\\ y . y : 'a -> 'a) x" (val stlc)
Right (Lam (MkPosition 0 1)
           "x"
           (Emb (MkPosition 0 6)
                (App (MkPosition 0 27)
                     (Cut (MkPosition 0 15)
                          (Lam (MkPosition 0 8) "y" (Emb (MkPosition 0 13) (Var "y")))
                          (ARR (K "a") (K "a")))
                     (Emb (MkPosition 0 27) (Var "x"))))) : Either Error Val

It seems to work now! Though all the errors are still reported at 0:1 for some reason. Also it's a bit weird that Cut is annotated at colon in col 15 (though this seems expected) and App at col 27?

Though all the errors are still reported at 0:1 for some reason.

Strange indeed.

Also it's a bit weird that Cut is annotated at colon in col 15 (though this seems expected) and App at col 27?

In the parser the position of a Cut is considered to be the colon : and the position
of an App is considered to be the space between the function and its argument.
The Cut is in the function part of the application so it makes sense that its position
would come before that of the application.

I think I may understand what is happening.

When you write the simple grammar distinguishing two strings:

G := "ab" | "ba"

If the input string is "ac" then you enter the first branch, match 'a' and then fail on 'b'.
So you roll-back to try the second branch and fail immediately on 'b'. Instead of
reporting that error occurred in the first branch at the second character, you report that
it has happened on the second branch at the first character.

I suppose one solution would be a mechanism to "commit" to one branch (and completely
ignore the alternatives set up previously). In the STLC case the grammar is unambiguous
so committing as soon as one has read a '\' or an opening parenthesis should work.

Alternatively we could tweak the disjunction so that it doesn't throw away the error
in the first branch like it currently does:

  (ST a) <|> (ST b) = ST $ \pos => case a pos of 
                                     Left _ => b pos
                                     Right x => Right x

We could instead write something like this:

  (ST a) <|> (ST b) = ST $ \pos =>
    case a pos of 
      Right x => Right x
      Left e1 => case b pos of
        Right x => Right x
        Left e2 => Left $ heuristic e1 e2             

where heuristics picks its favourite error out of the two e.g. the "deepest" one.

This could be done with the current apparatus by comparing the positions and keeping
the one the furthest away from the start. Or we could add a Nat to the state + error
measuring how deep the context was when the failure was recorded (we'd increase that
Nat each time we go under a \\ or an opening parenthesis for instance)

It sounds like both mechanisms could be useful on their own - for unambiguous grammars the first one should probably be faster, while the second would help with ambiguous ones.

What I like about this is that it's all purely in the monad instance. So we don't
actually have to touch the combinators to change the behaviour of the parser!

Nice, I've ported the commit example. Next, I guess it'd be cool to cook up an example of a heuristic-driven reporting for an ambiguous case, and then move the relevant parts to the core. Then we can merge to master and finally close the issue :)

I've added a trivial example for the heuristic error: https://github.com/gallais/idris-tparsec/blob/parameters/src/Examples/Ambig.idr. We could replace it with the Nat-indexed version for STLC, as you described above; or maybe there's some other non-trivial ambiguous grammar you can think of?

As far as I can tell we are only using recordToken from Instrumented. I think we
should drop the other constraints as well as the Ann and Pos types in Parameters.
I also think it would be worth dropping Instrumented altogether and putting recordToken
into Parameters.

End users can always recover the same functionality based on the monad stack
they are using. We should make that clear in the examples but we do not need to
bake it into a specific constraint.

And that should yield the final design for this new version!

We're almost there: gallais/agdarsec#11

The newer things are in Text.Parser.Monad where we have different ready-to-use stacks:

  • AgdarsecT the most general monad transformer
  • Agdarsec the most general monad (~ AgdarsecT applied to the identity)
  • Agdarsecβ€² which corresponds to the former unInstr

So, I've ported your latest PR in 632fafd. Made some changes - since implementations have to be constructors, all transformers need wrapping in records. Also I've put most of the new stuff directly in TParsec.Types, moving Result out. Now I just need to port the instrumented version of STLC, and think up something about ambiguous example, do you have any advice on that one?

Fixed by bd740fe