Error reporting
clayrat opened this issue Β· 33 comments
This currently seems to be the biggest feature missing from TParsec. Do you have any suggestions on how to implement this better? Maybe by changing the Success
structure to be a variant type Result = Error | Success
?
I gave some ideas in gallais/agdarsec#8 but never got around to trying them out.
I've started the port of parameters
branch.
With hindsight, I would tackle the problem in two steps:
- Move to
Parser : Parameters -> Type
first (with just the usualtok
,toks
,m
) - Then extend the
Parameters
record type to add instrumentation
The introduction of Parameters
already raises some issues: I ended up using
Subset Char P.Tok
in Text.Parser.Char
for instance because that made instance
resolution easier.
Yes, that's the conclusion I've also reached after I started porting the examples :) In Idris the problem is not only with widening combinators and introducing Subset
s, but also with types and implicits for running and constructing concrete parsers, e.g., for now I could only make everything work with code like this:
parseMaybe : (str : String) ->
(MonadRun mn, Instrumented p mn, Tokenizer (Tok p),
Subset (SizedList (Tok p) (length $ tokenize {tok = Tok p} str)) (Toks p n)) =>
(All (Parser p mn a)) -> Maybe a
parseMaybe @{str} @{mr} @{is} @{to} @{su} par =
let
input = MkSizedList $ tokenize @{to} str
result = runParser par lteRefl (into @{su} input)
valid = \s => toMaybe (Size s == Z) (Value s)
in
traverse valid (runMonad result) >>= head'
PAR' : All (Parser' ())
PAR' = fix (Parser' ()) $ \rec =>
alts [ cmap () ((exact {p=Params} {mn=Maybe} LPAR `andm` rec) `land`
(exact {p=Params} {mn=Maybe} RPAR `andm` rec))
, cmap () ((exact {p=Params} {mn=Maybe} LCUR `andm` rec) `land`
(exact {p=Params} {mn=Maybe} RCUR `andm` rec))
, cmap () ((exact {p=Params} {mn=Maybe} LSQU `andm` rec) `land`
(exact {p=Params} {mn=Maybe} RSQU `andm` rec))
]
I'm actually inclined to try moving tok
/toks
back to the Parser
type and leaving only the annotation in Parameters
, this might help a lot with the resolution.
This is strange: in PAR'
all the p
s and mn
should be solved by unification! :/
As for the runner, I think you could legitimately demand that the user provides a
function of type: (ts : List Tok) -> Toks (lenght ts)
.
Or you could introduce an ISubset
:
record ISubset (S T : i -> Type) where
iinto : {j : i} -> S j -> T j
and make the constraint more general: ISubset (SizedList (Tok p)) (Toks p)
.
This way you don't have to tokenize the string in the type which is... scary.
This is strange: in PAR' all the ps and mn should be solved by unification! :/
Ok, turns out mn
is redundant indeed, but you still need to match on p
. The problem is, AFAIU, that Idris doesn't expand the definition of records when running unification, thus Tok p
doesn't evaluate to PAR
because it doesn't want to look at what p
is.
Unification not valid modulo beta-conversion? This sounds like a bug.
I have to go run some errands but I'll try to have a look later tonight.
The funny thing is that if you do something like this
PAR' : All (Parser' ())
PAR' = fix (Parser' ()) $ \rec => cmap () $ exact ?wat
inspecting the hole gives you everything you want:
j : Nat
j1 : Nat
rec : Box (Parser (MkParameters PAR (SizedType (List PAR)) () Void) Maybe ()) j1
mn : Type -> Type
p : Parameters
constraint : Functor Maybe
constraint1 : Instrumented (MkParameters PAR (SizedType (List PAR)) () Void) Maybe
j2 : Nat
constraint2 : Alternative Maybe
constraint3 : Monad Maybe
constraint4 : Instrumented (MkParameters PAR (SizedType (List PAR)) () Void) Maybe
constraint5 : Inspect (SizedType (List PAR)) PAR
constraint6 : Eq PAR
--------------------------------------
wat : PAR
but as soon as you plug it with a constructor, you get
Type mismatch between
PAR
and
Tok p
I've pushed the current state of affairs to parameters
branch if you'd like to take a look yourself.
Ok, I think I may understand what the issue is. You can get a similar error in
the REPL if you call exact LPAR
with :t
:
*Examples/Parentheses> :t exact LPAR
Can't find implementation for Eq (Tok p)
Possible cause:
(input):1:4-13:When checking an application of function TParsec.Combinators.exact:
Type mismatch between
PAR (Type of LPAR)
and
Tok p (Expected type)
I think the problem is that Idris tries to infer the type of exact LPAR
and faces a
unification constraint of the form PAR ~ Tok p
. Instead of delaying the resolution
of the constraint to a later point in time it fails immediately. But if it had waited it
could have solved other unification problems which would have provided the extra
information needed for Tok p
to become Tok Params
which can compute to PAR
.
The solution I've pushed is to use a variable instead of a concrete value. In that
case Idris can simply assign the variables the type Tok p
which later gets refined
to Tok Params
thanks to other successful unifications. So by introducing a let-binding
we can get Idris to assign the expression a type of the form PAR -> PAR -> ...
.
lRrR = \ p, q => cmap () ((exact p `andm` rec) `land` (exact q `andm` rec))
You can recreate the problem by replacing p
in the body of lRrR
with the PAR p
.
We once again force Idris to consider an problem of the form PAR ~ Tok p
and it
doesn't like it at all!
*Examples/Parentheses> :e
Type checking ./Examples/Parentheses.idr
./Examples/Parentheses.idr:41:52-57:
|
41 | let lRrR = \ p, q => cmap () ((exact (the PAR p) `andm` rec) `land` (exact q `andm` rec))
| ~~~~~~
When checking right hand side of PAR' with expected type
Parser (MkParameters PAR (SizedType (List PAR)) () Void) Maybe () j
When checking an application of function Induction.Nat.fix:
Type mismatch between
PAR
and
Tok p1
Holes: Examples.Parentheses.PAR'
Nice, thanks! I should add this to my personal collection of Idris hacks, which already includes such gem as "replacing foo x y
with let t = foo x y in t
" - this one, sadly, doesn't work in this case :)
I'm intrigued. When is it useful to replace foo x y
with let t = foo x y in t
?
For example, here: https://github.com/clayrat/idris-datadata/blob/master/src/DataData/Lec2.idr#L76
There were a few other instances, but this is the most recent one.
I realize this particular case is very similar to what we have above (let-binding a function), but there definitely were instances where t
was a plain value that refused to typecheck otherwise, I just can't find them now somehow :(
Ok, seems the unification bug is happening only when you introduce custom tokens, and a mechanical workaround is to provide "specialized" versions of token combinators, like these:
exactTOK : TOK -> All (Parser' TOK)
exactTOK = exact
maybeTOK : (TOK -> Maybe Char) -> All (Parser' Char)
maybeTOK = maybeTok
So looks like I'm done with the port (at least of the current state in agdarsec), do you have any suggestions?
I haven't tested it but I think that the trick now is to define a monad transformer that
looks something like:
ParserT m = StateT (Position * List Annotation) (EitherT (Position * List Annotation) m)
and to make sure that the Alternative (ParserT m)
instance implements empty
by
grabbing the current state and returning that as the failure.
To find the motivation to write a position-aware example I've started a new project
(https://github.com/gallais/agdarky) yesterday evening. I now have the scopechecker & typechecker
ready and the next step is probably the parser. So pretty soon agdarsec should
be moving forward some more with ParserT
.
I now have something I am pretty happy with in agdarky:
https://github.com/gallais/agdarky/blob/master/src/stlc/Parse.agda#L73
It is a more informative version of the STLC
example. It is also perhaps a bit
confusing because I am encoding the parse result using my generic-syntax library
(this allows me to get scopechecking for free and then typechecking, renaming,
substitution, printing on the terms for free or relatively cheap).
Note that to make use of Instrumented
's getPosition
to annotate my parsed
term with source position information I had to add new combinators to agdarsec:
gallais/agdarsec@a84181e
iI've tried applying these ideas to the STLC example in 623aaf9, not sure if I got it right?
UPD: Seems like it's not, the returned position never goes beyond 0:1
:/
UPD2: Ah, I should've probably stored the positions in the parsed AST, not threaded them through. Still I wonder why it's not increasing. I suspect it's picking up a wrong implementation for the state monad...
Ah, I should've probably stored the positions in the parsed AST
Yeah, I think that would make the example more convincing.
Still I wonder why it's not increasing.
It's picking up the position when going through lam
's mand getPosition (withSpaces var)
and ignores the rest because of the recursive call in language
which drops the
Position
with map snd . val
.
So you're getting the position right after the '\' i.e. (0, 1)
.
Examples/STLCins> parse' "\\ x . (\\ y . y : 'a -> 'a) x" (val stlc)
Right (Lam (MkPosition 0 1)
"x"
(Emb (MkPosition 0 6)
(App (MkPosition 0 27)
(Cut (MkPosition 0 15)
(Lam (MkPosition 0 8) "y" (Emb (MkPosition 0 13) (Var "y")))
(ARR (K "a") (K "a")))
(Emb (MkPosition 0 27) (Var "x"))))) : Either Error Val
It seems to work now! Though all the errors are still reported at 0:1 for some reason. Also it's a bit weird that Cut
is annotated at colon in col 15 (though this seems expected) and App
at col 27?
Though all the errors are still reported at 0:1 for some reason.
Strange indeed.
Also it's a bit weird that Cut is annotated at colon in col 15 (though this seems expected) and App at col 27?
In the parser the position of a Cut
is considered to be the colon :
and the position
of an App
is considered to be the space between the function and its argument.
The Cut
is in the function part of the application so it makes sense that its position
would come before that of the application.
I think I may understand what is happening.
When you write the simple grammar distinguishing two strings:
G := "ab" | "ba"
If the input string is "ac" then you enter the first branch, match 'a' and then fail on 'b'.
So you roll-back to try the second branch and fail immediately on 'b'. Instead of
reporting that error occurred in the first branch at the second character, you report that
it has happened on the second branch at the first character.
I suppose one solution would be a mechanism to "commit" to one branch (and completely
ignore the alternatives set up previously). In the STLC case the grammar is unambiguous
so committing as soon as one has read a '\' or an opening parenthesis should work.
Alternatively we could tweak the disjunction so that it doesn't throw away the error
in the first branch like it currently does:
(ST a) <|> (ST b) = ST $ \pos => case a pos of
Left _ => b pos
Right x => Right x
We could instead write something like this:
(ST a) <|> (ST b) = ST $ \pos =>
case a pos of
Right x => Right x
Left e1 => case b pos of
Right x => Right x
Left e2 => Left $ heuristic e1 e2
where heuristics
picks its favourite error out of the two e.g. the "deepest" one.
This could be done with the current apparatus by comparing the positions and keeping
the one the furthest away from the start. Or we could add a Nat
to the state + error
measuring how deep the context was when the failure was recorded (we'd increase that
Nat
each time we go under a \\
or an opening parenthesis for instance)
It sounds like both mechanisms could be useful on their own - for unambiguous grammars the first one should probably be faster, while the second would help with ambiguous ones.
What I like about this is that it's all purely in the monad instance. So we don't
actually have to touch the combinators to change the behaviour of the parser!
Nice, I've ported the commit example. Next, I guess it'd be cool to cook up an example of a heuristic-driven reporting for an ambiguous case, and then move the relevant parts to the core. Then we can merge to master
and finally close the issue :)
I've added a trivial example for the heuristic error: https://github.com/gallais/idris-tparsec/blob/parameters/src/Examples/Ambig.idr. We could replace it with the Nat-indexed version for STLC, as you described above; or maybe there's some other non-trivial ambiguous grammar you can think of?
As far as I can tell we are only using recordToken
from Instrumented
. I think we
should drop the other constraints as well as the Ann
and Pos
types in Parameters
.
I also think it would be worth dropping Instrumented
altogether and putting recordToken
into Parameters
.
End users can always recover the same functionality based on the monad stack
they are using. We should make that clear in the examples but we do not need to
bake it into a specific constraint.
And that should yield the final design for this new version!
We're almost there: gallais/agdarsec#11
The newer things are in Text.Parser.Monad
where we have different ready-to-use stacks:
AgdarsecT
the most general monad transformerAgdarsec
the most general monad (~AgdarsecT
applied to the identity)Agdarsecβ²
which corresponds to the formerunInstr
So, I've ported your latest PR in 632fafd. Made some changes - since implementations have to be constructors, all transformers need wrapping in records. Also I've put most of the new stuff directly in TParsec.Types
, moving Result
out. Now I just need to port the instrumented version of STLC, and think up something about ambiguous example, do you have any advice on that one?