mrkkrp/megaparsec

How to report "lexical" errors?


I'm building a parser that accepts a custom token stream.
I've made TokenStream (from lexer-applicative) an instance of Stream:

instance Stream (TokenStream (L tok)) where

That worked wonderfully until a "lexical error" appeared in my token stream:

-- | A stream of tokens
data TokenStream tok
  = TsToken tok (TokenStream tok)
  | TsEof
  | TsError LexicalError

The parser complained about unexpected end of input, because I had no choice but to treat TsError like TsEof.

I think there are 3 ways of solving this:

  1. Make Stream "aware" of these lexical errors: for example, let take1_ return an Either value instead of just a Maybe value.
  2. Make the parser incremental, so that users can check whether the next token is TsError before feeding it to the parser.
  3. The "happy" way, something between 1. and 2.
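To make option 1 concrete, here is a minimal sketch of what an Either-returning uncons could look like. It is not megaparsec's API: take1E is a hypothetical name, and LexicalError is a simplified stand-in for the lexer-applicative type (its position is reduced to an Int offset).

```haskell
-- Simplified stand-in for lexer-applicative's LexicalError (assumption:
-- the position is reduced to a plain offset for this sketch).
newtype LexicalError = LexicalError Int
  deriving (Eq, Show)

-- The stream type from the issue.
data TokenStream tok
  = TsToken tok (TokenStream tok)
  | TsEof
  | TsError LexicalError

-- Hypothetical variant of take1_:
--   Left e           the lexer failed at this point
--   Right Nothing    genuine end of input
--   Right (Just ..)  the next token and the rest of the stream
take1E :: TokenStream tok -> Either LexicalError (Maybe (tok, TokenStream tok))
take1E (TsToken t rest) = Right (Just (t, rest))
take1E TsEof            = Right Nothing
take1E (TsError e)      = Left e
```

With this shape a parser could report "unexpected end of input" only for Right Nothing and surface the lexer's own error for Left e.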

I'll explain more about how it can be done in happy:

Happy also allows users to choose their own token stream type (usually produced by alex), as long as we tell happy which token stands for eof:

%lexer { <lexer> } { <eof> }

and what to do when a token comes in:

lexer :: (Token -> P a) -> P a

For example, this is how to deal with a token stream from lexer-applicative:

lexer :: (Token -> P a) -> P a
lexer f = scanNext >>= f

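scanNext :: P Token
scanNext = do
  stream <- gets tokenStream
  case stream of
    -- Store the rest of the stream so the next call sees the following token
    -- (assumes tokenStream is a record field of the parser state).
    TsToken (L _ tok) rest -> do
      modify (\s -> s { tokenStream = rest })
      return tok
    TsEof -> return TokenEOF
    TsError (LexicalError pos) -> throwError $ Lexical pos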

I think this is the best of the 3 solutions, because it lets users handle lexical errors the way they like, and it's not overkill like making megaparsec incremental.

But I'm still not sure about how to incorporate this into the Stream class, if we are going to do this.

Should a token stream with an error in it be fed into a parser? You could just report the error because parsing won't succeed anyway.

Ideally you wouldn't know whether a token stream contains an error until you keep extracting from it and eventually encounter one.

The workaround I'm using now is to force the whole stream into a list and check whether it contains an error.
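That workaround can be sketched roughly as follows. The function name toEitherList is made up for this example, and LexicalError is again a simplified stand-in with an Int offset instead of a real source position.

```haskell
-- Simplified stand-in for lexer-applicative's LexicalError (assumption).
newtype LexicalError = LexicalError Int
  deriving (Eq, Show)

-- The stream type from the issue.
data TokenStream tok
  = TsToken tok (TokenStream tok)
  | TsEof
  | TsError LexicalError

-- Walk the whole stream up front: either the first lexical error,
-- or every token in order. (Hypothetical helper name.)
toEitherList :: TokenStream tok -> Either LexicalError [tok]
toEitherList (TsToken t rest) = (t :) <$> toEitherList rest
toEitherList TsEof            = Right []
toEitherList (TsError e)      = Left e
```

The cost is that the stream is no longer consumed lazily: the entire input is lexed before the parser sees its first token.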

I don't know if you still need this, but another workaround is to define type Token s = Either String tok and throw a parse error whenever you get a Left. Unfortunately, the expected tokens in error messages will then always show up as Right _, so you could use a label for them instead.
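A rough sketch of that idea, assuming megaparsec 9 or later (which provides a Stream instance for plain lists) and feeding the parser a list of Either String Tok rather than a TokenStream. Tok and tokenP are hypothetical names introduced only for this example.

```haskell
import qualified Data.Set as Set
import Data.Void (Void)
import Text.Megaparsec

-- Hypothetical token type, for illustration only.
data Tok = TNum Int | TPlus
  deriving (Eq, Ord, Show)

-- Each stream element is either a lexer error message or a real token.
type Parser = Parsec Void [Either String Tok]

-- Match one token accepted by the predicate. If the next element is a
-- Left from the lexer, fail with the lexer's message instead. The label
-- replaces the "Right _" that would otherwise appear in error messages.
tokenP :: String -> (Tok -> Bool) -> Parser Tok
tokenP name ok = do
  next <- lookAhead (optional anySingle)
  case next of
    Just (Left msg) -> fail ("lexical error: " ++ msg)
    _               -> label name (token match Set.empty)
  where
    match (Right x) | ok x = Just x
    match _                = Nothing
```

For instance, running tokenP "number" isNum (with isNum a predicate that accepts TNum) on [Left "bad character"] fails with the lexer's message rather than with "unexpected end of input".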