AndrasKovacs/flatparse

ParserIO?

dminuoso opened this issue · 11 comments

Having IO in a parser might be a useful thing for some of the following reasons:

a) keeping non-trivial state via IORef, while keeping nice small unboxed return types
b) logging during parsing
c) to efficiently poke into mutable buffers while parsing

So I came up with this type

https://gist.github.com/dminuoso/6de513c3e98ba67588957c19ce4eacbd

This allows us to:

a) Run full blown parser over IO
b) Unsafely run IOful parsers in a pure parser
c) Reuse all the Parser primitives via liftP without having to rewrite them again.
d) Throw exceptions to generate uncatchable errors (say assertions)

If we want this, I do not think a variant for Stateful is really necessary, given that ParserIO essentially subsumes it. You could use ReaderT Env (or implicit params) for a convenient reader environment, and IORefs give you limitless

If you're fine with this, I will move forward with a PR and some tests.

This seems very very cool to me. Would be keen to look at how it can be exploited. possible flatparse parser for aeson...? (edit: aeson seems to be stateless anyway)

I thought about threading State# when first defining the Parser type but then skipped it because I didn't need it. I do agree that it can be useful. There should be roughly no disadvantage to always threading State#, because it's erased from generated code and it does not really add more sequentiality than what we already have. There could be some hypothetical blocking of GHC optimizations but I don't expect this to be significant.

If we want to thread State# then it should be just included in all basic Parser definitions, without any additional code duplication. I haven't yet thought deeply about the API for running parsers in IO and running IO in parsers. One suggestion:

  • Add an extra m parameter to Parser types. Have liftIO :: IO a -> Parser IO a, runParser :: Parser Identity a -> Result a and runParserIO :: Parser IO a -> IO (Result a) (roughly speaking). This is somewhat similar to PrimMonad in primitive, but we really only want to support IO, Identity, and perhaps ST. We could also put a class with these instances in the library, and overload liftIO to also be able to run ST.

Stateful should not be dropped because it is significantly more efficient than what can be achieved by mutable refs. The point of Stateful is to get the absolute best performance by threading a word in registers. Unfortunately I don't know any way to get rid of the Basic-Stateful code duplication in any reasonable way in current Haskell. We don't have sufficient representation polymorphism to do it robustly, and TH and Backpack have awful UX.

Apologies for the confusion. I did not mean to remove Stateful, merely that my conceived ParserIO newtype might not need an equivalent for Stateful.

I did explore your idea of threading State# universally through all parsers a bit

https://github.com/dminuoso/flatparse-state/blob/main/app/Main.hs

(The main juicy bits are towards the bottom)

Here's my thoughts so far:

  • You can seamlessly use the parser primitives irrespective of whether you are in ParserST, ParserIO or Parser (because all that happens is that the state token gets monomorphized). No wrapper functions are needed
  • You can embed IO into pure parsers, which is much more safe than general unsafePerformIO because over the duration of the parser we still have the state threading.
  • You can trivially embed ParserIO or ParserST into pure parsers, that allows for writing a section that maybe relies on IO/ST for some reason (in particular poking into mutable buffers)
  • You get unpleasant unification errors along the lines of expected NoIO, actual RealWorld when mismatching say ParserIO with ParserST, but this would realistically not happen often.
  • This will obviously break code that conjures up Parser right now - but I think that's benign. Major version bump and proceed.

One thing that becomes very viable, is adding cabal flag controlled logging or assertions now, since you can simply write

do w <- foo
#ifdef DEBUG_ENABLE
   liftIO (putStrLn ("foo is: " <> show w))
#endif
...

That looks pretty good. I like that only the state type is being varied. For better error messages we can just rename the state types to whatever, so that people don't necessarily need to be exposed to the RealWorld,

State# RealWorld is a wired-in type. While I have not done an exhaustive search, it appears there's at least demand analysis for imprecise exceptions. If we switch out RealWorld for IO, this will lead to unsound code

https://gitlab.haskell.org/ghc/ghc/-/blob/451aeac3b07f171f148995717d0d9a1eefe08f0e/compiler/GHC/Core/Opt/DmdAnal.hs#L582-614
https://gitlab.haskell.org/ghc/ghc/-/blob/451aeac3b07f171f148995717d0d9a1eefe08f0e/compiler/GHC/Core/Opt/DmdAnal.hs#L772-807

Take particular note of
https://gitlab.haskell.org/ghc/ghc/-/blob/451aeac3b07f171f148995717d0d9a1eefe08f0e/compiler/GHC/Core/Opt/DmdAnal.hs#L781-782

If we insist on better diagnostic, we have to newtype the parsers, which essentially requires embedding each individual Parser into ParserST or ParserIO via some lift functions.

I think it's not necessarily needed, this situation will only occur if you try an embed an ParserIO action into something with a ParserST or Parser type, a scenario that does not seem overly likely.

What about type family-ing the state:

data Mode = PureMode | IOMode | STMode s

type family State (m :: Mode) where
  State PureMode   = Proxy# ()
  State IOMode     = State# RealWorld
  State (STMode s) = State# s

type Res# m e a =
  (#
    (# State m, a, Addr# #)
  | (# State m #)
  | (# State m, e #)
  #)

newtype Parser m e a = Parser {runParser# :: ForeinPtrContents -> Addr# -> Addr# -> State m -> Res# m e a}

runParser :: Parser PureMode e a -> Result e a
runParserIO :: Parser IOMode e a -> IO (Result e a)
runParserST :: (forall s. Parser (STMode s) e a) -> Result e a

With a bit of adaption I wrote this dminuoso/flatparse-state@788d9d8

Notes:

  • We need an injective type family to make it work with our view patterns
  • Switch State# and Proxy# around to satisfy injectivity - that is, we cant have both State IOMode = State# RealWorld and State (STMode s) = State# s - but we can just switch them around and unsafeCoerce them back. As long as we dont do this with State# RealWorld we should be fine
  • Amusingly the underlying parser newtype is a sort of almost monad transformer now, therefore I named it ParserT

This now produces the following error:

app/Main.hs:174:7: error:
    • Couldn't match type: 'IOMode
                     with: 'PureMode
      Expected: Parser () Int
        Actual: ParserIO () ()
    • In the expression: thing
      In an equation for ‘err’: err = thing
    |
174 | err = thing
    |       ^^^^^

Which reads nice with both the unification error on the token type, as well as with the outer type alias.

After pondering some more, I decided to compare the diagnostic with the one with just RealWorld:


app/Main.hs:177:7: error:
    • Couldn't match type ‘RealWorld’ with ‘PureWorld’
      Expected: Parser () Int
        Actual: ParserIO () ()
    • In the expression: thing
      In an equation for ‘err’: err = thing
    |
177 | err = thing
    |       ^^^^


It seems maybe we did not really gain much at the cost of extra complexity, some extra extensions, additional uses of unsafeCoerce. This could even be problematic, we would need to look into GHC further if State# or Proxy# is specially treated, and whether coercions between them is safe or not. And even if we establish it is, a future GHC update could introduce subtle bugs here.

The diagnostic should always mention the type alias as well, perhaps I did not look closely enough at the time I mentioned it.

In my opinion we should not do this. RealWorld seems fine here.

It seems maybe we did not really gain much at the cost of extra complexity, some extra extensions, additional uses of unsafeCoerce.

We should not need to coerce between state types in nominally safe code. There are some unnecessary coercions in your previous code:

  • In runParser we should pass proxy# (Proxy should be the state of pure things, is that a typo?)
  • In runParserST we should use runRW#.
  • liftST doesn't need unsafeCoerce.

Instead of type families, we can just use type synonyms for state types, that saves us language extensions, and maybe the synonyms will persist in error messages (should be tested).

Generally, having Proxy#, State# RealWorld and State# s as distinct state types gives us the greatest safety and precision w.r.t GHC compilation (if we want to have IO, ST and pure code at the same time).

I don't think it makes sense to have fixed State# RealWorld threading. First, we would still need to distinguish Parser and ParserIO in user-facing API, so why not just make the distinction in state types. Second, while this would be fine for pure and IO parsing (because I doubt that there's any penalty to RealWorld passing in pure parser code), for ST support we'd need to cast between State# RealWorld and State# s, which I don't really like.

dminuoso/flatparse-state@bda57a9

Okay, this change is actually simple to do, we can simply write

#if !MIN_VERSION_base(4,17,0)
type ZeroBitRep = 'TupleRep ('[] :: [RuntimeRep])
type ZeroBitType = TYPE ZeroBitRep
#endif

newtype ParserT (st :: ZeroBitType) e a = ParserT
  { runParserT# :: ForeignPtrContents
                 -> Addr#
                 -> Addr#
                 -> st
                 -> Res# st e a
  }

type IOMode = State# RealWorld
type PureMode = Proxy# ()
type STMode s = State# s

type Parser = ParserT PureMode
type ParserIO = ParserT IOMode
type ParserST s = ParserT (STMode s)

Of course the diagnostic is getting slightly worse now:

app/Main.hs:186:7: error:
    • Couldn't match type: State# RealWorld
                     with: Proxy# ()
      Expected: Parser () Int
        Actual: ParserIO () ()
    • In the expression: thing
      In an equation for ‘err’: err = thing
    |
186 | err = thing
    |       ^^^^^

The type aliases are sadly not kept.

I have removed the unnecessary uses of unsafeCoerce. We can also write runParserST directly in terms of runRW#, but that is definitely not necessarily, given that unsafePerformIO will do just that, plus some lazy which is actually missing from runST (will find the related GHC bug later today)

Semantically this looks great, at the cost of potentially exposing some magic hashes in rare diagnostics. Given that this seems to also produce Expected: Parser () Int Actual: ParserIO () () I'm personally fine with it.

This should be proxy# passing: https://github.com/dminuoso/flatparse-state/blob/main/app/Main.hs#L131
Looks otherwise pretty good to me.