ParserIO?
dminuoso opened this issue · 11 comments
Having IO in a parser might be a useful thing for some of the following reasons:
a) keeping non-trivial state via IORef, while keeping nice small unboxed return types
b) logging during parsing
c) to efficiently poke into mutable buffers while parsing
So I came up with this type
https://gist.github.com/dminuoso/6de513c3e98ba67588957c19ce4eacbd
This allows us to:
a) Run full blown parser over IO
b) Unsafely run IOful parsers in a pure parser
c) Reuse all the Parser primitives via liftP
without having to rewrite them again.
d) Throw exceptions to generate uncatchable errors (say assertions)
If we want this, I do not think a variant for Stateful is really necessary, given that ParserIO essentially subsumes it. You could use ReaderT Env
(or implicit params) for a convenient reader environment, and IORefs give you limitless
If you're fine with this, I will move forward with a PR and some tests.
This seems very very cool to me. Would be keen to look at how it can be exploited. possible flatparse parser for aeson...? (edit: aeson seems to be stateless anyway)
I thought about threading State#
when first defining the Parser
type but then skipped it because I didn't need it. I do agree that it can be useful. There should be roughly no disadvantage to always threading State#
, because it's erased from generated code and it does not really add more sequentiality than what we already have. There could be some hypothetical blocking of GHC optimizations but I don't expect this to be significant.
If we want to thread State#
then it should be just included in all basic Parser
definitions, without any additional code duplication. I haven't yet thought deeply about the API for running parsers in IO and running IO in parsers. One suggestion:
- Add an extra
m
parameter toParser
types. HaveliftIO :: IO a -> Parser IO a
,runParser :: Parser Identity a -> Result a
andrunParserIO :: Parser IO a -> IO (Result a)
(roughly speaking). This is somewhat similar toPrimMonad
inprimitive
, but we really only want to supportIO
,Identity
, and perhapsST
. We could also put a class with these instances in the library, and overloadliftIO
to also be able to runST
.
Stateful
should not be dropped because it is significantly more efficient than what can be achieved by mutable refs. The point of Stateful
is to get the absolute best performance by threading a word in registers. Unfortunately I don't know any way to get rid of the Basic
-Stateful
code duplication in any reasonable way in current Haskell. We don't have sufficient representation polymorphism to do it robustly, and TH and Backpack have awful UX.
Apologies for the confusion. I did not mean to remove Stateful
, merely that my conceived ParserIO
newtype might not need an equivalent for Stateful
.
I did explore your idea of threading State# universally through all parsers a bit
https://github.com/dminuoso/flatparse-state/blob/main/app/Main.hs
(The main juicy bits are towards the bottom)
Here's my thoughts so far:
- You can seamlessly use the parser primitives irrespective of whether you are in ParserST, ParserIO or Parser (because all that happens is that the state token gets monomorphized). No wrapper functions are needed
- You can embed IO into pure parsers, which is much more safe than general unsafePerformIO because over the duration of the parser we still have the state threading.
- You can trivially embed ParserIO or ParserST into pure parsers, that allows for writing a section that maybe relies on IO/ST for some reason (in particular poking into mutable buffers)
- You get unpleasant unification errors along the lines of
expected NoIO, actual RealWorld
when mismatching say ParserIO with ParserST, but this would realistically not happen often. - This will obviously break code that conjures up
Parser
right now - but I think that's benign. Major version bump and proceed.
One thing that becomes very viable, is adding cabal flag controlled logging or assertions now, since you can simply write
do w <- foo
#ifdef DEBUG_ENABLE
liftIO (putStrLn ("foo is: " <> show w))
#endif
...
That looks pretty good. I like that only the state type is being varied. For better error messages we can just rename the state types to whatever, so that people don't necessarily need to be exposed to the RealWorld
,
State# RealWorld
is a wired-in type. While I have not done an exhaustive search, it appears there's at least demand analysis for imprecise exceptions. If we switch out RealWorld
for IO, this will lead to unsound code
https://gitlab.haskell.org/ghc/ghc/-/blob/451aeac3b07f171f148995717d0d9a1eefe08f0e/compiler/GHC/Core/Opt/DmdAnal.hs#L582-614
https://gitlab.haskell.org/ghc/ghc/-/blob/451aeac3b07f171f148995717d0d9a1eefe08f0e/compiler/GHC/Core/Opt/DmdAnal.hs#L772-807
Take particular note of
https://gitlab.haskell.org/ghc/ghc/-/blob/451aeac3b07f171f148995717d0d9a1eefe08f0e/compiler/GHC/Core/Opt/DmdAnal.hs#L781-782
If we insist on better diagnostic, we have to newtype the parsers, which essentially requires embedding each individual Parser into ParserST or ParserIO via some lift
functions.
I think it's not necessarily needed, this situation will only occur if you try an embed an ParserIO
action into something with a ParserST
or Parser
type, a scenario that does not seem overly likely.
What about type family-ing the state:
data Mode = PureMode | IOMode | STMode s
type family State (m :: Mode) where
State PureMode = Proxy# ()
State IOMode = State# RealWorld
State (STMode s) = State# s
type Res# m e a =
(#
(# State m, a, Addr# #)
| (# State m #)
| (# State m, e #)
#)
newtype Parser m e a = Parser {runParser# :: ForeinPtrContents -> Addr# -> Addr# -> State m -> Res# m e a}
runParser :: Parser PureMode e a -> Result e a
runParserIO :: Parser IOMode e a -> IO (Result e a)
runParserST :: (forall s. Parser (STMode s) e a) -> Result e a
With a bit of adaption I wrote this dminuoso/flatparse-state@788d9d8
Notes:
- We need an injective type family to make it work with our view patterns
- Switch State# and Proxy# around to satisfy injectivity - that is, we cant have both
State IOMode = State# RealWorld
andState (STMode s) = State# s
- but we can just switch them around and unsafeCoerce them back. As long as we dont do this withState# RealWorld
we should be fine - Amusingly the underlying parser newtype is a sort of almost monad transformer now, therefore I named it ParserT
This now produces the following error:
app/Main.hs:174:7: error:
• Couldn't match type: 'IOMode
with: 'PureMode
Expected: Parser () Int
Actual: ParserIO () ()
• In the expression: thing
In an equation for ‘err’: err = thing
|
174 | err = thing
| ^^^^^
Which reads nice with both the unification error on the token type, as well as with the outer type alias.
After pondering some more, I decided to compare the diagnostic with the one with just RealWorld:
app/Main.hs:177:7: error:
• Couldn't match type ‘RealWorld’ with ‘PureWorld’
Expected: Parser () Int
Actual: ParserIO () ()
• In the expression: thing
In an equation for ‘err’: err = thing
|
177 | err = thing
| ^^^^
It seems maybe we did not really gain much at the cost of extra complexity, some extra extensions, additional uses of unsafeCoerce. This could even be problematic, we would need to look into GHC further if State# or Proxy# is specially treated, and whether coercions between them is safe or not. And even if we establish it is, a future GHC update could introduce subtle bugs here.
The diagnostic should always mention the type alias as well, perhaps I did not look closely enough at the time I mentioned it.
In my opinion we should not do this. RealWorld seems fine here.
It seems maybe we did not really gain much at the cost of extra complexity, some extra extensions, additional uses of unsafeCoerce.
We should not need to coerce between state types in nominally safe code. There are some unnecessary coercions in your previous code:
- In
runParser
we should passproxy#
(Proxy
should be the state of pure things, is that a typo?) - In
runParserST
we should userunRW#
. liftST
doesn't need unsafeCoerce.
Instead of type families, we can just use type synonyms for state types, that saves us language extensions, and maybe the synonyms will persist in error messages (should be tested).
Generally, having Proxy#
, State# RealWorld
and State# s
as distinct state types gives us the greatest safety and precision w.r.t GHC compilation (if we want to have IO, ST and pure code at the same time).
I don't think it makes sense to have fixed State# RealWorld
threading. First, we would still need to distinguish Parser
and ParserIO
in user-facing API, so why not just make the distinction in state types. Second, while this would be fine for pure and IO parsing (because I doubt that there's any penalty to RealWorld
passing in pure parser code), for ST support we'd need to cast between State# RealWorld
and State# s
, which I don't really like.
dminuoso/flatparse-state@bda57a9
Okay, this change is actually simple to do, we can simply write
#if !MIN_VERSION_base(4,17,0)
type ZeroBitRep = 'TupleRep ('[] :: [RuntimeRep])
type ZeroBitType = TYPE ZeroBitRep
#endif
newtype ParserT (st :: ZeroBitType) e a = ParserT
{ runParserT# :: ForeignPtrContents
-> Addr#
-> Addr#
-> st
-> Res# st e a
}
type IOMode = State# RealWorld
type PureMode = Proxy# ()
type STMode s = State# s
type Parser = ParserT PureMode
type ParserIO = ParserT IOMode
type ParserST s = ParserT (STMode s)
Of course the diagnostic is getting slightly worse now:
app/Main.hs:186:7: error:
• Couldn't match type: State# RealWorld
with: Proxy# ()
Expected: Parser () Int
Actual: ParserIO () ()
• In the expression: thing
In an equation for ‘err’: err = thing
|
186 | err = thing
| ^^^^^
The type aliases are sadly not kept.
I have removed the unnecessary uses of unsafeCoerce. We can also write runParserST directly in terms of runRW#, but that is definitely not necessarily, given that unsafePerformIO will do just that, plus some lazy
which is actually missing from runST (will find the related GHC bug later today)
Semantically this looks great, at the cost of potentially exposing some magic hashes in rare diagnostics. Given that this seems to also produce Expected: Parser () Int Actual: ParserIO () ()
I'm personally fine with it.
This should be proxy#
passing: https://github.com/dminuoso/flatparse-state/blob/main/app/Main.hs#L131
Looks otherwise pretty good to me.