`dunai`: `morphGS` is probably inefficient
turion opened this issue · 7 comments
While investigating turion/rhine#227, I found that replacing the current definition of `arrM` in terms of `morphGS` with a direct definition gives a considerable speedup, and I'm trying here to explain where this speedup might come from. I'd recommend replacing the definition, because `arrM` is ubiquitous in every `dunai` program.
The implementations:
```haskell
arrM :: Monad m => (a -> m b) -> MSF m a b
arrM f =
  -- This implementation is equivalent to:
  --   arrM f = go
  --     where
  --       go = MSF $ \a -> do
  --         b <- f a
  --         return (b, go)
  morphGS (\i a -> i a >>= \(_, c) -> f a >>= \b -> return (b, c)) C.id
```
The implementation in terms of `morphGS` has the advantage of not using the `MSF` constructor, but it is also much slower at runtime.
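For reference, `morphGS` is essentially the following recursive wrapper (a self-contained sketch with a minimal stand-in for `MSF`, so the structure is visible; it is an illustration, not the library source):

```haskell
{-# LANGUAGE RankNTypes #-}
import Control.Category (Category (..))
import Prelude hiding (id, (.))

-- Minimal stand-in for dunai's MSF, for illustration only.
newtype MSF m a b = MSF { unMSF :: a -> m (b, MSF m a b) }

instance Monad m => Category (MSF m) where
  id = MSF $ \a -> return (a, id)
  sf2 . sf1 = MSF $ \a -> do
    (b, sf1') <- unMSF sf1 a
    (c, sf2') <- unMSF sf2 b
    return (c, sf2' . sf1')

-- The shape of morphGS: note the partial application 'unMSF msf',
-- and that the input MSF is carried through every single step.
morphGS :: Monad m2
        => (forall c. (a1 -> m1 (b1, c)) -> a2 -> m2 (b2, c))
        -> MSF m1 a1 b1
        -> MSF m2 a2 b2
morphGS morph msf = MSF $ \a2 -> do
  (b2, msf') <- morph (unMSF msf) a2
  return (b2, morphGS morph msf')

-- arrM as currently defined in the library, via morphGS and the
-- identity MSF (written C.id in the snippet above):
arrM :: Monad m => (a -> m b) -> MSF m a b
arrM f = morphGS (\i a -> i a >>= \(_, c) -> f a >>= \b -> return (b, c)) id
```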
Lengthy derivation of example program evaluation
Example program
Let's consider this simple example program, which creates a random string (using a fictitious function `rand :: IO String`) and prints it:

```haskell
reactimate $ arrM (const rand) >>> arrM putStrLn
```
Let us expand the definitions of the functions to see how this is evaluated and executed:
```haskell
= let msf = arrM (const rand) >>> arrM putStrLn in reactimate msf

= let msf = arrM (const rand) >>> arrM putStrLn in do
    (_, msf') <- unMSF msf ()
    reactimate msf'

= let msf = MSF $ \a -> do
        (b, sf1') <- unMSF (arrM (const rand)) a
        (c, sf2') <- unMSF (arrM putStrLn) b
        c `seq` return (c, sf1' >>> sf2')
  in do
    (_, msf') <- unMSF msf ()
    reactimate msf'

= do
    (b, sf1') <- unMSF (arrM (const rand)) ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'
```
Now it depends on how `arrM` is defined. Let us choose the simple direct implementation first (the one that is not used in the library):
The direct implementation
```haskell
= do
    (b, sf1') <- let go = MSF $ \a -> do
                       b <- const rand a
                       return (b, go)
                 in unMSF go ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    (b, sf1') <- let go = MSF $ \a -> do
                       b <- const rand a
                       return (b, go)
                 in do
                      b <- const rand ()
                      return (b, go)
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do -- Let's not care too much where the let floats, GHC will figure out the best place
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    b <- const rand ()
    (b, sf1') <- return (b, go)
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do -- Actually do some IO! rand produces value r0
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    (b, sf1') <- return (r0, go)
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    (c, sf2') <- unMSF (arrM putStrLn) r0
    (_, msf') <- c `seq` return (c, go >>> sf2')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    (c, sf2') <- let go' = MSF $ \b -> do
                       c <- putStrLn b
                       return (c, go')
                 in unMSF go' r0
    (_, msf') <- c `seq` return (c, go >>> sf2')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    (c, sf2') <- let go' = MSF $ \b -> do
                       c <- putStrLn b
                       return (c, go')
                 in do
                      c <- putStrLn r0
                      return (c, go')
    (_, msf') <- c `seq` return (c, go >>> sf2')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    let go' = MSF $ \b -> do
          c <- putStrLn b
          return (c, go')
    (c, sf2') <- do
      c <- putStrLn r0
      return (c, go')
    (_, msf') <- c `seq` return (c, go >>> sf2')
    reactimate msf'

= do -- Output r0
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    let go' = MSF $ \b -> do
          c <- putStrLn b
          return (c, go')
    (c, sf2') <- return ((), go')
    (_, msf') <- c `seq` return (c, go >>> sf2')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    let go' = MSF $ \b -> do
          c <- putStrLn b
          return (c, go')
    (_, msf') <- () `seq` return ((), go >>> go')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    let go' = MSF $ \b -> do
          c <- putStrLn b
          return (c, go')
    reactimate (go >>> go')
```
We've created two thunks, one for each component of the whole `MSF`, and executed both their bodies. From here on, the program will go on endlessly, and no new thunks are generated.
```haskell
= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    let go' = MSF $ \b -> do
          c <- putStrLn b
          return (c, go')
    (_, msf') <- unMSF (go >>> go') ()
    reactimate msf'

= ...
```
Maybe GHC does further optimizations (like inlining something or optimizing the `return`); I don't know. I didn't look at the Core because I don't understand it well yet.

Now what happens if we define `arrM` in the current way?
The library implementation
...

```haskell
= do
    (b, sf1') <- unMSF (arrM (const rand)) ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    (b, sf1') <- unMSF ((\f -> morphGS (\i a -> i a >>= \(_, c) -> f a >>= \b -> return (b, c)) C.id) (const rand)) ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    (b, sf1') <- unMSF (morphGS (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) C.id) ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    (b, sf1') <- let go msf = MSF $ \a2 -> do
                       (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
                       return (b2, go msf')
                 in unMSF (go C.id) ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    (b, sf1') <- let go msf = MSF $ \a2 -> do
                       (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
                       return (b2, go msf')
                 in do
                      (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF C.id) ()
                      return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF C.id) ()
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    (b2, msf') <- let pap = unMSF C.id in pap () >>= \(_, c) -> const rand () >>= \b -> return (b, c)
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'
```
Note that we have now created a thunk for the partial application `unMSF msf`.
```haskell
= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
        pap = unMSF C.id
        goId = MSF $ \a -> return (a, goId)
    (b2, msf') <- return ((), goId) >>= \(_, c) -> const rand () >>= \b -> return (b, c)
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'
```
There are no references to the partial application `pap` anymore; it needs to be garbage collected!
```haskell
= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
        goId = MSF $ \a -> return (a, goId)
    (b2, msf') <- return ((), goId) >>= \(_, c) -> const rand () >>= \b -> return (b, c)
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    let goId = MSF $ \a -> return (a, goId)
    (b2, msf') <- const rand () >>= \b -> return (b, goId)
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do -- Actually do side effect
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    let goId = MSF $ \a -> return (a, goId)
    (b2, msf') <- return (r0, goId)
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    let goId = MSF $ \a -> return (a, goId)
    (b, sf1') <- return (r0, go goId)
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'
```
What's interesting here is that we've created two thunks for something where we only needed one thunk before. And worse, there was a third thunk from the partial application, which then needs to be garbage collected. So we might already suspect that (in the absence of clever optimizations), this version will use a constant factor more space, and put load on the garbage collector.

The reason for the additional thunks is not so much the definition in terms of helper functions, but `morphGS` being higher order in the `MSF`! It takes an `MSF` as input (in this case a trivial identity function), and that needs to be carried around as well now.

I also believe that it cannot be optimized away easily, because `morphGS` doesn't apply `unMSF` fully (like e.g. `>>>` does), but only partially, as `unMSF msf`. That way the optimizer cannot inline the definition of the `MSF`. But this is speculation; I don't really understand the optimizer in detail, and haven't looked at the Core. Also, this partial application seems to trigger the garbage collector.

Anyway, let's continue the evaluation. Since the other `arrM` is defined in the same way, one might think that it would analogously produce a further thunk, but I believe that `id` is shared:
```haskell
= do -- In analogy to the last derivation
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    let goId = MSF $ \a -> return (a, goId)
    let go' msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> putStrLn a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go' msf')
    (_, msf') <- () `seq` return ((), go goId >>> go' goId)
    reactimate msf'

= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    let goId = MSF $ \a -> return (a, goId)
    let go' msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> putStrLn a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go' msf')
    reactimate (go goId >>> go' goId)
```
Summary
I believe that with the current version, we probably use more time and more space, and we introduce garbage collection in the first place, although there is no internal state. The red flags are probably either higher-order functions in `MSF` or partial application of `unMSF`. This is all theoretical, but it fits the observations in turion/rhine#227.
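To make the comparison concrete, here is a self-contained sketch that pits the direct definition against a `morphGS`-style one specialized to this case (again with a locally re-defined minimal `MSF`, and a list runner analogous to dunai's `embed`). Both compute the same outputs; the difference the derivation above points at is the extra identity `MSF` the second version threads through every step:

```haskell
-- Minimal stand-in for dunai's MSF, for illustration only.
newtype MSF m a b = MSF { unMSF :: a -> m (b, MSF m a b) }

-- Direct definition: one closed loop, no extra state.
arrMDirect :: Monad m => (a -> m b) -> MSF m a b
arrMDirect f = go
  where
    go = MSF $ \a -> do
      b <- f a
      return (b, go)

-- morphGS-style definition, specialized to arrM: the identity MSF
-- idMSF is stepped and re-wrapped on every tick, even though its
-- output is discarded.
arrMMorph :: Monad m => (a -> m b) -> MSF m a b
arrMMorph f = go idMSF
  where
    idMSF = MSF $ \a -> return (a, idMSF)
    go msf = MSF $ \a -> do
      (_, msf') <- unMSF msf a  -- useless step through the carried MSF
      b <- f a
      return (b, go msf')

-- Run an MSF over a finite list of inputs (shape of dunai's embed).
embed :: Monad m => MSF m a b -> [a] -> m [b]
embed _ [] = return []
embed sf (a : as) = do
  (b, sf') <- unMSF sf a
  bs <- embed sf' as
  return (b : bs)
```

Running `embed (arrMDirect f) xs` and `embed (arrMMorph f) xs` gives the same results; only the allocation behaviour should differ.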
@turion thanks a lot for this very detailed report. I know it's a lot to ask on top of what you've already done, but would you be able to put together a small benchmark that shows the difference in terms of performance with one and the other?
Yes, I managed to find a benchmark using `ListT` that I believe reproduces the issue. It's much more visible there because many `MSF`s are launched.
Reproduce with `cabal run spaceleak --enable-profiling -- +RTS -hc -l-agu -i0.01 && eventlog2html spaceleak.eventlog` on branch https://github.com/turion/dunai/tree/dev_spaceleak.
Library on branch `develop`, with benchmark added (89d4125 "Adding spaceleak").

Replaced `arrM` definition with direct implementation (110e026 "Write out arrM"):

- Huge performance win (time per tick is reduced a lot)
- Peak memory usage reduced from 28 to 21 MB
- Space leak still visible

Heavy inlining and strictification in many places (shotgun optimization, 11ba331 "Inline many functions"):

- Again a huge performance win
- Peak memory usage further reduced
- Space leak still visible
- Unclear yet which of the many optimizations had an effect; further research necessary

The branch may serve as a starting point for further future optimizations, of which `arrM` is only one.
Further bisecting of all the possible inlines suggests that inlining `arr` is also a huge performance win.

My current strategy to identify the minimal set of necessary changes is:

- Bisect on the profiling result until a commit is identified that improves performance
- Cherry-pick that commit onto a branch `improvements`
- Rebase the whole branch with all inlinings onto `improvements`
- Repeat until the performance-improving commit is already on `improvements`
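If the direct definitions plus inlining turn out to be the winning combination, the change might look roughly like this (a hypothetical sketch with a minimal local `MSF`; which pragmas actually matter is exactly what the bisection is meant to establish):

```haskell
-- Minimal stand-in for dunai's MSF, for illustration only.
newtype MSF m a b = MSF { unMSF :: a -> m (b, MSF m a b) }

-- Direct, first-order definitions with INLINE pragmas, mirroring the
-- "Write out arrM" and "Inline many functions" commits.
arr :: Monad m => (a -> b) -> MSF m a b
arr f = go
  where
    go = MSF $ \a -> return (f a, go)
{-# INLINE arr #-}

arrM :: Monad m => (a -> m b) -> MSF m a b
arrM f = go
  where
    go = MSF $ \a -> do
      b <- f a
      return (b, go)
{-# INLINE arrM #-}
```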
Just a note: issue #375 addressed the introduction of a benchmark. It will require discussion before making changes, but it may be worth seeing whether those benchmarks, or how they are used, should be improved.
Trying the benchmarks briefly seems to show that the changes here have a significant impact on them. I think it would be very helpful if we could plot benchmarks (+ error bars) for several commits to compare them.