`dunai`: `morphGS` is probably inefficient
turion opened this issue · 7 comments
While investigating turion/rhine#227, I found that replacing the current definition of `arrM` in terms of `morphGS` with a direct definition gives a considerable speedup, and I'm trying here to explain where this speedup might come from. I'd recommend replacing the definition, because `arrM` is ubiquitous in every `dunai` program.
The implementations:
```haskell
arrM :: Monad m => (a -> m b) -> MSF m a b
arrM f =
  -- This implementation is equivalent to:
  --   arrM f = go
  --     where
  --       go = MSF $ \a -> do
  --         b <- f a
  --         return (b, go)
  morphGS (\i a -> i a >>= \(_, c) -> f a >>= \b -> return (b, c)) C.id
```
The implementation in terms of `morphGS` has the advantage of not using the `MSF` constructor, but it is also much slower at runtime.
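For reference, `morphGS` is essentially the following recursive wrapper (a self-contained sketch with a minimal stand-in for `MSF`, so the structure is visible; it is an illustration, not the library source):

```haskell
{-# LANGUAGE RankNTypes #-}
import Control.Category (Category (..))
import Prelude hiding (id, (.))

-- Minimal stand-in for dunai's MSF, for illustration only.
newtype MSF m a b = MSF { unMSF :: a -> m (b, MSF m a b) }

instance Monad m => Category (MSF m) where
  id = MSF $ \a -> return (a, id)
  sf2 . sf1 = MSF $ \a -> do
    (b, sf1') <- unMSF sf1 a
    (c, sf2') <- unMSF sf2 b
    return (c, sf2' . sf1')

-- The shape of morphGS: note the partial application 'unMSF msf',
-- and that the input MSF is carried through every single step.
morphGS :: Monad m2
        => (forall c. (a1 -> m1 (b1, c)) -> a2 -> m2 (b2, c))
        -> MSF m1 a1 b1
        -> MSF m2 a2 b2
morphGS morph msf = MSF $ \a2 -> do
  (b2, msf') <- morph (unMSF msf) a2
  return (b2, morphGS morph msf')

-- arrM as currently defined in the library, via morphGS and the
-- identity MSF (written C.id in the snippet above):
arrM :: Monad m => (a -> m b) -> MSF m a b
arrM f = morphGS (\i a -> i a >>= \(_, c) -> f a >>= \b -> return (b, c)) id
```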
Lengthy derivation of example program evaluation
Example program
Let's consider this simple example program, which creates a random string (using a fictitious function `rand :: IO String`) and prints it:

```haskell
reactimate $ arrM (const rand) >>> arrM putStrLn
```
Let us expand the definitions of the functions to see how this is evaluated and executed:
```haskell
= let msf = arrM (const rand) >>> arrM putStrLn in reactimate msf

= let msf = arrM (const rand) >>> arrM putStrLn in do
    (_, msf') <- unMSF msf ()
    reactimate msf'

= let msf = MSF $ \a -> do
        (b, sf1') <- unMSF (arrM (const rand)) a
        (c, sf2') <- unMSF (arrM putStrLn) b
        c `seq` return (c, sf1' >>> sf2')
  in do
    (_, msf') <- unMSF msf ()
    reactimate msf'

= do
    (b, sf1') <- unMSF (arrM (const rand)) ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'
```
Now it depends on how `arrM` is defined. Let us choose the simple direct implementation first (the one that is not used in the library):
The direct implementation
```haskell
= do
    (b, sf1') <- let go = MSF $ \a -> do
                       b <- const rand a
                       return (b, go)
                 in unMSF go ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    (b, sf1') <- let go = MSF $ \a -> do
                       b <- const rand a
                       return (b, go)
                 in do
                      b <- const rand ()
                      return (b, go)
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do -- Let's not care too much where the let floats, GHC will figure out the best place
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    b <- const rand ()
    (b, sf1') <- return (b, go)
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do -- Actually do some IO! rand produces value r0
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    (b, sf1') <- return (r0, go)
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    (c, sf2') <- unMSF (arrM putStrLn) r0
    (_, msf') <- c `seq` return (c, go >>> sf2')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    (c, sf2') <- let go' = MSF $ \b -> do
                       c <- putStrLn b
                       return (c, go')
                 in unMSF go' r0
    (_, msf') <- c `seq` return (c, go >>> sf2')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    (c, sf2') <- let go' = MSF $ \b -> do
                       c <- putStrLn b
                       return (c, go')
                 in do
                      c <- putStrLn r0
                      return (c, go')
    (_, msf') <- c `seq` return (c, go >>> sf2')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    let go' = MSF $ \b -> do
          c <- putStrLn b
          return (c, go')
    (c, sf2') <- do
      c <- putStrLn r0
      return (c, go')
    (_, msf') <- c `seq` return (c, go >>> sf2')
    reactimate msf'

= do -- Output r0
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    let go' = MSF $ \b -> do
          c <- putStrLn b
          return (c, go')
    (c, sf2') <- return ((), go')
    (_, msf') <- c `seq` return (c, go >>> sf2')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    let go' = MSF $ \b -> do
          c <- putStrLn b
          return (c, go')
    (_, msf') <- () `seq` return ((), go >>> go')
    reactimate msf'

= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    let go' = MSF $ \b -> do
          c <- putStrLn b
          return (c, go')
    reactimate (go >>> go')
```
We've created two thunks, one for each component of the whole `MSF`, and executed both their bodies. From here on, the program will go on endlessly, and no new thunks are generated.
```haskell
= do
    let go = MSF $ \a -> do
          b <- const rand a
          return (b, go)
    let go' = MSF $ \b -> do
          c <- putStrLn b
          return (c, go')
    (_, msf') <- unMSF (go >>> go') ()
    reactimate msf'

= ...
```
Maybe GHC does further optimizations (like inlining something or optimizing the `return`); I don't know. I didn't look at the Core because I don't understand it well yet.

Now what happens if we define `arrM` in the current way?
The library implementation
...

```haskell
= do
    (b, sf1') <- unMSF (arrM (const rand)) ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    (b, sf1') <- unMSF ((\f -> morphGS (\i a -> i a >>= \(_, c) -> f a >>= \b -> return (b, c)) C.id) (const rand)) ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    (b, sf1') <- unMSF (morphGS (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) C.id) ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    (b, sf1') <- let go msf = MSF $ \a2 -> do
                       (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
                       return (b2, go msf')
                 in unMSF (go C.id) ()
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    (b, sf1') <- let go msf = MSF $ \a2 -> do
                       (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
                       return (b2, go msf')
                 in do
                      (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF C.id) ()
                      return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF C.id) ()
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    (b2, msf') <- let pap = unMSF C.id in pap () >>= \(_, c) -> const rand () >>= \b -> return (b, c)
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'
```
Note that we have now created a thunk for the partial application `unMSF msf`.
```haskell
= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
        pap = unMSF C.id
        goId = MSF $ \a -> return (a, goId)
    (b2, msf') <- return ((), goId) >>= \(_, c) -> const rand () >>= \b -> return (b, c)
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'
```
There are no references to the partial application `pap` anymore; it needs to be garbage collected!
```haskell
= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
        goId = MSF $ \a -> return (a, goId)
    (b2, msf') <- return ((), goId) >>= \(_, c) -> const rand () >>= \b -> return (b, c)
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    let goId = MSF $ \a -> return (a, goId)
    (b2, msf') <- const rand () >>= \b -> return (b, goId)
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do -- Actually do side effect
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    let goId = MSF $ \a -> return (a, goId)
    (b2, msf') <- return (r0, goId)
    (b, sf1') <- return (b2, go msf')
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'

= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    let goId = MSF $ \a -> return (a, goId)
    (b, sf1') <- return (r0, go goId)
    (c, sf2') <- unMSF (arrM putStrLn) b
    (_, msf') <- c `seq` return (c, sf1' >>> sf2')
    reactimate msf'
```
What's interesting here is that we've created two thunks for something where we only needed one thunk before. And worse, there was a third thunk from the partial application, which then needs to be garbage collected. So we might already suspect that (in the absence of clever optimizations), this version will use a constant factor more space, and put load on the garbage collector.

The reason for the additional thunks is not so much the definition in terms of helper functions, but `morphGS` being higher order in the `MSF`! It takes an `MSF` as input (in this case a trivial identity function), and that needs to be carried around as well now.

I also believe that it cannot be optimized away easily, because `morphGS` doesn't apply `unMSF` fully (like e.g. `>>>` does), but only partially, as `unMSF msf`. That way the optimizer cannot inline the definition of the `MSF`. But this is speculation; I don't really understand the optimizer in detail, and haven't looked at the Core. Also, this partial application seems to trigger the garbage collector.

Anyway, let's continue the evaluation. Since the other `arrM` is defined in the same way, one might think that it would analogously produce a further thunk, but I believe that `id` is shared:
```haskell
= do -- In analogy to the last derivation
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    let goId = MSF $ \a -> return (a, goId)
    let go' msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> putStrLn a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go' msf')
    (_, msf') <- () `seq` return ((), go goId >>> go' goId)
    reactimate msf'

= do
    let go msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go msf')
    let goId = MSF $ \a -> return (a, goId)
    let go' msf = MSF $ \a2 -> do
          (b2, msf') <- (\i a -> i a >>= \(_, c) -> putStrLn a >>= \b -> return (b, c)) (unMSF msf) a2
          return (b2, go' msf')
    reactimate (go goId >>> go' goId)
```
Summary
I believe that with the current version, we probably use more time and more space, and we introduce garbage collection in the first place, although there is no internal state. The red flags are probably either higher-order functions in `MSF` or partial application of `unMSF`. This is all theoretical, but it fits the observations in turion/rhine#227.
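To make the comparison concrete, here is a self-contained sketch that pits the direct definition against a `morphGS`-style one specialized to this case (again with a locally re-defined minimal `MSF`, and a list runner analogous to dunai's `embed`). Both compute the same outputs; the difference the derivation above points at is the extra identity `MSF` the second version threads through every step:

```haskell
-- Minimal stand-in for dunai's MSF, for illustration only.
newtype MSF m a b = MSF { unMSF :: a -> m (b, MSF m a b) }

-- Direct definition: one closed loop, no extra state.
arrMDirect :: Monad m => (a -> m b) -> MSF m a b
arrMDirect f = go
  where
    go = MSF $ \a -> do
      b <- f a
      return (b, go)

-- morphGS-style definition, specialized to arrM: the identity MSF
-- idMSF is stepped and re-wrapped on every tick, even though its
-- output is discarded.
arrMMorph :: Monad m => (a -> m b) -> MSF m a b
arrMMorph f = go idMSF
  where
    idMSF = MSF $ \a -> return (a, idMSF)
    go msf = MSF $ \a -> do
      (_, msf') <- unMSF msf a  -- useless step through the carried MSF
      b <- f a
      return (b, go msf')

-- Run an MSF over a finite list of inputs (shape of dunai's embed).
embed :: Monad m => MSF m a b -> [a] -> m [b]
embed _ [] = return []
embed sf (a : as) = do
  (b, sf') <- unMSF sf a
  bs <- embed sf' as
  return (b : bs)
```

Running `embed (arrMDirect f) xs` and `embed (arrMMorph f) xs` gives the same results; only the allocation behaviour should differ.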
@turion thanks a lot for this very detailed report. I know it's a lot to ask on top of what you've already done, but would you be able to put together a small benchmark that shows the difference in terms of performance with one and the other?
Yes, I managed to find a benchmark using `ListT` that I believe reproduces the issue. It's much more visible there because many `MSF`s are launched.
Reproduce with `cabal run spaceleak --enable-profiling -- +RTS -hc -l-agu -i0.01 && eventlog2html spaceleak.eventlog` on branch https://github.com/turion/dunai/tree/dev_spaceleak.
Library on branch `develop`, with benchmark added (89d4125 "Adding spaceleak").

Replaced `arrM` definition with direct implementation (110e026 "Write out arrM"):

- Huge performance win (time per tick is reduced a lot)
- Peak memory usage reduced from 28 to 21 MB
- Space leak still visible

Heavy inlining and strictification in many places (shotgun optimization, 11ba331 "Inline many functions"):

- Again a huge performance win
- Peak memory usage further reduced
- Space leak still visible
- Unclear yet which of the many optimizations had an effect; further research necessary

The branch may serve as a starting point for further future optimizations, of which `arrM` is only one.
Further bisecting of all the possible inlines suggests that inlining `arr` is also a huge performance win.

My current strategy to identify the minimal set of necessary changes is:

- Bisect on the profiling result until a commit is identified that improves performance
- Cherry-pick that commit onto a branch `improvements`
- Rebase the whole branch with all inlinings onto `improvements`
- Repeat until the performance-improving commit is already on `improvements`
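If the direct definitions plus inlining turn out to be the winning combination, the change might look roughly like this (a hypothetical sketch with a minimal local `MSF`; which pragmas actually matter is exactly what the bisection is meant to establish):

```haskell
-- Minimal stand-in for dunai's MSF, for illustration only.
newtype MSF m a b = MSF { unMSF :: a -> m (b, MSF m a b) }

-- Direct, first-order definitions with INLINE pragmas, mirroring the
-- "Write out arrM" and "Inline many functions" commits.
arr :: Monad m => (a -> b) -> MSF m a b
arr f = go
  where
    go = MSF $ \a -> return (f a, go)
{-# INLINE arr #-}

arrM :: Monad m => (a -> m b) -> MSF m a b
arrM f = go
  where
    go = MSF $ \a -> do
      b <- f a
      return (b, go)
{-# INLINE arrM #-}
```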
Just a note: issue #375 addressed the introduction of a benchmark. It will require discussion before making changes, but it may be worth seeing whether those benchmarks, or how they are used, should be improved.
Trying the benchmarks briefly seems to show that the changes here have a significant impact on them. I think it would be very helpful if we could plot benchmarks (+ error bars) for several commits to compare them.