haskell-perf

GHC Patch: https://github.com/composewell/ghc/tree/ghc-8.10.7-eventlog-enhancements

Enable Linux perf counters

Enable unrestricted use of perf counters:

# echo -1 > /proc/sys/kernel/perf_event_paranoid
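
Equivalently, the sysctl utility can be used:

# sysctl -w kernel.perf_event_paranoid=-1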

Disable CPU scaling

Set the scaling governor of all your CPUs to performance:

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
...
...
echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
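
Or, as a sketch that loops over all CPUs (using the same cpufreq paths as above; run as root):

for gov in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
    echo performance > $gov
done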

Generating the eventlog

To generate the eventlog, we need to compile the program with the eventlog enabled and run it with the -l RTS option.

There are multiple ways of doing this.

Using plain GHC:

ghc Main.hs -rtsopts -eventlog
./Main +RTS -l -RTS

Using Cabal:

The .cabal file should contain the following GHC options:

ghc-options: -eventlog "-with-rtsopts=-l"

If the -threaded option is used while compiling, you may want to use the -N1 RTS option.
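
For example, a minimal executable stanza might look like the following (my-app and Main.hs are placeholder names; the -threaded flag and the -N1 RTS option apply only if a threaded build is wanted):

executable my-app
  main-is:       Main.hs
  build-depends: base
  ghc-options:   -threaded -eventlog "-with-rtsopts=-l -N1"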

Creating windows

Helper function to create windows:

{-# LANGUAGE BangPatterns #-}

import Control.Monad.IO.Class (MonadIO(..))
import Debug.Trace (traceEventIO)

{-# INLINE withTracingFlow #-}
withTracingFlow :: MonadIO m => String -> m a -> m a
withTracingFlow tag action = do
    liftIO $ traceEventIO ("START:" ++ tag)
    !res <- action
    liftIO $ traceEventIO ("END:" ++ tag)
    pure res

We can wrap parts of the flow we want to analyze with withTracingFlow using a tag to help us identify it.

End of Window

You can put the END of the window on different code paths, but ensure that all paths are covered:

  r <- f x
  case r of
    Just val -> do
      liftIO $ traceEventIO ("END:" ++ "window")
      -- Some processing
    Nothing -> do
      liftIO $ traceEventIO ("END:" ++ "window")
      -- Some processing

Measurement Overhead

Even when you are measuring an empty block of code, there will be some minimum time and allocation reported because of the measurement overhead.

    _ <- traceEventIO $ "START:emptyWindow"
    _ <- traceEventIO $ "END:emptyWindow"

The timing is due to the time measurement system call itself. The allocations are due to the execution of the traceEventIO Haskell code. TODO: fix the allocations.
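
A minimal sketch to estimate this baseline (the emptyWindow tag and the iteration count are arbitrary): compile it with -eventlog, run it with +RTS -l, and look at the per-window averages reported for emptyWindow.

import Control.Monad (replicateM_)
import Debug.Trace (traceEventIO)

main :: IO ()
main =
    -- Emit an empty window repeatedly so the measurement overhead shows
    -- up as a stable per-window average in the statistics.
    replicateM_ 100 $ do
        traceEventIO "START:emptyWindow"
        traceEventIO "END:emptyWindow"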

Measurement with Lazy Evaluation

If we want to measure the cost of the lookup in the code below, we need to evaluate it right there:

    m <- readIORef _configCache
    return . snd $ SimpleLRU.lookup k m

For correct measurement, use the following code:

    m <- readIORef _configCache
    _ <- traceEventIO $ "START:" ++ "mapLookup"
    let !v = snd $ SimpleLRU.lookup k m
    _ <- traceEventIO $ "END:" ++ "mapLookup"
    return v
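
The bang pattern evaluates the result only to weak head normal form, which is enough to force the lookup itself. If the window should also include the cost of fully building the returned value, the value can be deep-forced inside the window. A sketch, assuming the value type has an NFData instance and that evaluate (Control.Exception) and force (Control.DeepSeq) are imported:

    m <- readIORef _configCache
    _ <- traceEventIO $ "START:" ++ "mapLookup"
    -- force evaluates the value completely before END is emitted
    v <- evaluate $ force $ snd $ SimpleLRU.lookup k m
    _ <- traceEventIO $ "END:" ++ "mapLookup"
    return v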

Labelling Threads

We should label our threads to identify the thread to scrutinize while reading the stats.

For example,

To scrutinize the main thread:

import GHC.Conc (myThreadId, labelThread)

main :: IO ()
main = do
    tid <- myThreadId
    labelThread tid "main-thread"
    withTracingFlow "main" $ do
       ...

To scrutinize the server thread in warp we can use the following middleware:

import Debug.Trace (traceEventIO)
import GHC.Conc (myThreadId, labelThread)
import Network.Wai (Application)

-- Label the handler thread and emit START; respond1 emits the matching
-- END after the response has been delivered.
eventlogMiddleware :: Application -> Application
eventlogMiddleware app request respond = do
    tid <- myThreadId
    labelThread tid "server"
    traceEventIO ("START:server")
    app request respond1

    where

    respond1 r = do
        res <- respond r
        traceEventIO ("END:server")
        return res

We can use eventlogMiddleware as the outermost layer.
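
For example, a sketch assuming a warp server, where run is Network.Wai.Handler.Warp.run and app and the port are placeholders for whatever the real program uses:

main :: IO ()
main = run 8080 (eventlogMiddleware app)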

Reading the results

We get a lot of output currently. We are in the process of simplifying the statistics and making the details controllable via options.

Currently, the program prints a lot of information, so it is essential to understand what to ignore for a given use case.

The use case we assume is: understand a window's ThreadCPUTime and ThreadAllocated.

Consider the following program:

{-# LANGUAGE BangPatterns #-}

import Control.Monad.IO.Class (MonadIO(..))
import Debug.Trace (traceEventIO)
import GHC.Conc (myThreadId, labelThread)

{-# INLINE withTracingFlow #-}
withTracingFlow :: MonadIO m => String -> m a -> m a
withTracingFlow tag action = do
    liftIO $ traceEventIO ("START:" ++ tag)
    !res <- action
    liftIO $ traceEventIO ("END:" ++ tag)
    pure res

{-# INLINE printSumLoop #-}
printSumLoop :: Int -> Int -> Int -> IO ()
printSumLoop _ _ 0 = print "All Done!"
printSumLoop chunksOf from times = do
    withTracingFlow "sum" $ print $ sum [from..(from + chunksOf)]
    printSumLoop chunksOf (from + chunksOf) (times - 1)

main :: IO ()
main = do
    tid <- myThreadId
    labelThread tid "main-thread"
    withTracingFlow "main" $ do
         printSumLoop 10000 1 100
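
To produce the eventlog for this program, compile and run it as described earlier (assuming the file is named Main.hs):

ghc Main.hs -rtsopts -eventlog
./Main +RTS -l -RTS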

The statistics gleaned from the eventlog of the above program will look like the following:

--------------------------------------------------
Summary Stats
--------------------------------------------------

Global thread wise stat summary
tid       label samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
  1 main-thread       2       967,479         434,384
  2           -       1         5,854          17,664

  -           -       3       973,333         452,048


Window [1:main] thread wise stat summary
ProcessCPUTime: 1,174,455
ProcessUserCPUTime: 0
ProcessSystemCPUTime: 1,175,000

ThreadCPUTime:934,898
GcCPUTime:0
RtsCPUTime:239,557
tid       label samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
  1 main-thread       1       934,898         429,952

  -           -       1       934,898         429,952


Window [1:sum] thread wise stat summary
ProcessCPUTime: 953,862
ProcessUserCPUTime: 0
ProcessSystemCPUTime: 949,000

ThreadCPUTime:833,991
GcCPUTime:0
RtsCPUTime:119,871
tid       label samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
  1 main-thread     100       833,991         328,224

  -           -     100       833,991         328,224


--------------------------------------------------
Detailed Stats
--------------------------------------------------

Window [1:main] thread wise stats for [ThreadCPUTime]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 934,898     1 934,898 934,898 934,898      0


Grand total: 934,898

Window [1:main] thread wise stats for [ThreadAllocated]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 429,952     1 429,952 429,952 429,952      0


Grand total: 429,952

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 833,991   100 8,340   5,533  63,493  5,714


Grand total: 833,991

Window [1:sum] thread wise stats for [ThreadAllocated]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 328,224   100 3,282   2,960  31,584  2,844


Grand total: 328,224

Global thread wise stats for [ThreadCPUTime]
tid       label   total count     avg minimum maximum  stddev
--- ----------- ------- ----- ------- ------- ------- -------
  1 main-thread 967,479     2 483,740  33,519 933,960 450,220
  2           -   5,854     1   5,854   5,854   5,854       0


Grand total: 973,333

Global thread wise stats for [ThreadAllocated]
tid       label   total count     avg minimum maximum  stddev
--- ----------- ------- ----- ------- ------- ------- -------
  1 main-thread 434,384     2 217,192   4,920 429,464 212,272
  2           -  17,664     1  17,664  17,664  17,664       0


Grand total: 452,048

From the Global thread wise stat summary under Summary Stats, figure out the thread id we want to scrutinize. In this case, we care about the main-thread, whose thread id is 1.

We can skip to the Detailed Stats section.

We want to look at all the windows of interest that run in the main-thread. The windows in the above program are main and sum. The thread id is prepended to the window name, so we want to look at the sections corresponding to [1:main] and [1:sum].

That is,

Window [1:main] thread wise stats for [ThreadCPUTime]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 934,898     1 934,898 934,898 934,898      0


Grand total: 934,898

Window [1:main] thread wise stats for [ThreadAllocated]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 429,952     1 429,952 429,952 429,952      0


Grand total: 429,952

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 833,991   100 8,340   5,533  63,493  5,714


Grand total: 833,991

Window [1:sum] thread wise stats for [ThreadAllocated]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 328,224   100 3,282   2,960  31,584  2,844

Consider one specific section:

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 833,991   100 8,340   5,533  63,493  5,714

This section is a table with 8 columns and possibly multiple rows. We should only scrutinize the row whose tid corresponds to main-thread, i.e. tid == 1.

ThreadCPUTime is reported in nanoseconds and ThreadAllocated in bytes.

Columns:

  • tid: The thread id
  • label: The thread label
  • total: The total accumulated sum of all the samples
  • count: The number of samples, i.e. the number of times this window was seen
  • avg: The average of the samples
  • minimum: The minimum of all the samples
  • maximum: The maximum of all the samples
  • stddev: The standard deviation of the samples
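
For example, in the Window [1:sum] table for ThreadCPUTime above, avg = total / count = 833,991 / 100 ≈ 8,340 nanoseconds.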

NOTE: It is important to look at stddev. If stddev is more than 30% of the average, and the difference between the minimum and maximum is large, the average may be skewed by outliers. In the future we would like to remove outliers automatically.
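
For instance, in the Window [1:sum] stats for ThreadCPUTime above, the stddev (5,714) is about 68% of the average (8,340) and the maximum (63,493) is more than ten times the minimum (5,533), so a few outliers are likely inflating the average.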