nh2/haskell-cpu-instruction-counter

Running several counters concurrently

zudov opened this issue · 3 comments

zudov commented

perf can only monitor a specific OS process running on a specific CPU core (or on all cores). It's unaware of Haskell's RTS and OS threads.

I expect that running several counters concurrently may give strange, confusing results. Running a counter test at the same time as other (non-cpu-instruction-counter) tests will also be confusing.

Now, some test runners (e.g. tasty) do parallel test execution by default. This may be a great source of confusion for an unaware user.

I have several ideas of varying complexity that could help here, but ultimately we have to play around and investigate this.

  • Add a visible notice to README telling users not to run counters concurrently
  • Make a global lock that is taken by startInstructionCounter
    • if the lock is taken, next startInstructionCounter can fail with meaningful error message;
    • or it could just wait until the lock is released, which will sequentialize cpu-counter tests
    • BUT: all of this seems hacky and won't help if you concurrently run non cpu-instruction-counter tests
  • We can investigate how to actually make it work concurrently:
    • startInstructionCounter can return a Handle that will allow to work with this specific counter, tracking information related to it
    • we can use forkOn to run on specific capability which usually corresponds to a core. It's implementation dependent, but we only work on Linux so it's probably fine
      • but probably there's a more reliable way to fork onto specific core, I don't know
    • I scanned through a manpage and noticed interesting variables like PERF_SAMPLE_ID, PERF_FORMAT_ID, PERF_SAMPLE_GROUP. I didn't look any closer yet, but maybe these can be used for reliably tracking several counters. This Stack Overflow question may be related, but I didn't read it closely.
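
The global-lock idea from the list above could be sketched roughly as follows. This is only an illustration, not the library's implementation: counterLock, withCounterBlocking and withCounterOrFail are hypothetical names, and the measured action stands in for whatever code starts and reads the counter.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, putMVar, tryTakeMVar, withMVar)
import Control.Exception (ErrorCall (..), finally, mask, throwIO)
import System.IO.Unsafe (unsafePerformIO)

-- | Hypothetical process-wide lock guarding the instruction counter.
-- NOINLINE ensures that only one MVar is ever created.
counterLock :: MVar ()
counterLock = unsafePerformIO (newMVar ())
{-# NOINLINE counterLock #-}

-- | Variant 1: wait until the lock is free, which serialises measurements.
withCounterBlocking :: IO a -> IO a
withCounterBlocking measured = withMVar counterLock (\() -> measured)

-- | Variant 2: fail immediately if another measurement is in progress.
withCounterOrFail :: IO a -> IO a
withCounterOrFail measured = mask $ \restore -> do
  acquired <- tryTakeMVar counterLock
  case acquired of
    Nothing -> throwIO (ErrorCall "instruction counter is already in use")
    -- Release the lock even if the measured action throws.
    Just () -> restore measured `finally` putMVar counterLock ()
```

As the list already notes, a lock like this only serialises code that goes through it; tests that never touch the counter can still run concurrently and disturb the measurement.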

I can only be sure about the first option (warn users in the README). In any case, cpu-instruction-counter is a thing that works only on Linux and uses FFI, so the best practice should be that all instruction counting tests/benchmarks live in separate executable, that's compiled with +RTS -N1 which eliminates the problem.

nh2 commented

@zudov Great points, thanks!

Add a visible notice to README telling users not to run counters concurrently

Good idea, I just did it with 077539f.

if the lock is taken, next startInstructionCounter can fail with meaningful error message

That sounds like a good idea until we have cleared up how exactly parallel usage behaves.

We probably want to do that locking against what's returned by perfEventOpenHwInstructions though. It is the one that chooses (in its C implementation) to record events for all threads. It would be legitimate to obtain an event FD that doesn't do that (e.g. one that only listens to events on a particular thread), and then call startInstructionCounter in parallel on two such event counters.

So I think that in general the best approach is to expose both an API that lets you do everything conveniently from Haskell, and one that is safe against common errors (such as accidentally doing parallel perf invocations).

startInstructionCounter can return a Handle that will allow to work with this specific counter, tracking information related to it

That one I don't quite understand. perfEventOpenHwInstructions is what returns such a handle.

we can use forkOn to run on specific capability which usually corresponds to a core. It's implementation dependent, but we only work on Linux so it's probably fine

This may not be sufficient in general. What happens if the function f passed to forkOn itself calls forkOn with another CPU (or just forkIO)?
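
A small sketch of that concern, using only functions from Control.Concurrent: a thread created with forkOn is locked to its capability, but a thread it creates with forkIO is not. pinnedParentAndChild is a hypothetical helper written for this illustration.

```haskell
import Control.Concurrent (forkIO, forkOn, myThreadId, threadCapability)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- | Fork a thread locked to capability 0, let it fork an ordinary child
-- with forkIO, and report (capability number, locked?) for both threads.
pinnedParentAndChild :: IO ((Int, Bool), (Int, Bool))
pinnedParentAndChild = do
  parentVar <- newEmptyMVar
  childVar <- newEmptyMVar
  _ <- forkOn 0 $ do
    myThreadId >>= threadCapability >>= putMVar parentVar
    -- The child is a plain forkIO thread, so it is not locked to capability 0.
    _ <- forkIO (myThreadId >>= threadCapability >>= putMVar childVar)
    pure ()
  (,) <$> takeMVar parentVar <*> takeMVar childVar
```

The Bool returned by threadCapability is True only for threads created with forkOn, so the child reports False and the RTS may migrate it to another capability when more than one is available.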

the best practice should be that all instruction counting tests/benchmarks live in separate executable, that's compiled with +RTS -N1 which eliminates the problem.

That's not accurate:

Even with -N1 you may have 30 threads running. With -threaded, each safe FFI call spawns a new pthread, no matter what you give for -N (see docs).

Only the non-threaded RTS provides the guarantee you're speaking of.
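
One way a counting test suite might guard against this is to check at startup which RTS it was linked with. The sketch below relies on rtsSupportsBoundThreads from Control.Concurrent, which is True exactly when the program is linked with -threaded; RtsCheck, checkRts and requireNonThreaded are hypothetical names, not part of this library.

```haskell
import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)

-- | Result of inspecting the runtime the program was linked against.
data RtsCheck
  = NonThreadedRts   -- ^ linked without -threaded
  | ThreadedRts Int  -- ^ linked with -threaded; current number of capabilities
  deriving (Eq, Show)

-- | Inspect the running RTS.
checkRts :: IO RtsCheck
checkRts
  | rtsSupportsBoundThreads = ThreadedRts <$> getNumCapabilities
  | otherwise = pure NonThreadedRts

-- | Run the measured action only under the non-threaded RTS.
requireNonThreaded :: IO a -> IO a
requireNonThreaded measured = do
  check <- checkRts
  case check of
    NonThreadedRts -> measured
    ThreadedRts n ->
      ioError . userError $
        "instruction counting needs the non-threaded RTS, but this binary "
          ++ "uses -threaded with " ++ show n ++ " capabilities"
```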

nh2 commented

There is another related topic that needs to be cleared up: #7

In weigh I handle this by launching the process n times per test, because I want a fresh cold process for each run. A suite-like interface to this library could do the same thing.
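
A rough sketch of the process-per-run approach, assuming a child executable that prints a single instruction count on stdout. That output format, and the name measureInFreshProcesses, are assumptions made for this illustration; this is not how weigh or this library is actually implemented. It uses readProcess from the process package.

```haskell
import Control.Monad (replicateM)
import System.Process (readProcess)
import Text.Read (readMaybe)

-- | Run the given executable n times, each time in a fresh process, and
-- parse one Integer (the assumed instruction count) from each run's stdout.
measureInFreshProcesses :: Int -> FilePath -> [String] -> IO [Integer]
measureInFreshProcesses n exe args = replicateM n runOnce
  where
    runOnce = do
      out <- readProcess exe args ""
      case readMaybe out of
        Just count -> pure count
        Nothing -> ioError (userError ("unparseable child output: " ++ show out))
```

Because every run starts a new process, no counter state is shared between runs, at the cost of process startup time per measurement.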