python/pyperformance

benchmark representing async-heavy, heterogeneous workload

belm0 opened this issue · 7 comments

belm0 commented

Context: I recently set about trying to find a faster Python interpreter for my async app, which has a high rate of context switching, many coroutines, and each coroutine running different code. I found that the app runs 2x slower on PyPy and 20% faster on Pyston-full. One reason may be that a tracing JIT will naively trace across context switches, so that a given trace is effectively never repeated, due to the arbitrary occurrence and ordering of coroutine resumes.

There really seems to be nothing in the benchmark world that represents this kind of async workload-- which, by the way, I expect to become more common over time. The current async benchmarks in pyperformance have nothing like it-- the coroutines are trivial and homogeneous.

It's concerning that the Pyston full fork is being retired, while PyPy blindly continues as if everything's OK, and faster-cpython proceeds at a furious pace-- all without evaluating a workload that is significant today and may become more so in the future.

Maybe you can propose simple yet representative code for such a benchmark?

belm0 commented

I'm not sure how to go about it, as taxing a tracing JIT would require a lot of code (I think PyPy gives up after a ~6,000-opcode trace). Some kind of code generation may need to be employed. I'm open to ideas.
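To make that concrete, here's a rough sketch of the code-generation idea (everything here is illustrative-- the function names, bodies, and counts are made up): generate many structurally distinct coroutine functions, so a tracing JIT sees different code on every task rather than one hot trace.

```python
import textwrap

def make_coroutine_functions(n):
    """Generate n structurally distinct coroutine functions.

    Each generated function has a slightly different body, so a tracing JIT
    can't reuse a single trace across tasks.
    """
    funcs = []
    for i in range(n):
        src = textwrap.dedent(f"""
            async def task_{i}(sleep):
                total = 0
                for j in range({10 + i % 50}):   # loop bound varies per function
                    total += (j * {i + 1}) % 7
                await sleep(0)                   # yield to the scheduler
                return total
        """)
        namespace = {}
        exec(src, namespace)
        funcs.append(namespace[f"task_{i}"])
    return funcs
```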

Thanks for the suggestion! A benchmark like this would be valuable. It's mostly a matter of someone with enough knowledge taking the time to do it. It's relatively straightforward to write a benchmark (even outside the pyperformance repo), and pyperformance provides plenty of examples.

Also, it would be helpful to have both a workload-oriented benchmark and a feature-oriented one (macro vs. micro). There's more info about this on the faster-cpython wiki.

belm0 commented

It may be useful to write the benchmark against the anyio API, so that it could be run against both asyncio and trio.
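For example, the benchmark body could be written once against anyio and driven by either backend (a minimal sketch-- the task count and per-task work are placeholders):

```python
import anyio

async def worker(i):
    # placeholder work; a real benchmark would run distinct code per task
    await anyio.sleep(0)

async def main():
    async with anyio.create_task_group() as tg:
        for i in range(100):
            tg.start_soon(worker, i)

anyio.run(main)                  # asyncio backend (the default)
anyio.run(main, backend="trio")  # trio backend
```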

belm0 commented

I'm thinking about which metrics of my app would best represent the workload, and then creating a synthetic benchmark with similar metrics.

Assuming a single-threaded program that isn't hitting CPU limits, perhaps (a measurement sketch follows the list):

  • number of active tasks (coroutines)
  • task context switch rate
  • unique task context switch rate - given 1s sliding window of context switches, count of unique tasks (by underlying function) in that window
  • task step duration observations (for histogram) - wall time from task resume to yield
  • task step unique lines - source lines executed per step, counted once each (i.e. discounting repeated execution)
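Here's a rough sketch of how the context-switch count and step durations could be collected under trio via its Instrument hooks (illustrative only-- the `main` body is a placeholder, and asyncio would need a different mechanism):

```python
import time
import trio

class StepStats(trio.abc.Instrument):
    """Record per-step wall times and count task context switches."""

    def __init__(self):
        self.switches = 0
        self.step_durations = []
        self._resumed_at = None

    def before_task_step(self, task):
        self._resumed_at = time.perf_counter()

    def after_task_step(self, task):
        self.switches += 1
        self.step_durations.append(time.perf_counter() - self._resumed_at)

async def main():
    # placeholder workload: a handful of trivial tasks
    async with trio.open_nursery() as nursery:
        for _ in range(10):
            nursery.start_soon(trio.sleep, 0)

stats = StepStats()
trio.run(main, instruments=[stats])
print(stats.switches, sum(stats.step_durations))
```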
belm0 commented

here's how my app looks by those metrics:

  max active tasks: 580
  context switch rate:  4,267 / s
  unique context switch rate (median, 1s window):  84 / s
  step duration: avg 119 µs +/- 347 µs
                 quartiles:  44.5 µs, 67.0 µs, 111 µs
  step unique lines: avg  164.9 +/-  115.1
                     quartiles:  87.5, 159.9, 191.4

"unique context switch rate" substantiates that the workload is heterogeneous. Each second, it touches about 80 different coroutine functions (coro.cr_code).

"step unique lines" substantiates that each coroutine step involves a significant amount of code to trace (avg and median are ~160 lines)

From the context switch rate and avg step duration, it's possible to derive the approximate CPU load: 119e-6 s × 4267/s ≈ 51%

(This is running on CPython 3.8. The durations are about 25% shorter on Pyston 3.8.)

belm0 commented

What's missing from the metrics, and especially relevant to JIT implementations, is some notion of the number of opcodes run per step. It should probably exclude repeated loop iterations-- so something like "step opcode line count".

Update: I didn't go as far as counting opcodes, but I've revised the previous posts to include a unique line count per step.
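For reference, a crude way to count unique executed lines (a sketch assuming sys.settrace and wrapping a plain call-- a real measurement would hook task resume/yield instead):

```python
import sys

def unique_lines_executed(fn, *args):
    """Count distinct (file, line) pairs hit while running fn(*args)."""
    seen = set()

    def tracer(frame, event, arg):
        if event == "line":
            seen.add((frame.f_code.co_filename, frame.f_lineno))
        return tracer

    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(None)
    return len(seen)
```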