oshaughn/research-projects-RIT

can we identify (and re-route) intrinsically CPU-intensive workflows?


In this OSG Slack thread, after the PRP folks observed/complained that they were seeing unusually low GPU utilization by RIFT jobs, @astroclark explained:

turns out the latest batch of RIFT jobs that were submitted are intrinsically more CPU-intensive (a more expensive waveform approximant SEOBNRv4 fwiw). The waveform generation - CPU-bound - in this case is more expensive than the likelihood calculations - the GPU part. They're aware and agree that it would make more sense to run this type of job on CPUs.

This all makes sense and is not a problem, but sparks some questions and thoughts for me:

  1. How common is this (as a proportion of all the RIFT workflows you run, over time)?
  2. In practice, do/can you know in advance which runs will behave like this? Or is it something you can only really discover after you've run a workflow?
  3. Does it make sense to manually assign runs to CPUs or GPUs based on this knowledge, ad-hoc, or can it be done programmatically at either workflow generation-time or run-time, so humans aren't in the loop?
  4. Do you think it might be a good idea to instrument RIFT to collect some basic performance data while it runs, and then report, post-facto, the CPU and GPU utilization of each run as part of its results? I'm going to turn this last question into its own ticket (#16), because PyCBC did this a long time ago and it's been enormously helpful: it lets you set alarms if things go outside expected bounds (e.g., CPU utilization approaching zero), run reports on RIFT performance over time, build automated performance regression tests between RIFT versions, etc. (A sketch of such an alarm check appears after this list.)
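For concreteness, a minimal sketch of what such a post-facto alarm check might look like, assuming each run emitted a small JSON utilization summary. The file layout, field names, and threshold are all hypothetical, not an existing RIFT output:

```python
# Hypothetical post-facto check of per-run utilization summaries.
# Assumes each run writes a JSON file like:
#   {"run_id": "...", "cpu_util_mean": 0.87, "gpu_util_mean": 0.02}
# The field names and threshold below are illustrative only.
import json
import sys

GPU_UTIL_ALARM = 0.05  # alarm if mean GPU utilization is near zero


def check_run(summary_path):
    with open(summary_path) as f:
        stats = json.load(f)
    if stats["gpu_util_mean"] < GPU_UTIL_ALARM:
        print(f"ALARM: run {stats['run_id']} GPU utilization "
              f"{stats['gpu_util_mean']:.1%}; consider re-routing to CPU")


if __name__ == "__main__":
    for path in sys.argv[1:]:
        check_run(path)
```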

1, 2- we should know in advance; this should only happen of order once for each configuration that is better served by CPUs

3- the choice of CPU-only or GPU-only is made at setup time. Currently humans make that choice (see the sketch below for how it might be made programmatically instead).
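A minimal sketch of what that setup-time choice could look like if keyed on the waveform approximant, so no human is in the loop. The approximant set and the HTCondor-style resource requests are illustrative assumptions, not RIFT's actual configuration:

```python
# Hypothetical setup-time routing: pick CPU or GPU slots per configuration,
# based on whether the waveform approximant is known to be CPU-bound.
# The approximant list and resource numbers are illustrative only.
CPU_BOUND_APPROXIMANTS = {"SEOBNRv4"}  # waveform cost dominates the likelihood


def requirements_for(approximant):
    """Return HTCondor-style resource requests for this configuration."""
    if approximant in CPU_BOUND_APPROXIMANTS:
        return {"request_cpus": 4, "request_gpus": 0}
    return {"request_cpus": 1, "request_gpus": 1}


print(requirements_for("SEOBNRv4"))    # {'request_cpus': 4, 'request_gpus': 0}
print(requirements_for("IMRPhenomD"))  # {'request_cpus': 1, 'request_gpus': 1}
```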

4- see the other issue (#16): no tools currently exist to monitor GPU efficiency; a portable one would be helpful. Rather than instrumenting RIFT itself, the idea would be a parallel tool that users run alongside their jobs to monitor them (a sketch follows).
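A minimal sketch of such a parallel tool, assuming a node with nvidia-smi on the PATH and the psutil package installed. The polling interval and log format are arbitrary choices, not an existing RIFT interface:

```python
# Hypothetical user-run monitor: poll CPU utilization via psutil and GPU
# utilization via nvidia-smi, appending timestamped samples to a log that
# can be summarized after the job finishes.
import subprocess
import time

import psutil


def gpu_utilization():
    """Mean GPU utilization (%) across devices, queried via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"], text=True)
    vals = [float(line) for line in out.splitlines() if line.strip()]
    return sum(vals) / len(vals) if vals else 0.0


def monitor(logfile="utilization.log", interval=10.0):
    with open(logfile, "a") as log:
        while True:  # run until the user (or the job wrapper) kills it
            # cpu_percent(interval=None): utilization since the previous call
            cpu = psutil.cpu_percent(interval=None)
            gpu = gpu_utilization()
            log.write(f"{time.time():.0f} cpu={cpu:.1f} gpu={gpu:.1f}\n")
            log.flush()
            time.sleep(interval)


if __name__ == "__main__":
    monitor()
```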