ARM-software/workload-automation

Regarding Runtime Parameter

hosunhc opened this issue · 19 comments

The device that I am using has three clusters as shown in the device_config below:

  device_config:
    adb_server:
    big_core:
    core_clusters:
    core_names: ['A55', 'A55', 'A55', 'A55', 'A76', 'A76', 'X1', 'X1']

And if I try to change the frequency of cluster A76 when CPU4 is off, WA returns an error saying that it is not possible due to CPU4 being off even though CPU5, a cpu in the same cluster, is on:

  runtime_parameters:
    A55_frequency: 1328000
    A76_frequency: 1328000
    X1_frequency: 1745000
    sysfile_values:
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 1
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 0
      /sys/devices/system/cpu/cpu5/online: 1
      /sys/devices/system/cpu/cpu6/online: 1
      /sys/devices/system/cpu/cpu7/online: 0

Is there no way around this? Is it because CPU5 frequency is fixed to CPU4? Any advice is appreciated. The device is Pixel 6.

Hi, thanks for reporting this. That should not be the case, so it sounds like we might have a bug somewhere.

As a workaround, could you try explicitly specifying the frequency of the enabled cores and see if that allows you to make progress?

e.g.

cpu2_frequency: 1328000
cpu5_frequency: 1328000
cpu6_frequency: 1745000

Thanks for the quick response. Still does not seem to work:


workloads:
- name: stress-ng
  iterations: 10
  params:
    cleanup_assets: true
    duration: 10
    extra_args: '--cpu-method callfunc --taskset 6,7 -l 100'
    stressor: cpu
    threads: 2
    uninstall: false
  runtime_parameters:
    # A55_frequency: 1328000
    # A76_frequency: 1328000
    # X1_frequency: 1745000
    cpu2_frequency: 1328000
    cpu5_frequency: 1328000
    cpu6_frequency: 1745000
    sysfile_values:
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 1
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 0
      /sys/devices/system/cpu/cpu5/online: 1
      /sys/devices/system/cpu/cpu6/online: 1
      /sys/devices/system/cpu/cpu7/online: 1

With the output as below:

INFO     Running job wk1
INFO         Configuring augmentations
INFO         Configuring target for job wk1 (stress-ng) [1]
ERROR        Cannot configure frequencies for CPU4 as no CPUs are online.
INFO         Completing job wk1
ERROR    Job wk1 iteration 1 completed with status FAILED. retrying...
INFO     Running job wk1
INFO         Configuring augmentations
INFO         Configuring target for job wk1 (stress-ng) [1]
ERROR        Cannot configure frequencies for CPU4 as no CPUs are online.
INFO         Completing job wk1
ERROR    Job wk1 iteration 1 completed with status FAILED. retrying...
INFO     Running job wk1
INFO         Configuring augmentations
INFO         Configuring target for job wk1 (stress-ng) [1]
ERROR        Cannot configure frequencies for CPU4 as no CPUs are online.
INFO         Completing job wk1
ERROR    Job wk1 iteration 1 completed with status FAILED. Max retries exceeded.

Hmm.. I see. It seems like this is happening because WA resolves the cluster to its first cpu, without checking that this cpu is online, instead of finding the first "online" cpu in the cluster.

If you don't need particular cpus online and only care about the number online per cluster, one potential workaround is to online the first cpu of each cluster, which should allow WA's resolution to function as intended.
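To make the suspected resolution problem concrete, here is a minimal sketch. It is not WA's actual code: `group_by_cluster`, `resolve_first` and `resolve_first_online` are hypothetical helpers, the core layout comes from the device_config above, and cpu0 is assumed to be online because the first example does not list it.

```python
# Core layout taken from the device_config in this issue.
core_names = ['A55', 'A55', 'A55', 'A55', 'A76', 'A76', 'X1', 'X1']

# Online state from the first runtime_parameters example.
# Assumption: cpu0 is not listed there, so it is treated as online.
online = {0: True, 1: False, 2: True, 3: False,
          4: False, 5: True, 6: True, 7: False}


def group_by_cluster(names):
    """Map each core name to the list of cpu ids that carry it."""
    clusters = {}
    for cpu, name in enumerate(names):
        clusters.setdefault(name, []).append(cpu)
    return clusters


def resolve_first(cpus):
    """Suspected behaviour: always take the first cpu of the cluster."""
    return cpus[0]


def resolve_first_online(cpus, online_state):
    """Intended behaviour: take the first cpu that is actually online."""
    for cpu in cpus:
        if online_state.get(cpu, False):
            return cpu
    raise RuntimeError('no cpus online in cluster {}'.format(cpus))


clusters = group_by_cluster(core_names)
print(resolve_first(clusters['A76']))                 # 4, which is offline
print(resolve_first_online(clusters['A76'], online))  # 5
```

With the first helper the A76 cluster maps to cpu4, which matches the "Cannot configure frequencies for CPU4" error in the log, while the second helper would pick cpu5.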
E.g. for your first example:

    sysfile_values:
      /sys/devices/system/cpu/cpu0/online: 1
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 1
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 1
      /sys/devices/system/cpu/cpu5/online: 0
      /sys/devices/system/cpu/cpu6/online: 1
      /sys/devices/system/cpu/cpu7/online: 0

Ahhhh I see. I was hoping that wasn't the case, as I would prefer having the flexibility of choosing particular cpus.

I think I've found the problem (and a few others in the process). Would you be able to try out this [1] branch on your setup and let me know if this resolves the issue for you?

[1] https://github.com/marcbonnici/workload-automation/tree/cpu_domain_fix

Okay, so I switched branches and installed it using setup.py:

cd workload-automation
sudo -H python setup.py install

The reported version is 3.4.0.dev1+7c432d74, but the issue still seems to occur.

Hmm.. thanks for trying that out.
Do you have your run.log available to see if there are any further hints in there?

run.log
The workload agenda is here:

workloads:
- name: stress-ng
  iterations: 5
  params:
    cleanup_assets: true
    duration: 10
    extra_args: '--cpu-method gcd --taskset 5,7 -l 100'
    stressor: cpu
    threads: 2
    uninstall: false
  runtime_parameters:
    A55_frequency: 1328000
    A76_frequency: 1328000
    X1_frequency: 1745000
    sysfile_values:
      /sys/devices/system/cpu/cpu1/online: 0
      /sys/devices/system/cpu/cpu2/online: 0
      /sys/devices/system/cpu/cpu3/online: 0
      /sys/devices/system/cpu/cpu4/online: 0
      /sys/devices/system/cpu/cpu5/online: 1
      /sys/devices/system/cpu/cpu6/online: 0
      /sys/devices/system/cpu/cpu7/online: 1

Thanks, would you be able to pull my branch again and see if this resolves this problem for you?

Still seems to be happening.
run.log
Also in case you need the agenda:
stressng_w_10iter.txt

Hi hosunhc, what happens if you try to explicitly set the frequency for each online CPU, rather than the cluster frequency?

e.g.

  runtime_parameters:
    cpu0_frequency: 1328000
    cpu5_frequency: 1328000
    cpu7_frequency: 1745000
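If it helps, the per-cpu parameters can be derived from the cluster settings rather than written by hand. The following is only a sketch: `per_cpu_parameters` is a hypothetical helper, not a WA function, and cpu0 is assumed to be online because the agenda above does not list it.

```python
# Core layout and cluster frequencies taken from this issue.
core_names = ['A55', 'A55', 'A55', 'A55', 'A76', 'A76', 'X1', 'X1']
cluster_frequencies = {'A55': 1328000, 'A76': 1328000, 'X1': 1745000}

# Online state from the sysfile_values in the agenda above.
# Assumption: cpu0 is not listed there, so it is treated as online.
online = {0: True, 1: False, 2: False, 3: False,
          4: False, 5: True, 6: False, 7: True}


def per_cpu_parameters(names, frequencies, online_state):
    """Build cpuN_frequency entries for every cpu that will be online."""
    params = {}
    for cpu, name in enumerate(names):
        if online_state.get(cpu, False):
            params['cpu{}_frequency'.format(cpu)] = frequencies[name]
    return params


params = per_cpu_parameters(core_names, cluster_frequencies, online)
for key, value in params.items():
    print('{}: {}'.format(key, value))
```

For this agenda the sketch produces the same three entries as the example above (cpu0, cpu5 and cpu7).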

Right, it looks like the next issue here is that WA queries the device at the time it validates the input parameters, and the device state can change before the parameters are committed to the device.

At that point the A76 cluster (for example) resolves to both cpus 4 and 5 (if both are online at that time), so WA picks the first cpu. The error is generated later because, as part of the sysfile settings, that cpu is turned off before WA can actually commit the frequency.

I think Scott's workaround should work as it does not rely on this resolution. However, I've also updated my branch again to change the order in which the sysfile runtime parameters are set on the device, so that any frequency configuration happens before we offline cpus. Would you be able to check if this one gets things working for you?
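The ordering problem described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the code on the branch: `FakeTarget` and `commit` are hypothetical, and the model only tracks per-cpu state, whereas on real hardware the cpus of a cluster typically share one frequency policy.

```python
class FakeTarget:
    """Toy stand-in for a device; not part of WA or devlib."""

    def __init__(self, num_cpus):
        self.online = {cpu: True for cpu in range(num_cpus)}
        self.frequency = {}

    def set_online(self, cpu, state):
        self.online[cpu] = bool(state)

    def set_frequency(self, cpu, freq):
        # Mimic the failure seen in the log: an offline cpu cannot be set.
        if not self.online[cpu]:
            raise RuntimeError('CPU{} is offline'.format(cpu))
        self.frequency[cpu] = freq


def commit(target, frequencies, hotplug, frequencies_first):
    """Apply both parameter groups in the requested order."""
    def apply_frequencies():
        for cpu, freq in frequencies.items():
            target.set_frequency(cpu, freq)

    def apply_hotplug():
        for cpu, state in hotplug.items():
            target.set_online(cpu, state)

    steps = [apply_frequencies, apply_hotplug]
    if not frequencies_first:
        steps.reverse()
    for step in steps:
        step()


frequencies = {4: 1328000}  # A76 resolved to cpu4 at validation time
hotplug = {4: 0, 5: 1}      # the sysfile_values then offline cpu4

try:
    commit(FakeTarget(8), frequencies, hotplug, frequencies_first=False)
except RuntimeError as error:
    print('hotplug first:', error)

target = FakeTarget(8)
commit(target, frequencies, hotplug, frequencies_first=True)
print('frequencies first:', target.frequency)
```

Applying the hotplug settings first reproduces the failure for cpu4, while applying the frequencies first succeeds because cpu4 is still online when its frequency is written.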

So I tried both Scott's method and the normal cluster method, and they both work great! There was one instance using the A76 method where the first iteration ran fine but the remaining four iterations hit the same CPU issue, but this only happened once. If that error persists, I'll open a new issue, but at the moment I think it's fixed! Thanks!

Thanks for confirming, I'm glad we finally have a working setup for you.

I think I might know what could cause the issue with the cluster approach, but I would need to look into this further, so I'll keep this issue open for now as well.

So it seems that this could be a more persistent issue.
I attached the run log below:
run.log

I think the issue here is the combination of cluster names, hotplugging and iterations. The resolution of the cpus is still being performed at the start of the run, so when trying to configure the device on subsequent iterations we run into the same problem.

Does using the cpuX_frequency notation still work here?
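One way to picture the iteration problem is the toy simulation below. It is a sketch of the explanation above, not WA's implementation: `first_online` and `run` are hypothetical, all cpus are assumed to be online before the run, and each iteration is assumed to set the frequency before applying its hotplug settings.

```python
def first_online(cpus, online):
    """Return the first online cpu of a cluster, or None if all are off."""
    for cpu in cpus:
        if online[cpu]:
            return cpu
    return None


def run(cpus, hotplug, iterations, resolve_each_iteration):
    """Simulate frequency configuration across iterations of one job."""
    online = {cpu: True for cpu in cpus}  # assumed state before the run
    cached = first_online(cpus, online)   # resolved once, at the start
    statuses = []
    for _ in range(iterations):
        if resolve_each_iteration:
            cpu = first_online(cpus, online)
        else:
            cpu = cached
        ok = cpu is not None and online[cpu]
        statuses.append('OK' if ok else 'FAILED')
        online.update(hotplug)  # the job then applies its hotplug settings
    return statuses


a76 = [4, 5]
hotplug = {4: False, 5: True}
print(run(a76, hotplug, 5, resolve_each_iteration=False))
# ['OK', 'FAILED', 'FAILED', 'FAILED', 'FAILED']
print(run(a76, hotplug, 5, resolve_each_iteration=True))
# ['OK', 'OK', 'OK', 'OK', 'OK']
```

With a single resolution at the start, only the first iteration succeeds, which is consistent with the "first iteration ran fine but the remaining four failed" report. The cpuX_frequency notation avoids the lookup altogether.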

Yep, using cpuX_frequency works great.

Ok, thanks for confirming. It looks like solving the cluster parameters in combination with hotplugging would require some more invasive changes to the runtime parameter mechanism.

Just to double check, are you still using my topic branch to get things working on your end rather than the upstream implementation? If so, I'll look at merging those changes so we at least have a workable solution upstream as well.

Yep, I've been using your branch rather than the upstream implementation.