tuxedocomputers/tuxedo-control-center

TCC prevents dGPU from going to D3cold state


PCI-Express Runtime D3 (RTD3) power management is an important feature for achieving longer battery life (driver readme). D0 is the highest power state; the device can be switched to lower states for power saving. With RTD3, the dGPU can be powered off entirely via the D3cold state. Print the current state with this command:

$ cat /sys/bus/pci/devices/0000:01:00.0/power_state
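
For reference, a minimal sketch (TypeScript, since tccd is a Node daemon) of how a poller could check this state first and skip the dGPU query entirely while the card is powered off; the PCI address is the one from the command above and may differ per machine:

    import { readFileSync } from "fs";

    // PCI address of the dGPU as used above; adjust for your machine.
    const DGPU_POWER_STATE = "/sys/bus/pci/devices/0000:01:00.0/power_state";

    function dGpuIsPoweredOff(): boolean {
        try {
            // Typical values are "D0" (fully on) and "D3cold" (powered off via RTD3).
            return readFileSync(DGPU_POWER_STATE, "utf8").trim() === "D3cold";
        } catch {
            return false; // no dGPU at this address, or sysfs not readable
        }
    }

    if (dGpuIsPoweredOff()) {
        console.log("dGPU is in D3cold; skip nvidia-smi so it is not woken up");
    }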

The problem is that TCC keeps the dGPU awake all the time, and exiting TCC is not enough. I had to run sudo service tccd stop to let the dGPU switch to the D3cold state. After I started the tccd service again it stayed in D3cold, but opening TCC again switched it back to D0.

This is frustrating because I lower the CPU frequency with TCC to save battery, yet TCC then wastes energy by keeping the dGPU awake.

Maybe related to #341

Hello,

Leaving the TCC dashboard should be enough to stop polling the dGPU for info and let it go to d3cold. Going to the dashboard alone should also not wake the dGPU if it is in d3cold at the time.

Does this match what you are seeing? Any other behaviour I would characterize as a bug.

For me, opening the dashboard does change d3cold to d0 and keeps it there; only after closing it does the dGPU go back to d3cold.

I checked with forkstat to catch newly created processes, and I see this process being spawned routinely:

/bin/sh -c nvidia-smi --query-gpu=power.draw,power.max_limit,enforced.power.limit,clocks.gr,clocks.max.gr --format=csv,noheader
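
That command line is exactly the shape Node's child_process.exec produces, since it runs commands through /bin/sh -c. As a hedged illustration (not the actual tccd code), the polling pattern presumably looks something like this, and the bug would amount to the interval never being cleared:

    import { exec } from "child_process";

    const QUERY =
        "nvidia-smi --query-gpu=power.draw,power.max_limit,enforced.power.limit," +
        "clocks.gr,clocks.max.gr --format=csv,noheader";

    // As long as this interval runs, nvidia-smi keeps waking the dGPU to D0.
    const timer = setInterval(() => {
        exec(QUERY, (err, stdout) => {
            if (!err) {
                console.log(stdout.trim());
            }
        });
    }, 2000);

    // Whoever starts the poll must reliably stop it once no client needs the
    // data; here we simply stop after 10 seconds for demonstration.
    setTimeout(() => clearInterval(timer), 10_000);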

And yes, the TCC GUI is running in the background. But it doesn't happen every time. I think I found a behavior that may help pinpoint the bug. Maybe you can also reproduce it:

  1. Open TCC.
  2. Wait for CPU and iGPU values to refresh.
  3. Switch to the dGPU tab.
  4. Wait for a few seconds and then close the window.
  5. Sometimes it keeps calling nvidia-smi in the background.

Screen record: https://www.youtube.com/watch?v=rfK6HyMgoCM

Maybe implementing a heartbeat technique for checking that hardware-information subscribers are still alive would be a solution. For example, if there is no heartbeat from the GUI for 10 seconds, the daemon can assume the GUI was killed, crashed, or is unable to respond.
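
A rough sketch of that idea on the daemon side; all names here (onHeartbeat, startDGpuPolling, stopDGpuPolling) are made up for illustration, not actual tccd APIs:

    const HEARTBEAT_TIMEOUT_MS = 10_000;
    let watchdog: ReturnType<typeof setTimeout> | undefined;
    let polling = false;

    // Called whenever the GUI sends a heartbeat (e.g. over D-Bus).
    function onHeartbeat(): void {
        if (!polling) {
            polling = true;
            startDGpuPolling();
        }
        // Re-arm the watchdog on every heartbeat.
        if (watchdog !== undefined) {
            clearTimeout(watchdog);
        }
        watchdog = setTimeout(() => {
            // No heartbeat for 10 s: assume the GUI was killed, crashed, or
            // cannot respond, and stop keeping the dGPU awake.
            polling = false;
            stopDGpuPolling();
        }, HEARTBEAT_TIMEOUT_MS);
    }

    function startDGpuPolling(): void { /* begin periodic nvidia-smi queries */ }
    function stopDGpuPolling(): void { /* stop queries; dGPU may reach D3cold */ }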

I checked the code, but I really don't know Electron. However, after seeing many async functions this came to my mind as a possible solution.

The expected behavior is that during initial startup tcc wakes up the dGPU, and metrics are then collected because the dGPU got woken up. This initial wakeup seems to be caused by Electron rather than by the code itself. Once open, minimizing the window or leaving the dashboard for another part of tcc should disable the collection of metrics. Reopening the minimized application will then not wake up the dGPU in the dashboard if it is in d3. Here are some videos with 22.04.4 LTS via our FAI.

dashboard.mp4

It is odd that it collects data once tcc is closed, and that is indeed a bug. It is a bit hard to reproduce, but I can replicate it after several attempts. I need more time to analyze and think about it.

dashboard_nvidia.mp4

Did some testing, and it looks like the UI wakes up the dGPU but only "closes" it when you switch back to the iGPU tab and then close it.

Exiting the app while the dGPU tab is open doesn't stop polling the dGPU. You either have to kill the tccd daemon, or start the UI again, switch back to the iGPU tab, and exit the UI.

So I'd say this is a bug.

but only "closes" it when u switch back to iGPU tab and close it.

The dashboard does not differentiate between tabs: data is collected for all gauges while the dashboard component is visible, to ensure a seamless transition between tabs.
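
In other words, collection is tied to the dashboard component's lifecycle, not to the active tab. Roughly like this (the dashboard is an Angular component; setDGpuDataCollection is a hypothetical stand-in for the real D-Bus call into tccd):

    // Hypothetical stand-in for the real D-Bus call into tccd.
    function setDGpuDataCollection(on: boolean): void {
        console.log(`dGPU data collection: ${on ? "on" : "off"}`);
    }

    class DashboardComponent {
        // Angular calls this when the dashboard becomes visible.
        ngOnInit(): void {
            setDGpuDataCollection(true); // all gauges, regardless of tab
        }

        // Angular calls this when the user navigates away or the window closes.
        ngOnDestroy(): void {
            setDGpuDataCollection(false); // lets the dGPU return to d3cold
        }
    }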

> you have to kill the tccd daemon

That resets tccd to its default values, which means dGPU data collection is off.

> It is odd that it collects data once tcc is closed, and that is indeed a bug. It is a bit hard to reproduce, but I can replicate it after several attempts. I need more time to analyze and think about it.

As a small update: debugging was not easy because I could not reproduce the problem consistently. Adding more verbose debug logging seemingly made the issue disappear, which made it even harder to analyze. The dashboard component does call the required functions and turns off the data collection. However, sometimes D-Bus does not show that the value was actually set. I think tcc terminates before the signal reaches tccd, so tcc cannot always turn collection off. It appears to be a race condition.

I have considered various solutions, and a fix should arrive soon. To summarize the current idea: I plan to wait in Electron's close event for a tccd value, to ensure the data collection status is set correctly during normal operation. Additionally, a timeout in tccd will automatically turn off data collection if the gpu dbus functions are not called, so the status stays correct if tcc crashes or closes unexpectedly.
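
A minimal sketch of the tcc-side half of that plan, assuming hypothetical D-Bus wrappers (setDGpuDataCollection / getDGpuDataCollection, stubbed here): intercept the quit, wait until tccd confirms collection is off, then really exit. The tccd-side timeout then covers crashes and SIGKILL, where this handler never gets a chance to run.

    import { app } from "electron";

    // Hypothetical stand-ins for the real D-Bus wrappers in tcc's dbus client.
    let collectionOn = true;
    async function setDGpuDataCollection(on: boolean): Promise<void> {
        collectionOn = on;
    }
    async function getDGpuDataCollection(): Promise<boolean> {
        return collectionOn;
    }

    async function confirmCollectionOff(): Promise<void> {
        // Retry a few times; a lost signal must not keep the app alive forever.
        for (let attempt = 0; attempt < 10; attempt++) {
            await setDGpuDataCollection(false);
            if (!(await getDGpuDataCollection())) {
                return; // tccd confirmed: collection is off
            }
            await new Promise((resolve) => setTimeout(resolve, 100));
        }
    }

    app.on("before-quit", (event) => {
        event.preventDefault(); // hold the quit until tccd has confirmed
        confirmCollectionOff().finally(() => app.exit()); // exit() skips before-quit
    });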

As a small update: I tried to fit various things into the next release, and it got a bit delayed. Maybe this week if things go well.

Should be fixed in 2.1.8.