hbriese/fancon

Start with root/systemd unit fails

artemklevtsov opened this issue · 14 comments

Describe the bug

fancon fails when run as root with NVIDIA control enabled.

# LANG=C fancon
No protocol specified
20/05/06 20:27 [109117] <warning> $XAUTHORITY env variable(s) not set, set to enable NVIDIA support
20/05/06 20:27 [109117] <fatal> X11Display must be connected
terminate called after throwing an instance of 'std::exception'
  what():  std::exception

But I can't run as normal user:

$ fancon
20/05/06 20:32 [109236] <fatal> Must be run as root

It seems that a service which needs a running Xorg should be run by a non-root user.

Steps to Reproduce

  • build fancon with NVIDIA support
  • run fancon as root or with the systemd unit file

Expected behavior

Run without errors with systemd unit file.

Additional context

Config:

config {
  update_interval: 1000
  dynamic: true
  smoothing_intervals: 3
  top_stickiness_intervals: 2
  temp_averaging_intervals: 3
}
devices {
  fan {
    type: SYS
    label: "hwmon1/fan2"
    sensor: "hwmon0/Tdie"
    temp_to_rpm: "40: 0%, 50: 20%, 60: 50%, 70: 80%, 80: 100%"
    rpm_to_pwm: "0: 30, 175: 34, 181: 36, 203: 38, 212: 40, 219: 42, 239: 44, 249: 46, 261: 48, 262: 50, 266: 52, 283: 54, 297: 56, 310: 58, 320: 60, 336: 62, 349: 64, 352: 66, 370: 68, 377: 70, 398: 72, 406: 74, 418: 76, 424: 78, 451: 80, 466: 82, 476: 84, 496: 86, 498: 88, 499: 90, 518: 92, 533: 94, 543: 96, 547: 98, 569: 100, 573: 102, 584: 104, 597: 106, 606: 108, 625: 110, 633: 112, 639: 114, 648: 116, 656: 118, 669: 120, 685: 122, 695: 124, 699: 126, 717: 128, 722: 130, 727: 132, 734: 134, 754: 136, 758: 138, 779: 140, 781: 142, 790: 144, 791: 146, 810: 148, 821: 150, 841: 152, 849: 154, 856: 156, 861: 158, 867: 160, 882: 162, 889: 164, 903: 166, 916: 168, 950: 170, 954: 172, 964: 174, 968: 176, 976: 178, 982: 180, 984: 182, 1000: 184, 1021: 186, 1038: 188, 1040: 190, 1042: 192, 1044: 194, 1051: 196, 1065: 198, 1094: 200, 1102: 202, 1109: 204, 1116: 206, 1128: 208, 1129: 210, 1147: 212, 1149: 214, 1159: 216, 1171: 218, 1185: 220, 1190: 222, 1198: 224, 1217: 226, 1220: 228, 1236: 230, 1250: 232, 1256: 234, 1261: 236, 1274: 238, 1275: 242, 1299: 244, 1305: 246, 1309: 250, 1323: 252, 1337: 254, 1356: 255"
    pwm_path: "/sys/class/hwmon/hwmon1/pwm2"
    rpm_path: "/sys/class/hwmon/hwmon1/fan2_input"
    enable_path: "/sys/class/hwmon/hwmon1/pwm2_enable"
    driver_flag: 5
  }
  fan {
    type: NVIDIA
    label: "1660_SUPER"
    sensor: "1660_SUPER_temp"
    temp_to_rpm: "60: 0%, 70: 30%, 80: 50%, 90: 100%"
    rpm_to_pwm: "1315: 102, 2454: 188, 2646: 204, 3303: 255"
    start_pwm: 102
  }
  sensor {
    type: NVIDIA
    label: "1660_SUPER_temp"
  }
  sensor {
    label: "hwmon0/Tdie"
    input_path: "/sys/class/hwmon/hwmon0/temp1_input"
    max_path: "/sys/class/hwmon/hwmon0/temp1_max"
  }
}

Try the latest commit. It should fix the crash but may result in NVIDIA control being disabled.
Please run sudo fancon log-lvl=debug

Then try:
fancon requires X11 access due to LibNVCtrl (NVIDIA control), but opening an X11 display requires XAUTHORITY and DISPLAY to be set. XAUTHORITY is set by the display manager (gdm, lightdm etc.)

Manually configure the unset environmental variable(s)

Inside /etc/profile

  • export XAUTHORITY=...; You can find the XAuthority file by running xauth info
  • xhost si:localuser:root; May be necessary on Wayland

https://wiki.archlinux.org/index.php/Running_GUI_applications_as_root
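
To illustrate the requirement described above, here is a minimal sketch of a pre-flight check that reports which of DISPLAY and XAUTHORITY are unset before any attempt to open the X11 display. The function name missing_x11_env is hypothetical and this is not fancon's actual code; it only shows the kind of check that could sit in front of the NVIDIA initialisation.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper (not part of fancon): returns the names of the
// X11-related environment variables that are unset or empty.
// An empty result means both DISPLAY and XAUTHORITY are available.
std::vector<std::string> missing_x11_env() {
    const char* names[] = {"DISPLAY", "XAUTHORITY"};
    std::vector<std::string> missing;
    for (const char* name : names) {
        const char* value = std::getenv(name);
        if (value == nullptr || *value == '\0') {
            missing.emplace_back(name);
        }
    }
    return missing;
}
```

A caller could print each returned name in a warning and skip NVIDIA setup when the list is not empty.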

Thanks for the quick reply and fixes. I will test it soon. For now I want to note that fancon should start after Xorg (the display manager) only if it needs control over NVIDIA devices.

It seems my system does not have graphical-session.target, so I got the following error:

$ sudo systemctl start fancon.service
Failed to start fancon.service: Unit graphical-session.target not found.

Some details:

$ file /etc/systemd/system/display-manager.service
/etc/systemd/system/display-manager.service: symbolic link to /usr/lib/systemd/system/sddm.service
$ systemctl list-units --type=target
  UNIT                  LOAD   ACTIVE SUB    DESCRIPTION                  
  basic.target          loaded active active Basic System                 
  cryptsetup.target     loaded active active Local Encrypted Volumes      
  getty.target          loaded active active Login Prompts                
  graphical.target      loaded active active Graphical Interface          
  local-fs-pre.target   loaded active active Local File Systems (Pre)     
  local-fs.target       loaded active active Local File Systems           
  multi-user.target     loaded active active Multi-User System            
  network-online.target loaded active active Network is Online            
  network.target        loaded active active Network                      
  nss-lookup.target     loaded active active Host and Network Name Lookups
  paths.target          loaded active active Paths                        
  rpc_pipefs.target     loaded active active rpc_pipefs.target            
  rpcbind.target        loaded active active RPC Port Mapper              
  slices.target         loaded active active Slices                       
  sockets.target        loaded active active Sockets                      
  sound.target          loaded active active Sound Card                   
  swap.target           loaded active active Swap                         
  sysinit.target        loaded active active System Initialization        
  timers.target         loaded active active Timers  

Can I start the service as a normal user with systemctl --user?

Try the nvidia-testing branch, it adds an additional reload step for nvidia devices once the (now) graphical.target is triggered. Hopefully graphical.target is sufficient to open the X display without issues. Unfortunately I no longer have a nvidia gpu so I have been unable to test for a while, so thanks for all your help.

root is required for changing fan speeds.

I tried the nvidia-testing branch. The fancon.service still crashed:

$ journalctl -u fancon.service -o cat | tail
Started fancon.
No protocol specified
<warning> $XAUTHORITY env variable(s) not set, set to enable NVIDIA support
<fatal> X11Display must be connected
terminate called after throwing an instance of 'std::exception'
  what():  std::exception
fancon.service: Main process exited, code=dumped, status=6/ABRT
fancon.service: Failed with result 'core-dump'.

I can investigate the specifics of the crash with that core dump. Also if you could provide the output of fancon -i

As it seems you're also on Wayland you will need to add xhost +si:localuser:root to your /etc/profile as specified above, if not done already

I use KDE with SDDM (Xorg session). It still crashes with xhost:

$ xhost si:localuser:root
localuser:root being added to access control list
$ LANG=C sudo su - -c 'fancon -v'
20/05/08 22:45 [179770] <debug> Guessing X11 env var $DISPLAY, consider setting
20/05/08 22:45 [179770] <debug> Guessing X11 env var $XAUTHORITY, consider setting
20/05/08 22:45 [179770] <warning> $XAUTHORITY env variable(s) not set, set explicitly to enable NVIDIA control
20/05/08 22:45 [179770] <debug> X11 display cannot be opened
20/05/08 22:45 [179770] <warning> NVIDIA sensor configured but NVIDIA control is disabled at this time
20/05/08 22:45 [179770] <fatal> X11 display couldn't be opened but is being used anyway!
terminate called after throwing an instance of 'std::exception'
  what():  std::exception
Aborted (core dumped)

With root and XAUTHORITY:

$ LANG=C sudo su -c 'export XAUTHORITY=/tmp/xauth-1000-_0; fancon -v'

WARNING: Unable to locate/open X configuration file.

nvidia-xconfig could not be found, either install it, or set the coolbits value manually
20/05/08 22:55 [182003] <error> Failed to query number of NVIDIA GPUs
20/05/08 22:55 [182003] <error> Failed to query number of NVIDIA GPUs
Starting controller
20/05/08 22:55 [182003] <warning> hwmon1/fan1: skipping - curve not configured & sensor not configured & 
20/05/08 22:55 [182003] <warning> hwmon1/fan3: skipping - curve not configured & sensor not configured & 

Why can't we skip the NVIDIA init when Xorg is not available? If I understand correctly, fancon.service should be able to start early, without Xorg, but right now it crashes.

Please re-test. Either way the program shouldn't crash when the init fails.
Eventually I hope to replace XNVCtrl with nvidia's newer library NVML which doesn't depend on X11, but unfortunately it's been incomplete for a couple of years
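
As a sketch of the "shouldn't crash when the init fails" behaviour, the snippet below wraps a backend initialiser so that a failure only disables NVIDIA control instead of aborting the process. ControllerState and start_controller are hypothetical names used for illustration; fancon's real structure may differ.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>

// Hypothetical stand-in for the controller's state; not fancon's real type.
struct ControllerState {
    bool nvidia_enabled = false;
};

// Runs the given NVIDIA initialiser and downgrades any failure to a warning,
// so the caller can keep controlling the remaining (non-NVIDIA) fans.
ControllerState start_controller(const std::function<void()>& init_nvidia) {
    ControllerState state;
    try {
        init_nvidia();
        state.nvidia_enabled = true;
    } catch (const std::exception& e) {
        std::cerr << "<warning> NVIDIA control disabled: " << e.what() << '\n';
    }
    return state;
}
```

With this shape, an initialiser that throws because the X11 display cannot be opened leaves nvidia_enabled set to false, and the SYS fans from the config can still be driven.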

Sounds good about NVML.

I enabled fancon.service and fancon-nvidia.service. Here is the log after reboot:

$ LANG=C journalctl -b -u fancon.service
-- Logs begin at Sat 2020-05-09 11:34:25 +07, end at Sun 2020-05-10 19:28:32 +07. --
May 10 19:24:36 unikum-desktop systemd[1]: Started fancon.
May 10 19:24:36 unikum-desktop systemd[1]: fancon.service: Main process exited, code=killed, status=12/USR2
May 10 19:24:36 unikum-desktop systemd[1]: fancon.service: Failed with result 'signal'.
May 10 19:24:38 unikum-desktop systemd[1]: fancon.service: Scheduled restart job, restart counter is at 1.
May 10 19:24:38 unikum-desktop systemd[1]: Stopped fancon.
May 10 19:24:38 unikum-desktop systemd[1]: Started fancon.
May 10 19:24:38 unikum-desktop fancon[828]: <warning> NVIDIA sensor configured but NVIDIA control is disabled at this time
May 10 19:24:38 unikum-desktop fancon[828]: <warning> NVIDIA fan is configured but NVIDIA control is disabled at this time
May 10 19:24:38 unikum-desktop fancon[828]: Starting controller
May 10 19:24:38 unikum-desktop fancon[828]: <warning> hwmon1/fan1: skipping - curve not configured & sensor not configured
May 10 19:24:38 unikum-desktop fancon[828]: <warning> hwmon1/fan3: skipping - curve not configured & sensor not configured
$ LANG=C journalctl -b -u fancon-nvidia.service
-- Logs begin at Sat 2020-05-09 11:34:25 +07, end at Sun 2020-05-10 19:32:28 +07. --
May 10 19:24:36 unikum-desktop systemd[1]: Starting Reload fancon once NVIDIA control is possible...
May 10 19:24:36 unikum-desktop systemd[1]: fancon-nvidia.service: Succeeded.
May 10 19:24:36 unikum-desktop systemd[1]: Finished Reload fancon once NVIDIA control is possible.

This line seems to appear when fancon-nvidia.service starts:

fancon.service: Main process exited, code=killed, status=12/USR2
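
One possible reading of the status=12/USR2 line, offered as an assumption rather than a diagnosis: the default action for SIGUSR2 is to terminate the process, so a reload signal that arrives before the daemon has installed its handler would kill it. The sketch below shows a handler being installed up front; the names reload_requested, on_reload_signal and install_reload_handler are hypothetical and not taken from fancon.

```cpp
#include <csignal>

// Set by the handler; a main loop would poll this flag and clear it
// after reloading. (Hypothetical name, not from fancon.)
volatile std::sig_atomic_t reload_requested = 0;

// Signal handlers should do very little; this one only records the request.
extern "C" void on_reload_signal(int) {
    reload_requested = 1;
}

// Call this as early as possible during start-up, so that a reload signal
// arriving early is recorded instead of terminating the process.
bool install_reload_handler() {
    return std::signal(SIGUSR2, on_reload_signal) != SIG_ERR;
}
```

If the handler is installed before any slow initialisation, an early SIGUSR2 only sets the flag, and the reload can be performed once the controller is ready.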

How can I check that the NVIDIA fan is being controlled by fancon?

fancon_system_info.txt

Monitoring with NVML is implemented in this repo: https://github.com/oblalex/nvidia-gpu-monitoring
It may be helpful.

Here is a toy example:

main.cpp:

#include <iostream>
#include <stdexcept>
#include <nvml.h>

// Throw if an NVML call did not succeed.
void raise_nv_status(const nvmlReturn_t& st) {
    if (st != nvmlReturn_t::NVML_SUCCESS) {
        throw std::runtime_error(nvmlErrorString(st));
    }
}

unsigned int get_fan_speed(unsigned int idx) {
    nvmlReturn_t nv_status;
    nvmlDevice_t handle;
    nv_status = nvmlDeviceGetHandleByIndex(idx, &handle);
    raise_nv_status(nv_status);
    unsigned int fan_speed = 0;
    nv_status = nvmlDeviceGetFanSpeed(handle, &fan_speed);
    raise_nv_status(nv_status);
    return fan_speed;
}

int main(int argc, char *argv[]) {
    nvmlReturn_t nv_status;
    nv_status = nvmlInit();
    raise_nv_status(nv_status);
    unsigned int device_count = 0;
    nv_status = nvmlDeviceGetCount(&device_count);
    raise_nv_status(nv_status);    
    std::cout << "Number of devices: " << device_count << std::endl;
    std::cout << "Fun speeds:" << std::endl;
    for (unsigned int i = 0; i < device_count; ++i) {
        unsigned int speed = get_fan_speed(i);
        std::cout << "  Device" << i << ": " << speed << std::endl;
    }
    return 0;
}

CMakeLists.txt:

cmake_minimum_required(VERSION 3.0)

project(nvtest)

add_executable(nvtest main.cpp)
include_directories(/opt/cuda/targets/x86_64-linux/include)
target_link_libraries(nvtest nvidia-ml)

install(TARGETS nvtest RUNTIME DESTINATION bin)

It works as root without Xorg.

sudo su - -c 'LANG=C /home/unikum/Projects/CPP/nvtest/build/nvtest'
Number of devices: 1
Fun speeds:
  Device0: 0

Unfortunately NVML is missing crucial functionality, namely nvmlDeviceSetFanSpeed; see https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html

Their own controller is still using XNVCtrl - https://github.com/NVIDIA/nvidia-settings/blob/master/src/libXNVCtrlAttributes/NvCtrlAttributesNvml.c

There even exists some nvml code in fancon already but it's disabled by default due to requiring XNVCtrl anyway.

Thank you for the explanation. Now it's clear for me.

It seems that, to make this work without pain, we should separate the NVIDIA-related code and run it with systemctl --user after Xorg starts.