ROCm/ROCK-Kernel-Driver

[Issue]: Can not set fan speed on Radeon Pro W7900

Alic-Li opened this issue · 20 comments

Problem Description

I cannot set the fan speed on the Radeon Pro W7900. The same happens on other GFX1100 cards such as the RX 7900 XTX, while setting it on GFX1030 works. The GPU temperature climbs to 80~90, the memory temperature to 100, and the junction temperature to 100, but the fan speed stays at only 50%. >_<

Operating System

Ubuntu 22.04.3 (Jammy Jellyfish)

CPU

Intel i3-12100 with UHD730 Graphics

GPU

AMD Radeon Pro W7900

ROCm Version

ROCm 6.1.0

ROCm Component

amdsmi

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded

HSA System Attributes

Runtime Version: 1.13
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents


Agent 1


Name: 12th Gen Intel(R) Core(TM) i3-12100
Uuid: CPU-XX
Marketing Name: 12th Gen Intel(R) Core(TM) i3-12100
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 49152(0xc000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4300
BDFID: 0
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65587452(0x3e8c8fc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65587452(0x3e8c8fc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65587452(0x3e8c8fc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx1100
Uuid: GPU-ed466fc6e51f9536
Marketing Name: AMD Radeon PRO W7900
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29768(0x7448)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1760
BDFID: 768
Internal Node ID: 1
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 92
SDMA engine uCode:: 20
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 47169536(0x2cfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 47169536(0x2cfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

No response

rocm-smi also cannot read the per-partition temperatures, such as the memory and compute temperatures, but the sensors command can.
Screenshot 2024-05-19 10-09-25

Note that for your 2nd message that's partition mode, not temperature. You aren't using a GPU that supports memory or compute partitions.

As for fan speed, what happens if you try to change it manually? Does dmesg throw any errors? Does it seem to work (i.e. no errors) but the fan doesn't change? Does it work if performance mode is set to auto and not manual? And if you set it to 100 and read it back, does it still return 50% as its value?

Thanks for your assistance. When I change the fan speed, rocm-smi reports that setting the fan speed succeeded, but the hardware fan keeps running at its default. dmesg doesn't throw any errors. The card still works, but the fan doesn't change. I always use performance mode, and it is not set to auto. I set it to 100 and read it back; the fan speed stays on auto, and its highest speed is still around 50%.

One more serious problem: I run the "rocm-smi --setsclk 2" command in a boot service because it limits the GPU clock. If I don't use the command, the clock stays at its default setting. Sometimes the GPU clock then goes up to 150% (3 GHz), which forces the computer to power off and reboot. Please fix it! Sincere gratitude!

Here is the output. But the fan speed is still on auto.
Screenshot 2024-05-23 10-08-21

OK so I see it saying that it set it to 100% there. Does it say it's at 100% after you run "rocm-smi" after but it's running slow, or does it just stay at the lower speed while reporting that lower speed? Does dmesg say anything after you've done that command?
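The read-back check asked about here can also be done directly against the hwmon files instead of through rocm-smi. Below is a minimal sketch, assuming the standard hwmon layout (a `pwm1_enable` mode file and a `pwm1` duty-cycle file, with an optional `pwm1_max`); `read_fan_state` is a hypothetical helper name, and the fallback maximum of 255 is an assumption based on the usual hwmon PWM range.

```python
from pathlib import Path


def read_fan_state(hwmon_dir):
    """Return (mode, percent) read from pwm1_enable and pwm1 in hwmon_dir."""
    hwmon = Path(hwmon_dir)
    mode = int((hwmon / "pwm1_enable").read_text().strip())
    pwm = int((hwmon / "pwm1").read_text().strip())
    # pwm1_max may be absent; fall back to the conventional hwmon maximum (assumption)
    max_file = hwmon / "pwm1_max"
    pwm_max = int(max_file.read_text().strip()) if max_file.exists() else 255
    return mode, round(100 * pwm / pwm_max)
```

Calling it with the GPU's hwmon directory right after a set attempt shows whether the driver itself reports the lower speed, independent of what the tool printed.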

Yes, it just stays at the lower speed while reporting that lower speed.

dmesg says: "[17629.054891] amdgpu: manual fan speed control should be enabled first"

How can I change the "/sys/class/drm/card0/device/drm/card0/device/hwmon/hwmon2/pwm1_enable" value? I think that is where the problem is.
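The hwmon index (hwmon2 in the path above) can differ between machines and boots, so it may help to look the directory up rather than hard-code it. A minimal sketch, assuming the GPU's hwmon device identifies itself as "amdgpu" in its `name` file; `find_amdgpu_hwmon` and its `sys_root` parameter are hypothetical names used only for illustration.

```python
from pathlib import Path


def find_amdgpu_hwmon(sys_root="/sys"):
    """Return the hwmon directories whose 'name' file reads 'amdgpu'."""
    matches = []
    for name_file in sorted(Path(sys_root).glob("class/hwmon/hwmon*/name")):
        if name_file.read_text().strip() == "amdgpu":
            matches.append(name_file.parent)
    return matches
```

On a system with a single AMD GPU this would be expected to return one directory, which contains the pwm1_enable file discussed here.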

Can you echo "1" to pwm1_enable first? If that works, then it looks like the SMI tool has a bug where it's not writing "manual" to the pwm1_enable file before trying to change the value. If we do that, then we should be good to set it to the value that you desire.
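The ordering described here (switch pwm1_enable to manual, then write the speed) can be sketched as follows. This is only an illustration under assumptions: `set_fan_percent` is a hypothetical helper, the 0-255 PWM range is the usual hwmon convention rather than something confirmed in this thread, and the writes need root on a real system.

```python
from pathlib import Path

MANUAL = "1"  # amdgpu hwmon: 1 = manual fan control, 2 = automatic


def set_fan_percent(hwmon_dir, percent):
    """Enable manual mode, then write the duty cycle; return the raw PWM value."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    hwmon = Path(hwmon_dir)
    # Order matters: per the dmesg message in this thread, the driver wants
    # manual fan speed control enabled before a speed is written.
    (hwmon / "pwm1_enable").write_text(MANUAL)
    pwm = round(percent * 255 / 100)  # assumes the conventional 0-255 range
    (hwmon / "pwm1").write_text(str(pwm))
    return pwm
```

If the first write fails or does not stick, the second one is expected to hit the same "manual fan speed control should be enabled first" message.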

Sorry, I tried it, but I can't change the value in the /sys file system (I used the root user to change the value). I think the file is generated by the driver or changed by rocm-smi. Thanks for your reply!

I use the LACT tool, but it has a bug like the one we are seeing here (ilya-zlobintsev/LACT#255). It looks like the same problem, and we need someone to fix it. Thanks~

The file is created by amdgpu:
https://github.com/ROCm/ROCK-Kernel-Driver/blob/master/drivers/gpu/drm/amd/pm/amdgpu_pm.c#L3284
When you echo to it, try using tee instead of a straight redirect, e.g.

$ cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
2
$ echo 1 | sudo tee /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
1
$ cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
1
$ echo 2 | sudo tee /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
2
$ cat /sys/class/drm/card0/device/hwmon/hwmon2/pwm1_enable
2

If it fails, check if dmesg says why or if it returns a value like -22 (which means the driver thinks that fan control isn't supported on the device)
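To see which error code a failed write returns, the attempt can be wrapped so the errno is reported by name. A minimal sketch; `try_write` is a hypothetical helper, and the only fact relied on is that errno 22 is EINVAL on Linux, which is the "-22" mentioned above.

```python
import errno


def try_write(path, value):
    """Write value to path; return None on success or the errno name on failure."""
    try:
        # Closing the file flushes the write, so a driver rejection
        # surfaces as OSError inside this try block.
        with open(path, "w") as f:
            f.write(value)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

A result of "EINVAL" would correspond to the -22 case, while "EACCES" or "EPERM" would point to a permissions problem instead.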

Sorry, it doesn't work......
Screenshot 2024-05-31 21-50-27

Does dmesg say anything as to why? Ideally there would be a message there to say what's happening. Grabbing and attaching the full dmesg, from boot to the failed attempt to change the fans, would help. Maybe something showed up during device init, or after you tried to set the fans, to give us a clue as to what's up.

Here is the amdgpu dmesg output from the reboot I did just now.

alic-li@alic-li-B660M-D2H-DDR4:~$ sudo dmesg | grep "amdgpu"
[ 6.569868] [drm] amdgpu kernel modesetting enabled.
[ 6.569955] amdgpu: CRAT table disabled by module option
[ 6.569957] amdgpu: Virtual CRAT table created for CPU
[ 6.569967] amdgpu: Topology: Add CPU node
[ 6.570073] amdgpu 0000:03:00.0: enabling device (0006 -> 0007)
[ 6.573665] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from VFCT
[ 6.573667] amdgpu: ATOM BIOS: 113-D7070100-138
[ 6.577267] amdgpu 0000:03:00.0: amdgpu: CP RS64 enable
[ 6.580762] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
[ 6.582366] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[ 6.582368] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 6.582406] amdgpu 0000:03:00.0: amdgpu: MEM ECC is active.
[ 6.582406] amdgpu 0000:03:00.0: amdgpu: SRAM ECC is not presented.
[ 6.582410] amdgpu 0000:03:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[101] ras_mask[101]
[ 6.582453] amdgpu 0000:03:00.0: BAR 2: releasing [mem 0x6010000000-0x60101fffff 64bit pref]
[ 6.582455] amdgpu 0000:03:00.0: BAR 0: releasing [mem 0x6000000000-0x600fffffff 64bit pref]
[ 6.582477] amdgpu 0000:03:00.0: BAR 0: assigned [mem 0x5000000000-0x5fffffffff 64bit pref]
[ 6.582483] amdgpu 0000:03:00.0: BAR 2: assigned [mem 0x4800000000-0x48001fffff 64bit pref]
[ 6.582521] amdgpu 0000:03:00.0: amdgpu: VRAM: 46064M 0x0000008000000000 - 0x0000008B3EFFFFFF (46064M used)
[ 6.582523] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 6.582524] amdgpu 0000:03:00.0: amdgpu: AGP: 267862016M 0x0000008C00000000 - 0x0000FFFFFFFFFFFF
[ 6.582672] [drm] amdgpu: 46064M of VRAM memory ready
[ 6.582674] [drm] amdgpu: 32025M of GTT memory ready.
[ 6.583598] amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
[ 6.798558] amdgpu 0000:03:00.0: amdgpu: GECC is enabled
[ 6.815399] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 6.815404] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 6.815444] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x004e7c00 (78.124.0)
[ 6.815455] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[ 6.982604] amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!
[ 7.263871] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 7.474806] amdgpu: HMM registered 46064MB device memory
[ 7.475746] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 7.475748] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 7.475772] amdgpu: Virtual CRAT table created for GPU
[ 7.475895] amdgpu: Topology: Add dGPU node [0x7448:0x1002]
[ 7.475896] kfd kfd: amdgpu: added device 1002:7448
[ 7.475908] amdgpu 0000:03:00.0: amdgpu: SE 6, SH per SE 2, CU per SH 8, active_cu_number 96
[ 7.475963] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 7.475964] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 7.475964] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 7.475965] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 7.475965] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 7.475966] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 7.475967] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 7.475967] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 7.475968] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 7.475968] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 7.475969] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 7.475970] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 7.475970] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[ 7.475971] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
[ 7.475971] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[ 7.479046] amdgpu 0000:03:00.0: amdgpu: Using BACO for runtime pm
[ 7.479344] [drm] Initialized amdgpu 3.54.0 20150101 for 0000:03:00.0 on minor 0
[ 7.485295] fbcon: amdgpudrmfb (fb0) is primary device
[ 7.485297] amdgpu 0000:03:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[ 11.209759] amdgpu 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 11.758890] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[ 25.257576] amdgpu: manual fan speed control should be enabled first
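In a log this long, the fan- and SMU-related lines are easy to miss. A minimal sketch of filtering saved dmesg text for them, assuming the bracketed-timestamp format shown above; `find_messages` and the default keyword list are hypothetical choices for illustration.

```python
import re

# Matches lines such as "[ 25.257576] amdgpu: manual fan speed control ..."
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def find_messages(dmesg_text, keywords=("fan", "smu")):
    """Return (timestamp, message) pairs whose message mentions any keyword."""
    hits = []
    for line in dmesg_text.splitlines():
        m = LINE.match(line)
        if m and any(k in m.group(2).lower() for k in keywords):
            hits.append((float(m.group(1)), m.group(2)))
    return hits
```

Run over the log above, this would pick out the SMU interface version lines from device init together with the final "manual fan speed control should be enabled first" message.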

So I managed to find a NV31 internally, and can reproduce the same as you have there. I enabled some additional logging and found that the SMU isn't reporting WHY it can't do it, just that it isn't doing it. @ppanchad-amd Can we make an internal JIRA for this and assign it to the SMU team for Navi31? Thanks!

@Alic-Li @kentrussell Internal ticket has been created to investigate this issue. Thanks!

Sure! Thanks for your help. I'll wait for your good news and for the update 😉