[Issue]: Installing amdgpu-dkms 1:6.8.5.60200-2009582.24.04 on Radeon Pro W7900 leads to a serious operating system crash related to the AMD GPU driver
Closed this issue · 16 comments
Problem Description
About five days ago I received an update from the Radeon repositories, so I updated amdgpu-dkms on my Ubuntu 22.04 system. Unfortunately, this update broke my operating system. The specific symptom is that I can almost never get into my GNOME desktop, and when I do manage to get in, the desktop flickers and I cannot open any software. So I switched to openSUSE and updated openSUSE's amdgpu-dkms, and then the KDE desktop showed the same problem. Then I backed up my data, reinstalled Ubuntu and openSUSE, and rebuilt the production environment. As a result, I hit the problem again when installing the driver. I tried reinstalling the GNOME desktop, but it didn't work. When I rebooted the operating system, I got what is shown in the following photos. I reinstalled Ubuntu 22.04, Ubuntu 24.04, and openSUSE, but that did not solve the problem.
Operating System
Ubuntu 22.04.3 (Jammy Jellyfish) | openSUSE Tumbleweed | Ubuntu 24.04 LTS
CPU
Intel Core i3-12100 with UHD 730
GPU
AMD Radeon Pro W7900
ROCm Version
ROCm 6.2.0
ROCm Component
No response
Steps to Reproduce
sudo apt-get install amdgpu-dkms
sudo reboot
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
alic-li@alic-li-B660M-D2H-DDR4:~$ rocminfo
ROCk module version 6.8.5 is loaded
HSA System Attributes
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
Agent 1
Name: 12th Gen Intel(R) Core(TM) i3-12100
Uuid: CPU-XX
Marketing Name: 12th Gen Intel(R) Core(TM) i3-12100
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 49152(0xc000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 4300
BDFID: 0
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65595952(0x3e8ea30) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65595952(0x3e8ea30) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65595952(0x3e8ea30) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx1100
Uuid: GPU-ed466fc6e51f9536
Marketing Name: AMD Radeon PRO W7900
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29768(0x7448)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1760
BDFID: 768
Internal Node ID: 1
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 232
SDMA engine uCode:: 21
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 47169536(0x2cfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 47169536(0x2cfc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
In addition, the fan speed of the W7900 still cannot be adjusted. I updated to the latest amdgpu-dkms driver and tried to adjust the speed from the tty5 console, but it still cannot be changed.
@Alic-Li Internal ticket has been created to investigate this issue. Thanks!
Can you attach a full dmesg, ideally after trying to set the fan as well? That way we can see any issues during init and the display coming up, as well as the fan messages (if they appear)
Hi kentrussell! Thanks for your reply! Here is the full dmesg output captured after setting the fan speed.
alic-li@alic-li-B660M-D2H-DDR4:~$ sudo rocm-smi --setfan 255
#============================ ROCm System Management Interface ============================
#=================================== Set GPU Fan Speed ====================================
#GPU[0] : Successfully set fan speed to level 255
#==========================================================================================
#================================== End of ROCm SMI Log ===================================
alic-li@alic-li-B660M-D2H-DDR4:~$ sudo dmesg | grep "amd"
[ 0.000000] Linux version 6.8.0-40-generic (buildd@lcy02-amd64-075) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #40-Ubuntu SMP PREEMPT_DYNAMIC Fri Jul 5 10:34:03 UTC 2024 (Ubuntu 6.8.0-40.40-generic 6.8.12)
[ 5.023750] amdkcl: loading out-of-tree module taints kernel.
[ 5.023753] amdkcl: module verification failed: signature and/or required key missing - tainting kernel
[ 6.851379] [drm] amdgpu kernel modesetting enabled.
[ 6.851382] [drm] amdgpu version: 6.8.5
[ 6.851484] amdgpu: Virtual CRAT table created for CPU
[ 6.851491] amdgpu: Topology: Add CPU node
[ 6.853136] amdgpu 0000:03:00.0: enabling device (0006 -> 0007)
[ 6.857377] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from VFCT
[ 6.857379] amdgpu: ATOM BIOS: 113-D7070100-138
[ 6.861132] amdgpu 0000:03:00.0: amdgpu: CP RS64 enable
[ 6.866216] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_early_init [amdgpu]] JPEG decode is enabled in VM mode
[ 6.879932] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[ 6.879935] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 6.879961] amdgpu 0000:03:00.0: amdgpu: MEM ECC is active.
[ 6.879962] amdgpu 0000:03:00.0: amdgpu: SRAM ECC is not presented.
[ 6.879968] amdgpu 0000:03:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[101] ras_mask[101]
[ 6.879988] amdgpu 0000:03:00.0: BAR 2 [mem 0x6010000000-0x60101fffff 64bit pref]: releasing
[ 6.879990] amdgpu 0000:03:00.0: BAR 0 [mem 0x6000000000-0x600fffffff 64bit pref]: releasing
[ 6.880013] amdgpu 0000:03:00.0: BAR 0 [mem 0x5000000000-0x5fffffffff 64bit pref]: assigned
[ 6.880020] amdgpu 0000:03:00.0: BAR 2 [mem 0x4800000000-0x48001fffff 64bit pref]: assigned
[ 6.880056] amdgpu 0000:03:00.0: amdgpu: VRAM: 46064M 0x0000008000000000 - 0x0000008B3EFFFFFF (46064M used)
[ 6.880058] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x00007FFF00000000 - 0x00007FFF1FFFFFFF
[ 6.880122] [drm] amdgpu: 46064M of VRAM memory ready
[ 6.880124] [drm] amdgpu: 32029M of GTT memory ready.
[ 6.957150] amdgpu 0000:03:00.0: amdgpu: reserve 0x1300000 from 0x8b3c000000 for PSP TMR
[ 7.097529] amdgpu 0000:03:00.0: amdgpu: GECC is enabled
[ 7.114392] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 7.114396] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 7.114433] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x00000040, smu fw program = 0, smu fw version = 0x004e7e00 (78.126.0)
[ 7.114443] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[ 7.280358] amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!
[ 7.546739] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 7.810808] amdgpu: HMM registered 46064MB device memory
[ 7.811871] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 7.811882] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[ 7.811909] amdgpu: Virtual CRAT table created for GPU
[ 7.812031] amdgpu: Topology: Add dGPU node [0x7448:0x1002]
[ 7.812032] kfd kfd: amdgpu: added device 1002:7448
[ 7.812043] amdgpu 0000:03:00.0: amdgpu: SE 6, SH per SE 2, CU per SH 8, active_cu_number 96
[ 7.812046] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 7.812047] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 7.812048] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 7.812048] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 7.812049] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 7.812050] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 7.812050] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 7.812051] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 7.812051] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 7.812052] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 7.812052] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 7.812053] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 7.812054] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[ 7.812054] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
[ 7.812055] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[ 7.815559] amdgpu 0000:03:00.0: amdgpu: Using BAMACO for runtime pm
[ 7.815855] [drm] Initialized amdgpu 3.58.0 20150101 for 0000:03:00.0 on minor 1
[ 7.822019] fbcon: amdgpudrmfb (fb0) is primary device
[ 7.822022] amdgpu 0000:03:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[ 10.563346] amdgpu 0000:03:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[ 10.646171] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[ 465.781001] amdgpu: manual fan speed control should be enabled first
[ 545.327324] amdgpu: manual fan speed control should be enabled first
[ 592.080974] amdgpu: manual fan speed control should be enabled first
###I set it three times###
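For anyone comparing their own log against the one above, here is a minimal sketch that pulls out the two kinds of lines that stand out in it: the repeated fan rejection message and the SMU interface versions the driver reports. The function names `count_fan_rejections` and `smu_if_versions` are made up for this illustration, and both read log text from stdin so they can be tried on a saved file without root or a GPU.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of ROCm or rocm-smi.

# Count log lines where the driver refused a fan request because manual
# fan control was not enabled. Always prints a number, including 0.
count_fan_rejections() {
    grep -c 'manual fan speed control should be enabled first' || true
}

# Print "driver=<hex> firmware=<hex>" taken from the SMU interface-version
# line, in the format that appears in the dmesg output above.
smu_if_versions() {
    sed -n 's/.*smu driver if version = \(0x[0-9a-fA-F]*\), smu fw if version = \(0x[0-9a-fA-F]*\),.*/driver=\1 firmware=\2/p'
}
```

Possible usage would be `sudo dmesg | count_fan_rejections` and `sudo dmesg | smu_if_versions`. The sketch only reports what the log says; it does not tell you whether a version difference is the cause of any symptom.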
So I don't see anything obvious for the flickering screen there, but I am less of a graphics guy (the internal ticket should be able to make progress there). For the fan, if you try to just set the fan control to manual without setting a value, does it stay as "auto"? You can do it manually via:
cd /sys/class/drm/card0/device/hwmon
cd hwmonN (on my test machine it's hwmon2, but it depends on your system config)
cat ./pwm1_enable
(Manual=1, Auto=2, off=0)
Then try to set it to manual with:
echo 1|sudo tee ./pwm1_enable
Then verify it with
cat ./pwm1_enable
If it stays at 2, it is likely that the firmware isn't actually changing the setting (and isn't giving us an error to say why). The internal ticket should be able to verify that pretty quickly. If it does change to 1, then maybe there's a bug in the SMI where it's not setting fan control to manual before trying to change the speed
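The write-then-read-back check described above can be wrapped up roughly like this. It is a sketch, not an official tool: `pwm_mode_name` and `request_manual_fan` are hypothetical helper names, the hwmon directory is passed in as an argument because its number differs between systems, and writing to the real sysfs file needs root.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Map a pwm1_enable value to the labels given above (Manual=1, Auto=2, off=0).
pwm_mode_name() {
    case "$1" in
        0) echo off ;;
        1) echo manual ;;
        2) echo auto ;;
        *) echo unknown ;;
    esac
}

# Write 1 (manual) to pwm1_enable in the given hwmon directory, then read the
# file back and print the mode that is actually in effect after the write.
request_manual_fan() {
    hwmon_dir=$1
    echo 1 > "$hwmon_dir/pwm1_enable" || return 1
    pwm_mode_name "$(cat "$hwmon_dir/pwm1_enable")"
}
```

Run as root, something like `request_manual_fan /sys/class/drm/card0/device/hwmon/hwmon2` (path taken from the example above, adjust to your system) printing `auto` would correspond to the "stays at 2" case, and `manual` to the "does change to 1" case.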
By the way, I finally figured out why installing amdgpu-dkms 1:6.8.5.60200-2009582.24.04 on the Radeon Pro W7900 led to the serious operating system crash related to the AMD GPU driver. After repairing my operating system, I tried reinstalling amdgpu-dkms, but it didn't work. However, when I installed the AMD open source GPU driver packages over the existing installation, the miracle happened: the GNOME desktop environment relies on the AMD open source GPU driver. So I solved this problem myself. I think this might give you a clue toward a fix. Maybe installing amdgpu-dkms affects the system's original driver.
sudo apt install amdgpu amdgpu-core amdgpu-lib
--After executing the command, the system desktop environment returns to normal
Could I ask your video card model?
It also didn't work
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ echo 1|sudo tee ./pwm1_enable
1
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ cat ./pwm1_enable
2
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ cat ./pwm1_enable
2
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ echo 255|sudo tee ./pwm1
255
tee: ./pwm1: Invalid parameters
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ cat ./pwm1
51
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ echo 1|sudo tee ./pwm1_enable
1
alic-li@alic-li-B660M-D2H-DDR4:/sys/class/drm/card1/device/hwmon/hwmon2$ cat ./pwm1_enable
2
So I don't think there is a bug in the SMI. Thanks for your help. I'll wait for the result of the internal ticket. In addition, I have an RX 6750 XT video card, and its fan speed can be adjusted under the same conditions. That's really a bit weird. I hope we can make the Radeon software ecosystem better together.
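Note that the session above ended up under card1/hwmon2 rather than the card0 path from the earlier instructions, so the index clearly varies between machines. A small sketch for locating the right directory, assuming the amdgpu hwmon device identifies itself through a `name` file containing `amdgpu`; the function name `find_amdgpu_hwmon` is hypothetical and the root directory is a parameter so it can be tried on a fake tree.

```shell
#!/bin/sh
# Sketch only: find_amdgpu_hwmon is a hypothetical helper name.

# List hwmon directories below a DRM class root whose "name" file reads
# "amdgpu". Defaults to /sys/class/drm when no root is given.
find_amdgpu_hwmon() {
    drm_root=${1:-/sys/class/drm}
    for dir in "$drm_root"/card*/device/hwmon/hwmon*; do
        # Skip unmatched globs and entries without a readable name file.
        [ -r "$dir/name" ] || continue
        [ "$(cat "$dir/name")" = "amdgpu" ] && echo "$dir"
    done
    return 0
}
```

Calling `find_amdgpu_hwmon` with no argument prints one line per matching directory, which can then be used for the `pwm1_enable` and `pwm1` checks shown in this thread.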
So amdgpu-dkms will replace the amdgpu kernel module with the newer one. It also points to a regression in the newer amdgpu-dkms code.
As for my model that I tested, it's an old Fiji Nano R9 Fury. It does what I need it to do for testing simple things like power management.
The internal ticket should be good @ppanchad-amd . Can you add this info to it as well? Thanks!
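Since amdgpu-dkms replaces the amdgpu kernel module, it can help to confirm which module is actually loaded before and after installing it. The sketch below assumes the loaded module exports its version at /sys/module/amdgpu/version (the 6.8.5 seen in the rocminfo and dmesg output earlier in this thread); a module that exports no version is reported separately. The function name `amdgpu_module_version` is hypothetical, and the sysfs root is a parameter so the logic can be exercised without a GPU.

```shell
#!/bin/sh
# Sketch only: amdgpu_module_version is a hypothetical helper name.

# Print the version string exported by the loaded amdgpu module, or a short
# note when the module exports no version or is not loaded at all.
amdgpu_module_version() {
    mod_root=${1:-/sys/module}
    if [ -r "$mod_root/amdgpu/version" ]; then
        cat "$mod_root/amdgpu/version"
    elif [ -d "$mod_root/amdgpu" ]; then
        echo "loaded, no version exported"
    else
        echo "not loaded"
    fi
}
```

Comparing the output of `amdgpu_module_version` across a reboot is one rough way to see whether the DKMS-built module took over from the distribution's module.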
@kentrussell Will do. Thanks!
Hi @Alic-Li, an update on this: we've found a driver incompatibility in ROCm 6.2 that can cause the flickering screen + slow app loading issue in some configurations. This is being addressed in future ROCm releases, but for now additional workarounds are installing ROCm using the installer with --usecase=graphics,rocm or updating Mesa drivers. If the solution you found for your system is still working for you, great! If not, you can try one of those additional workarounds. Thanks for bringing this to our attention.
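For readers who want to try the first workaround, the invocation would look roughly like the following, assuming "the installer" means AMD's `amdgpu-install` script. The sketch only prints the command so it can be reviewed first; the helper name `install_cmd` is hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: the installer referred to above is amdgpu-install.

# Build the installer command line for a comma-separated list of use cases.
install_cmd() {
    echo "sudo amdgpu-install --usecase=$1"
}

# Print the command for the use cases named in the comment above
# instead of running it.
install_cmd "graphics,rocm"
```

Printing rather than executing keeps the example safe to run anywhere; copy the printed line into a terminal once you have confirmed it matches the installer shipped for your distribution.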
I'm glad to hear you found the problem. Thank you for your reply! My solution still works. Can I close this issue? I hope my solution helps others who run into this problem.
Sure, if your problem has been solved I think we can close this issue. Feel free to reopen it if your solution stops working. Thanks again for your report and investigation!