[Issue]: Linux system can't properly enter sleep mode because of amdgpu driver

Question

Closed this issue 13 days ago · 20 comments

Problem Description

Hi, I've noticed that my computer sometimes cannot enter sleep mode.

linux : 6.8.0-40-generic #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

amdgpu : 6.8.5

Some pieces from dmesg

[ 6851.472194]  amdgpu_ttm_tt_populate+0xb4/0xf0 [amdgpu]
[ 6851.472551]  amdgpu_ttm_evict_resources+0x36/0x70 [amdgpu]
[ 6851.472779]  amdgpu_device_prepare+0x59/0x180 [amdgpu]
[ 6851.473002]  amdgpu_pmops_prepare+0x43/0x80 [amdgpu]
[ 6851.473414] amdgpu 0000:01:00.0: PM: device_prepare(): pci_pm_prepare+0x0/0x80 returns -12
[ 6851.473419] amdgpu 0000:01:00.0: PM: not prepared for power transition: code -12
[ 6852.488444] amdgpu 0000:01:00.0: PM: device_prepare(): pci_pm_prepare+0x0/0x80 returns -12
[ 6852.488451] amdgpu 0000:01:00.0: PM: not prepared for power transition: code -12
[ 7753.329356]  amdgpu_ttm_tt_populate+0xb4/0xf0 [amdgpu]
[ 7753.329766]  amdgpu_ttm_evict_resources+0x36/0x70 [amdgpu]
[ 7753.329994]  amdgpu_device_prepare+0x59/0x180 [amdgpu]
[ 7753.330217]  amdgpu_pmops_prepare+0x43/0x80 [amdgpu]
[ 7753.330867] amdgpu 0000:01:00.0: PM: device_prepare(): pci_pm_prepare+0x0/0x80 returns -12
[ 7753.330872] amdgpu 0000:01:00.0: PM: not prepared for power transition: code -12
[ 7753.799537] amdgpu 0000:01:00.0: PM: device_prepare(): pci_pm_prepare+0x0/0x80 returns -12
[ 7753.799544] amdgpu 0000:01:00.0: PM: not prepared for power transition: code -12

Operating System

6.8.0-40-generic #40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

CPU

core i5-9600

GPU

AMD Radeon VII

ROCm Version

ROCm 6.2.0

ROCm Component

No response

Steps to Reproduce

Install the driver. Try to put the computer in sleep mode.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

tcgu-amd commented 20 days ago

Any luck?

Answer 1 · 2024-10-15T15:43:27.000Z

Hi @lovely-error. Internal ticket has been created to investigate this issue. Thanks!

Answer 2 · 2024-10-16T14:58:57.000Z

Hi @lovely-error, thank you for reaching out! By "sometimes", do you mean that the error is intermittent? Also, can you double-check that runtime power management is enabled both in your BIOS and in your system? Thanks!

Answer 3 · 2024-10-18T11:16:34.000Z

@tcgu-amd When I cold boot the PC, on first suspend, everything goes well. But on any subsequent attempt to suspend, the procedure fails, and I see the PM: not prepared for power transition: code -12 errors.

Answer 4 · 2024-10-18T16:44:15.000Z

@lovely-error It seems that PM: not prepared for power transition: code -12 suggests the system is out of memory (-12 is -ENOMEM). Are you running any memory-heavy jobs by any chance? If not, there might be a memory access issue. Do you know how your system is managing power settings? Thanks!
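
For reference, kernel return codes are negated errno values, so -12 corresponds to ENOMEM. A minimal Python sketch for decoding the code in such a dmesg line follows; the helper name `decode_pm_error` is hypothetical and not part of any existing tool.

```python
import errno
import os
import re


def decode_pm_error(line):
    """Return (errno name, description) for a 'code -N' dmesg line, or None."""
    match = re.search(r"code -(\d+)", line)
    if match is None:
        return None
    number = int(match.group(1))
    # errno.errorcode maps numbers to symbolic names, e.g. 12 -> "ENOMEM" on Linux.
    return errno.errorcode.get(number, "UNKNOWN"), os.strerror(number)


line = "amdgpu 0000:01:00.0: PM: not prepared for power transition: code -12"
print(decode_pm_error(line))
```

The same helper can be applied to any other "code -N" line in the log to see which errno the driver returned.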

Answer 5 · 2024-10-19T10:29:20.000Z

@tcgu-amd The problem also occurs when nothing is running on the computer besides standard processes; there are always a couple of gigabytes of RAM free under default load too. Unfortunately, I don't know exactly how my system is managing power.

Answer 6 · 2024-10-19T16:07:09.000Z

For anyone in need of an easy way to reproduce this problem:

I'm using Ollama, and I consistently have this issue whenever a model is running on the GPU. When I stop the model, it goes away. Ollama uses about 12 GB of VRAM and I have 32 GB of system memory, so I'm a little surprised that I'm running into memory issues...

Answer 7 · 2024-10-21T17:03:42.000Z

@lovely-error Thank you for the additional context. To help us better diagnose your problem, would you mind running

apt show amdgpu

and pasting the exact output? Thanks!

Answer 8 · 2024-10-21T17:07:48.000Z

@djpetti Thanks for reporting the problem! Would you mind providing more details so we can try reproducing your problem?

Information that might be helpful to us includes: OS distro/version, ROCm version, GPU model, CPU model, Python Version, output of rocminfo, and the model/command you are using to run Ollama.

Thanks!!

Answer 9 · 2024-10-21T18:17:57.000Z

@djpetti This might be related to your issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2362. It should have been fixed in the latest driver by commit 5095d54.

Can you please verify that there is enough RAM and swap available to hold the contents of VRAM?
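
One rough way to run that check is sketched below in Python. It compares MemAvailable plus SwapFree from /proc/meminfo against the amdgpu VRAM usage counter in sysfs. The function names are hypothetical, and the `card0` index is an assumption that may differ on your system. MemAvailable is only an estimate, so treat the result as a hint rather than a guarantee.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> numeric value (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def vram_fits(meminfo_text, vram_used_bytes):
    """True if MemAvailable + SwapFree could hold the VRAM currently in use."""
    info = parse_meminfo(meminfo_text)
    available_kb = info.get("MemAvailable", 0) + info.get("SwapFree", 0)
    return available_kb * 1024 >= vram_used_bytes


if __name__ == "__main__":
    meminfo_path = Path("/proc/meminfo")
    # Assumption: the GPU is card0; the index may differ on your system.
    vram_path = Path("/sys/class/drm/card0/device/mem_info_vram_used")
    if meminfo_path.exists() and vram_path.exists():
        used = int(vram_path.read_text())
        print("VRAM in use (bytes):", used)
        print("Fits in RAM + swap:", vram_fits(meminfo_path.read_text(), used))
    else:
        print("Could not find", meminfo_path, "or", vram_path)
```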

Also, please try upgrading your amdgpu driver and your Linux kernel to the latest versions to see if the problem persists.

Hope this helps.

Thanks!

Answer 10 · 2024-10-21T18:37:34.000Z

@lovely-error @djpetti It will also help if you can set /sys/power/pm_debug_messages to 1 and show the logs from dmesg again. It should reveal more details.
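
As a minimal sketch, the switch can be flipped from Python like this. The helper names are hypothetical; writing to the real sysfs file requires root, and the `path` parameter exists only so the helpers can be tried against an ordinary file.

```python
from pathlib import Path

PM_DEBUG = Path("/sys/power/pm_debug_messages")


def set_pm_debug(enabled, path=PM_DEBUG):
    """Write 1 or 0 to the pm_debug_messages switch; needs root on a real system."""
    Path(path).write_text("1\n" if enabled else "0\n")


def pm_debug_enabled(path=PM_DEBUG):
    """Return True if the switch currently reads 1."""
    return Path(path).read_text().strip() == "1"
```

After calling `set_pm_debug(True)` as root, attempt a suspend and then collect the dmesg output again.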

Answer 11 · 2024-10-24T17:06:51.000Z

Hi @tcgu-amd
apt show amdgpu

Package: amdgpu
Version: 1:6.2.60202-2041575.22.04
Priority: optional
Section: metapackages
Maintainer: Advanced Micro Devices (AMD) <slava.grigorev@amd.com>
Installed-Size: 9 216 B
Depends: amdgpu-dkms, amdgpu-lib (= 1:6.2.60202-2041575.22.04)
Download-Size: 1 684 B
APT-Sources: https://repo.radeon.com/amdgpu/latest/ubuntu jammy/main amd64 Packages

What should I look for in dmesg specifically?

[57666.736270] 0 pages hwpoisoned
[57666.736402] [TTM] Buffer eviction failed
[57666.736404] [drm] evicting device resources failed
[57666.736406] amdgpu 0000:01:00.0: PM: device_prepare(): pci_pm_prepare+0x0/0x80 returns -12
[57666.736410] amdgpu 0000:01:00.0: PM: not prepared for power transition: code -12
[57666.736411] PM: start suspend of devices aborted after 1362.589 msecs
[57666.736413] PM: Some devices failed to suspend, or early wake event detected

Answer 12 · 2024-10-24T19:26:19.000Z

Hi @lovely-error, thanks for getting back! What I was trying to see is which part of the driver is causing the problem. Based on your log, I think the problematic section is likely here https://github.com/ROCm/ROCK-Kernel-Driver/blob/master/drivers/gpu/drm/ttm/ttm_resource.c#L374. I am now starting to suspect that the problem might have something to do with DMA, especially since you have an Intel CPU. However, it is strange that it would work the first time you boot your PC. Can you show the dmesg of a successful suspend?

Apologies for the back and forth, and thank you for your patience!

Answer 13 · 2024-10-27T11:54:59.000Z

@tcgu-amd Hi, sorry that I often respond late.
This is an example of successful sleep (I grepped the dmesg with "PM|amdgpu")

[ 4166.691184] PM: suspend entry (deep)
[ 4168.112939] amdgpu 0000:01:00.0: amdgpu: PCI CONFIG reset
[ 4168.136864] ACPI: PM: Preparing to enter system sleep state S3
[ 4168.347483] ACPI: PM: Saving platform NVS memory
[ 4168.369551] ACPI: PM: Low-level resume complete
[ 4168.369623] ACPI: PM: Restoring platform NVS memory
[ 4168.376542] ACPI: PM: Waking up from system sleep state S3
[ 4168.863365] PM: suspend exit

I also found this in dmesg. Perhaps it is relevant:

[ 2983.868066] Call Trace:
[ 2983.868067]  <TASK>
[ 2983.868069]  dump_stack_lvl+0x76/0xa0
[ 2983.868074]  dump_stack+0x10/0x20
[ 2983.868076]  warn_alloc+0x174/0x1f0
[ 2983.868080]  __alloc_pages_slowpath.constprop.0+0x911/0x9e0
[ 2983.868084]  __alloc_pages+0x31d/0x350
[ 2983.868087]  amdttm_pool_alloc+0x1b3/0x5e0 [amdttm]
[ 2983.868098]  amdgpu_ttm_tt_populate+0xb4/0xf0 [amdgpu]
[ 2983.868419]  amdttm_tt_populate+0xb1/0x170 [amdttm]
[ 2983.868426]  ttm_bo_handle_move_mem+0x1b1/0x1f0 [amdttm]
[ 2983.868434]  ttm_mem_evict_first+0x425/0x5c0 [amdttm]
[ 2983.868441]  amdttm_resource_manager_evict_all+0x9a/0x210 [amdttm]
[ 2983.868449]  ? __pfx_pci_pm_prepare+0x10/0x10
[ 2983.868452]  amdgpu_ttm_evict_resources+0x36/0x70 [amdgpu]
[ 2983.868687]  amdgpu_device_prepare+0x59/0x180 [amdgpu]
[ 2983.868908]  ? __pfx_pci_pm_prepare+0x10/0x10
[ 2983.868910]  amdgpu_pmops_prepare+0x43/0x80 [amdgpu]
[ 2983.869129]  pci_pm_prepare+0x32/0x80
[ 2983.869131]  device_prepare+0x93/0x1e0
[ 2983.869134]  dpm_prepare+0xcb/0x2b0
[ 2983.869137]  dpm_suspend_start+0x25/0xc0
[ 2983.869140]  suspend_devices_and_enter+0x172/0x2f0
[ 2983.869142]  enter_state+0x21b/0x5f0
[ 2983.869144]  pm_suspend+0x44/0xe0
[ 2983.869146]  state_store+0x2b/0x60
[ 2983.869148]  kobj_attr_store+0xf/0x40
[ 2983.869150]  sysfs_kf_write+0x3b/0x60
[ 2983.869153]  kernfs_fop_write_iter+0x130/0x210
[ 2983.869155]  vfs_write+0x2a5/0x480
[ 2983.869159]  ksys_write+0x73/0x100
[ 2983.869161]  __x64_sys_write+0x19/0x30
[ 2983.869162]  x64_sys_call+0x23e1/0x24b0
[ 2983.869164]  do_syscall_64+0x81/0x170
[ 2983.869168]  ? syscall_exit_to_user_mode+0x89/0x260
[ 2983.869170]  ? do_syscall_64+0x8d/0x170
[ 2983.869172]  ? filemap_map_pages+0x2f9/0x4c0
[ 2983.869176]  ? do_read_fault+0x112/0x1d0
[ 2983.869179]  ? do_fault+0x109/0x350
[ 2983.869181]  ? handle_pte_fault+0x114/0x1d0
[ 2983.869183]  ? __handle_mm_fault+0x64e/0x790
[ 2983.869186]  ? __count_memcg_events+0x80/0x130
[ 2983.869188]  ? count_memcg_events.constprop.0+0x2a/0x50
[ 2983.869191]  ? handle_mm_fault+0xad/0x380
[ 2983.869194]  ? do_user_addr_fault+0x337/0x670
[ 2983.869196]  ? irqentry_exit_to_user_mode+0x7e/0x260
[ 2983.869198]  ? irqentry_exit+0x43/0x50
[ 2983.869200]  ? exc_page_fault+0x94/0x1b0
[ 2983.869202]  entry_SYSCALL_64_after_hwframe+0x78/0x80

PS. No worries about back and forth. I am glad you're investigating this

Answer 14 · 2024-10-27T20:14:11.000Z

Sorry for the late response. Here are my system specs:

Ubuntu 24.04.1
ROCm runtime 1.1
GPU: 7900 XT
CPU: 3900X
Python 3.12.3

Output of rocminfo:

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 9 3900X 12-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 9 3900X 12-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3800
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            24
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    32773036(0x1f413ac) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32773036(0x1f413ac) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    32773036(0x1f413ac) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-c43c29b0aadcacb4
  Marketing Name:          AMD Radeon RX 7900 XT
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      81920(0x14000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2025
  BDFID:                   3072
  Internal Node ID:        1
  Compute Unit:            84
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 550
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    20955136(0x13fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS:
      Size:                    20955136(0x13fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

ollama was started with:

ollama run deepseek-coder-v2:16b

nvtop reports GPU memory usage of ~12 GB when the model is running. As I previously mentioned, I have 32 GB of RAM and very little swap (~2 GB). Memory pressure on my system was relatively low when I was testing, so I think everything will fit in RAM + swap, but I'll test with as little system RAM usage as possible just to be sure.

My kernel should already be up to date. I'm not sure how to update the AMD driver (I should be running the open-source one? I was actually running the proprietary one for a while, but it completely broke Wayland on Ubuntu 24, so I switched back a few weeks ago).

Answer 15 · 2024-10-27T20:18:49.000Z

Just confirmed that this happens even when I'm sure I have sufficient space in system RAM to contain the VRAM. (I had ~4 GB out of 32 of system RAM used, and tried to suspend with ~12 GB of VRAM in use.)

I got some interesting errors in dmesg:

[ 2631.213806] PM: suspend entry (deep)
[ 2631.216586] Filesystems sync: 0.002 seconds
[ 2631.326659] Freezing user space processes
[ 2631.328705] Freezing user space processes completed (elapsed 0.002 seconds)
[ 2631.328708] OOM killer disabled.
[ 2631.328710] Freezing remaining freezable tasks
[ 2631.330301] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 2631.330317] printk: Suspending console(s) (use no_console_suspend to debug)
[ 2635.584348] systemd-sleep: page allocation failure: order:0, mode:0x100c02(GFP_NOIO|__GFP_HIGHMEM|__GFP_HARDWALL), nodemask=(null),cpuset=systemd-suspend.service,mems_allowed=0
[ 2635.584362] CPU: 5 PID: 18888 Comm: systemd-sleep Not tainted 6.8.0-47-generic #47-Ubuntu
[ 2635.584365] Hardware name: Gigabyte Technology Co., Ltd. X570 I AORUS PRO WIFI/X570 I AORUS PRO WIFI, BIOS F37d 07/18/2023
[ 2635.584366] Call Trace:
[ 2635.584368]  <TASK>
[ 2635.584372]  dump_stack_lvl+0x76/0xa0
[ 2635.584377]  dump_stack+0x10/0x20
[ 2635.584379]  warn_alloc+0x174/0x1f0
[ 2635.584385]  __alloc_pages_slowpath.constprop.0+0x936/0x9f0
[ 2635.584391]  __alloc_pages+0x31f/0x350
[ 2635.584396]  ttm_pool_alloc_page+0x53/0x1a0 [ttm]
[ 2635.584405]  ttm_pool_alloc+0x168/0x3b0 [ttm]
[ 2635.584411]  ? srso_return_thunk+0x5/0x5f
[ 2635.584418]  amdgpu_ttm_tt_populate+0xb4/0xf0 [amdgpu]
[ 2635.584669]  ttm_tt_populate+0xb4/0x170 [ttm]
[ 2635.584676]  ttm_bo_handle_move_mem+0x1b1/0x1f0 [ttm]
[ 2635.584684]  ttm_bo_evict+0xda/0x230 [ttm]
[ 2635.584692]  ? srso_return_thunk+0x5/0x5f
[ 2635.584696]  ttm_mem_evict_first+0x226/0x3e0 [ttm]
[ 2635.584704]  ttm_resource_manager_evict_all+0x9a/0x210 [ttm]
[ 2635.584712]  ? __pfx_pci_pm_prepare+0x10/0x10
[ 2635.584715]  amdgpu_ttm_evict_resources+0x36/0x70 [amdgpu]
[ 2635.584930]  amdgpu_device_prepare+0x5c/0x1a0 [amdgpu]
[ 2635.585141]  ? __pfx_pci_pm_prepare+0x10/0x10
[ 2635.585143]  amdgpu_pmops_prepare+0x43/0x80 [amdgpu]
[ 2635.585353]  pci_pm_prepare+0x35/0x80
[ 2635.585355]  device_prepare+0x96/0x1e0
[ 2635.585360]  dpm_prepare+0xcb/0x2b0
[ 2635.585363]  dpm_suspend_start+0x25/0xc0
[ 2635.585366]  suspend_devices_and_enter+0x172/0x2f0
[ 2635.585369]  enter_state+0x21b/0x5f0
[ 2635.585372]  pm_suspend+0x44/0xe0
[ 2635.585374]  state_store+0x2b/0x60
[ 2635.585377]  kobj_attr_store+0x12/0x40
[ 2635.585380]  sysfs_kf_write+0x3e/0x60
[ 2635.585383]  kernfs_fop_write_iter+0x14f/0x1e0
[ 2635.585387]  vfs_write+0x2a8/0x480
[ 2635.585392]  ksys_write+0x73/0x100
[ 2635.585395]  __x64_sys_write+0x19/0x30
[ 2635.585397]  x64_sys_call+0x7e/0x25c0
[ 2635.585400]  do_syscall_64+0x7f/0x180
[ 2635.585403]  ? srso_return_thunk+0x5/0x5f
[ 2635.585406]  ? irqentry_exit_to_user_mode+0x7e/0x260
[ 2635.585409]  ? srso_return_thunk+0x5/0x5f
[ 2635.585412]  ? irqentry_exit+0x43/0x50
[ 2635.585414]  ? srso_return_thunk+0x5/0x5f
[ 2635.585416]  ? exc_page_fault+0x94/0x1b0
[ 2635.585420]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[ 2635.585423] RIP: 0033:0x72557df1c574
[ 2635.585441] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[ 2635.585443] RSP: 002b:00007ffeb292e9d8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 2635.585446] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 000072557df1c574
[ 2635.585447] RDX: 0000000000000004 RSI: 000062729c2d7470 RDI: 0000000000000005
[ 2635.585448] RBP: 00007ffeb292ea00 R08: 000072557e003b20 R09: 0000000000000000
[ 2635.585449] R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000000004
[ 2635.585451] R13: 000062729c2d7470 R14: 000062729c2d32d0 R15: 000072557e001ee0
[ 2635.585455]  </TASK>
[ 2635.585461] Mem-Info:
[ 2635.585463] active_anon:727919 inactive_anon:6422 isolated_anon:0
                active_file:3071124 inactive_file:1547455 isolated_file:0
                unevictable:4 dirty:19 writeback:0
                slab_reclaimable:119305 slab_unreclaimable:105380
                mapped:478514 shmem:8677 pagetables:9046
                sec_pagetables:0 bounce:0
                kernel_misc_reclaimable:0
                free:93780 free_pcp:503 free_cma:0
[ 2635.585468] Node 0 active_anon:2911676kB inactive_anon:25688kB active_file:12284496kB inactive_file:6189820kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:1914056kB dirty:76kB writeback:0kB shmem:34708kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:241664kB writeback_tmp:0kB kernel_stack:20720kB pagetables:36184kB sec_pagetables:0kB all_unreclaimable? no
[ 2635.585472] Node 0 DMA free:11276kB boost:0kB min:28kB low:40kB high:52kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15372kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 2635.585477] lowmem_reserve[]: 0 2781 31842 31842 31842
[ 2635.585482] Node 0 DMA32 free:122140kB boost:0kB min:5900kB low:8748kB high:11596kB reserved_highatomic:0KB active_anon:3084kB inactive_anon:0kB active_file:2843688kB inactive_file:0kB unevictable:0kB writepending:0kB present:3058304kB managed:2992200kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 2635.585486] lowmem_reserve[]: 0 0 29060 29060 29060
[ 2635.585491] Node 0 Normal free:241704kB boost:181740kB min:243388kB low:273144kB high:302900kB reserved_highatomic:0KB active_anon:2908592kB inactive_anon:25688kB active_file:9440808kB inactive_file:6189820kB unevictable:16kB writepending:76kB present:30395392kB managed:29765464kB mlocked:16kB bounce:0kB free_pcp:2012kB local_pcp:2012kB free_cma:0kB
[ 2635.585496] lowmem_reserve[]: 0 0 0 0 0
[ 2635.585500] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11276kB
[ 2635.585516] Node 0 DMA32: 3*4kB (U) 2*8kB (UM) 8*16kB (UME) 6*32kB (UM) 7*64kB (UME) 8*128kB (UME) 4*256kB (UM) 5*512kB (UE) 6*1024kB (UME) 0*2048kB 27*4096kB (M) = 122140kB
[ 2635.585532] Node 0 Normal: 27078*4kB (ME) 8450*8kB (ME) 2354*16kB (UME) 879*32kB (UM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 241704kB
[ 2635.585546] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 2635.585547] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2635.585549] 4627298 total pagecache pages
[ 2635.585550] 29 pages in swap cache
[ 2635.585551] Free swap  = 2096892kB
[ 2635.585552] Total swap = 2097148kB
[ 2635.585553] 8367423 pages RAM
[ 2635.585553] 0 pages HighMem/MovableOnly
[ 2635.585554] 174164 pages reserved
[ 2635.585555] 0 pages hwpoisoned
[ 2635.857706] [TTM] Buffer eviction failed
[ 2635.857711] [drm] evicting device resources failed
[ 2635.857714] amdgpu 0000:0c:00.0: PM: device_prepare(): pci_pm_prepare+0x0/0x80 returns -12
[ 2635.857721] amdgpu 0000:0c:00.0: PM: not prepared for power transition: code -12
[ 2635.857723] PM: start suspend of devices aborted after 4527.420 msecs
[ 2635.857725] PM: Some devices failed to suspend, or early wake event detected
[ 2635.857728] PM: resume of devices complete after 0.001 msecs
[ 2635.858425] OOM killer enabled.
[ 2635.858426] Restarting tasks ... done.
[ 2635.860240] random: crng reseeded on system resumption
[ 2635.994402] PM: suspend exit
[ 2635.994435] thermal thermal_zone3: failed to read out thermal zone (-61)
[ 2635.994472] PM: suspend entry (s2idle)
[ 2635.997493] Filesystems sync: 0.003 seconds
[ 2636.011483] Freezing user space processes
[ 2636.012879] amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[ 2636.012886] amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x0000000000201000 from client 10
[ 2636.012889] amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B32
[ 2636.012892] amdgpu 0000:0c:00.0: amdgpu: 	 Faulty UTCL2 client ID: CPC (0x5)
[ 2636.012895] amdgpu 0000:0c:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 2636.012897] amdgpu 0000:0c:00.0: amdgpu: 	 WALKER_ERROR: 0x1
[ 2636.012899] amdgpu 0000:0c:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[ 2636.012901] amdgpu 0000:0c:00.0: amdgpu: 	 MAPPING_ERROR: 0x1
[ 2636.012903] amdgpu 0000:0c:00.0: amdgpu: 	 RW: 0x0
[ 2636.013490] Freezing user space processes completed (elapsed 0.002 seconds)
[ 2636.013493] OOM killer disabled.
[ 2636.013494] Freezing remaining freezable tasks
[ 2636.015062] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 2636.015065] printk: Suspending console(s) (use no_console_suspend to debug)
[ 2637.348420] [TTM] Buffer eviction failed
[ 2637.348425] [drm] evicting device resources failed
[ 2637.348428] amdgpu 0000:0c:00.0: PM: device_prepare(): pci_pm_prepare+0x0/0x80 returns -12
[ 2637.348435] amdgpu 0000:0c:00.0: PM: not prepared for power transition: code -12
[ 2637.348437] PM: start suspend of devices aborted after 1333.347 msecs
[ 2637.348440] PM: Some devices failed to suspend, or early wake event detected
[ 2637.348441] PM: resume of devices complete after 0.001 msecs
[ 2637.348994] OOM killer enabled.
[ 2637.348995] Restarting tasks ... done.
[ 2637.350326] random: crng reseeded on system resumption
[ 2637.370362] PM: suspend exit
[ 2637.370385] thermal thermal_zone3: failed to read out thermal zone (-61)
[ 2638.119739] amdgpu 0000:0c:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[ 2638.119752] amdgpu 0000:0c:00.0: amdgpu:   in page starting at address 0x0000000000201000 from client 10
[ 2638.119759] amdgpu 0000:0c:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B32
[ 2638.119764] amdgpu 0000:0c:00.0: amdgpu: 	 Faulty UTCL2 client ID: CPC (0x5)
[ 2638.119769] amdgpu 0000:0c:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[ 2638.119773] amdgpu 0000:0c:00.0: amdgpu: 	 WALKER_ERROR: 0x1
[ 2638.119777] amdgpu 0000:0c:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[ 2638.119781] amdgpu 0000:0c:00.0: amdgpu: 	 MAPPING_ERROR: 0x1
[ 2638.119786] amdgpu 0000:0c:00.0: amdgpu: 	 RW: 0x0

Answer 16 · 2024-10-29T19:37:28.000Z

@djpetti @lovely-error A bit of an update: we have identified this as a known issue caused by a lack of contiguous free memory in system RAM. We are investigating a workaround. Thanks!
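
To see how fragmented free memory is, the per-order free block counts in /proc/buddyinfo can be inspected. The Mem-Info dump earlier in this thread shows the Normal zone with free blocks only up to 32kB. The Python sketch below uses hypothetical helper names and assumes 4 kB pages; whether this matches the exact condition the driver hits is not confirmed here.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free block counts per order from /proc/buddyinfo text."""
    zones = {}
    for line in text.splitlines():
        parts = line.replace(",", " ").split()
        # Expected layout: Node <n> zone <name> <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        zones[(int(parts[1]), parts[3])] = [int(c) for c in parts[4:]]
    return zones


def free_kb_by_order(counts, page_kb=4):
    """Free memory in kB held in blocks of each order (block size = page_kb * 2**order)."""
    return [count * page_kb * (2 ** order) for order, count in enumerate(counts)]


# Counts taken from the "Node 0 Normal" line of the Mem-Info dump in this thread.
sample = "Node 0, zone   Normal  27078 8450 2354 879 0 0 0 0 0 0 0"
normal = parse_buddyinfo(sample)[(0, "Normal")]
print(free_kb_by_order(normal))
```

On a live system, pass the contents of /proc/buddyinfo instead of the sample line.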

Answer 17 · 2024-11-04T21:58:00.000Z

Hi @djpetti @lovely-error, sorry for the delayed response. After some investigation, one particular workaround seems promising.

On your host machine, choose a temporary directory (the commands below use ~/tmp) and run the following:

mkdir -p ~/tmp
cd ~/tmp
git clone https://git.dolansoft.org/lorenz/memreserver
cd memreserver
sudo apt install meson
# "meson setup build" is the current spelling of the older bare "meson build"
meson setup build
cd build
meson install
sudo cp memreserver.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl start memreserver.service
sudo systemctl enable memreserver.service

This hooks in a custom C program that reserves memory before the system goes to sleep. After completing the steps above, try putting your PC to sleep again and see whether that fixes the issue. Thanks!

Disclaimer: Please note that this program belongs to a third party (original link https://git.dolansoft.org/lorenz/memreserver) and serves solely as a temporary solution. It is not an official workaround from AMD. Please use it at your own discretion. Thanks for understanding!
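
To check whether suspend now succeeds, one option is to count suspend attempts and prepare failures in the dmesg output. The Python sketch below uses a hypothetical helper name and matches only the two message strings that appear in the logs posted in this thread.

```python
def summarize_suspends(dmesg_text):
    """Count suspend attempts and those aborted by a device prepare failure."""
    attempts = 0
    failed = 0
    for line in dmesg_text.splitlines():
        if "PM: suspend entry" in line:
            attempts += 1
        elif "PM: not prepared for power transition" in line:
            failed += 1
    return {"attempts": attempts, "failed": failed}


# Lines copied from the logs in this thread: one good suspend, one failed one.
sample = """\
[ 4166.691184] PM: suspend entry (deep)
[ 4168.863365] PM: suspend exit
[ 2631.213806] PM: suspend entry (deep)
[ 2635.857721] amdgpu 0000:0c:00.0: PM: not prepared for power transition: code -12
[ 2635.994402] PM: suspend exit
"""
print(summarize_suspends(sample))
```

Feed it the saved output of dmesg after a few suspend cycles; a `failed` count that stays at zero suggests the workaround is holding.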

Answer 18 · 2024-11-16T21:22:11.000Z

Hey, sorry for the late reply. I can confirm that this service fixes the sleeping issue. Thanks for your help!

Answer 19 · 2024-11-18T19:08:27.000Z

@djpetti Glad it works for you! Since @lovely-error hasn't responded, I will mark this issue as resolved for now and close it. @lovely-error, please feel free to post follow-ups after you have had a chance to try the workaround. Thanks!