pop-os/system76-power

NVME U.2 drive temperature not considered for fan duty cycle

ErichRitz opened this issue · 2 comments

Distribution (run cat /etc/os-release):

 # cat /etc/os-release 
NAME=Slackware
VERSION="15.0"
ID=slackware
VERSION_ID=15.0
PRETTY_NAME="Slackware 15.0 x86_64"
ANSI_COLOR="0;34"
CPE_NAME="cpe:/o:slackware:slackware_linux:15.0"
HOME_URL="http://slackware.com/"
SUPPORT_URL="http://www.linuxquestions.org/questions/slackware-14/"
BUG_REPORT_URL="http://www.linuxquestions.org/questions/slackware-14/"

# cat /sys/class/dmi/id/product_version 
thelio-massive-b1

Related Application and/or Package Version (run apt policy $PACKAGE NAME):

# ls /var/lib/pkgtools/packages/system76-power*
/var/lib/pkgtools/packages/system76-power-1.1.24_c504ff6-x86_64-4_SBo

I'm using this patch as well, as my computer is not stable without it: #321

Issue/Bug Description:
I happened to notice that my NVME U.2 drive is running really hot (80° C). According to https://superuser.com/questions/1592187/should-i-worry-about-high-ssd-temperature 70° C should be the maximum operating temperature.

Here is the current smartctl output (note the FAILED status due to temperature):

# smartctl -a /dev/nvme3n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.4.9-etr] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Micron_9200_MTFDHAL7T6TCT
Serial Number:                      18281E59F6D6
Firmware Version:                   101008P0
PCI Vendor/Subsystem ID:            0x1344
IEEE OUI Identifier:                0x00e0cf
Total NVM Capacity:                 7,681,501,126,656 [7.68 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          7,681,501,126,656 [7.68 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            75a000 d6f6591e01
Local Time is:                      Wed Aug  9 10:02:51 2023 CDT
Firmware Updates (0x07):            3 Slots, Slot 1 R/O
Optional Admin Commands (0x000e):   Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x0014):     DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     75 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0      100     100

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         0
 2 -     512       0         2
 3 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- temperature is above or below threshold

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x02
Temperature:                        79 Celsius
Available Spare:                    99%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    658,397,004 [337 TB]
Data Units Written:                 234,496,862 [120 TB]
Host Read Commands:                 6,196,234,824
Host Write Commands:                4,012,847,534
Controller Busy Time:               38,703
Power Cycles:                       638
Power On Hours:                     28,409
Unsafe Shutdowns:                   118
Media and Data Integrity Errors:    617,438,105
Error Information Log Entries:      617,438,548
Warning  Comp. Temperature Time:    11125
Critical Comp. Temperature Time:    896
Temperature Sensor 1:               87 Celsius
Temperature Sensor 2:               79 Celsius
Temperature Sensor 3:               72 Celsius
Temperature Sensor 4:               72 Celsius

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0  617289268     0  0x0000  0x1001  0x028            -     0     -
  1  617289267     0  0x0000  0x1001  0x028            -     0     -
  2  617289266     0  0x0000  0x1001  0x028            -     0     -
  3  617289265     0  0x0000  0x1001  0x028            -     0     -
  4  617289264     0  0x0000  0x1001  0x028            -     0     -
  5  617289263     0  0x0000  0x1001  0x028            -     0     -
  6  617289262     0  0x0000  0x1001  0x028            -     0     -
  7  617289261     0  0x0000  0x1001  0x028            -     0     -
  8  617289260     0  0x0000  0x1001  0x028            -     0     -
  9  617289259     0  0x0000  0x1001  0x028            -     0     -
 10  617289258     0  0x0000  0x1001  0x028            -     0     -
 11  617289257     0  0x0000  0x1001  0x028            -     0     -
 12  617289256     0  0x0000  0x1001  0x028            -     0     -
 13  617289255     0  0x0000  0x1001  0x028            -     0     -
 14  617289254     0  0x0000  0x1001  0x028            -     0     -
 15  617289253     0  0x0000  0x1001  0x028            -     0     -
... (47 entries not read)
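
The FAILED status above follows from the "Critical Warning: 0x02" field. As a rough illustration, the byte can be decoded bit by bit. The sketch below assumes the bit meanings of the NVMe SMART / Health Information log page; the function name `decode_critical_warning` is made up for this example.

```shell
#!/bin/bash
# Decode the NVMe SMART "Critical Warning" byte into readable flags.
# Bit positions follow the NVMe SMART / Health Information log page.
decode_critical_warning() {
	local value=$(( $1 ))   # accepts decimal or 0x-prefixed hex
	local -a names=(
		"available spare below threshold"
		"temperature above or below threshold"
		"NVM subsystem reliability degraded"
		"media placed in read-only mode"
		"volatile memory backup device failed"
	)
	local bit
	for bit in "${!names[@]}"; do
		# Report every bit that is set in the warning byte
		if (( value & (1 << bit) )); then
			echo "bit $bit: ${names[$bit]}"
		fi
	done
}

decode_critical_warning 0x02   # prints: bit 1: temperature above or below threshold
```

Only bit 1 is set here, which matches the "temperature is above or below threshold" line in the smartctl output.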

Steps to reproduce (if you know):
Do some heavy I/O tasks and watch the drive temperature increase. As long as the CPU load isn't too high the fan speed will NOT increase.
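
One possible way to watch the drive temperature while reproducing this, sketched under the assumption that the kernel was built with NVMe hwmon support (so each controller appears under /sys/class/hwmon with the name "nvme" and temp*_input files in millidegrees Celsius). The function name `print_nvme_temps` and its optional root argument are hypothetical and exist only for this example.

```shell
#!/bin/bash
# Print every temperature sensor exposed by NVMe hwmon devices.
# The optional first argument overrides the sysfs root (useful for testing).
print_nvme_temps() {
	local root=${1:-/sys/class/hwmon}
	local dir input label
	for dir in "$root"/hwmon*; do
		[ -r "$dir/name" ] || continue
		# Only look at hwmon devices registered by the nvme driver
		[ "$(cat "$dir/name")" = "nvme" ] || continue
		for input in "$dir"/temp*_input; do
			[ -r "$input" ] || continue
			# Fall back to the file stem (e.g. temp1) when no label exists
			label="$(basename "${input%_input}")"
			if [ -r "${input%_input}_label" ]; then
				label="$(cat "${input%_input}_label")"
			fi
			# temp*_input is in millidegrees; integer division truncates
			echo "$(basename "$dir"): $label = $(( $(cat "$input") / 1000 )) C"
		done
	done
}
```

Running `while true; do print_nvme_temps; sleep 5; done` in a second terminal during the I/O load would show whether the temperature climbs while the fan duty stays flat.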

Expected behavior:
Case fan speed should increase when the temperature of the U.2 drive (and/or the 2.5" HDDs!) increases.

Other Notes:

Here are some pictures generated from my logs (I have a custom script to record temperatures every 5 seconds):

The NVME drive labeled "Samsung" is the boot drive (stick on the motherboard). The drive labeled "Micron" is the 2.5" U.2 drive. I wish I were recording my 2.5" HDDs, but I'm not. (Daily cron jobs run at 4:40 AM every day, which restarts the monitor script (and other cron jobs); hence the spike in temps and the sudden end of the log.)

sensors_nvme_20230808

Note this is the CPU fan command, not the case fan command. But all fans are set to the same duty cycle so it's a fair proxy.

sensors_fan_cmd_20230808

Note the 2 sets of FAN RPMs. The higher one is the CPU fan. The lower one is the 2 case fans.

sensors_fan_cpu_20230808

In my opinion, the temperatures of the U.2 drive(s) and the 2.5" HDDs should also be considered when calculating the case fan speed. Right now the logic uses only the CPU temperature (and NVIDIA temperature, if available) to set both the CPU fans and the case fans.

My plan is to write a patch to do so, as right now my U.2 drive is slowing to a crawl and btrfs is throwing errors in syslog due to the high temps.
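
To make the idea concrete, here is a minimal sketch of one way drive temperature could feed into the case fan duty: compute a demand from each source and take the larger one. This is not how system76-power works today and is not the planned patch. The function names and the curve endpoints (40/90 for the CPU, 40/70 for the drive) are illustrative assumptions only.

```shell
#!/bin/bash
# Map a temperature (whole degrees C) onto a PWM value in 0..255 by linear
# interpolation between a "no demand" point and a "full speed" point.
temp_to_pwm() {
	local temp=$1 low=$2 high=$3
	if (( temp <= low )); then
		echo 0
	elif (( temp >= high )); then
		echo 255
	else
		echo $(( (temp - low) * 255 / (high - low) ))
	fi
}

# Case fan duty = the larger of the CPU-driven and drive-driven demands.
# The endpoints below are hypothetical, not tuned or vendor-provided values.
case_fan_pwm() {
	local cpu_temp=$1 drive_temp=$2
	local cpu_pwm drive_pwm
	cpu_pwm=$(temp_to_pwm "$cpu_temp" 40 90)
	drive_pwm=$(temp_to_pwm "$drive_temp" 40 70)
	echo $(( cpu_pwm > drive_pwm ? cpu_pwm : drive_pwm ))
}

case_fan_pwm 45 79   # cool CPU, hot drive: prints 255
```

Taking the maximum means a hot drive can raise the case fans even when the CPU is idle, while a hot CPU still behaves as it does now.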

For now I wrote a script to force the 2 case fans (label INTF in /sys/class/hwmon) to max speed (trying to override what system76-power is doing):

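#!/bin/bash
# Force both case fans to full speed, re-applying the value every 250 ms
# because system76-power keeps setting its own duty cycle.

# hwmon numbering can change between boots; adjust for your system.
FAN1=hwmon5
FAN2=hwmon6

set_fan_speed() {
	# pwm2 on each device drives a case fan; 255 is the maximum duty cycle
	echo 255 > "/sys/class/hwmon/$FAN1/pwm2"
	echo 255 > "/sys/class/hwmon/$FAN2/pwm2"
}

while true; do
	set_fan_speed
	sleep 0.25s
done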

The override script does seem to help (you can see I started it at about 10:15): the drive is still running hot, but its temperature is slowly coming down:

sensors_nvme_20230808b

sensors_fan_20230808b

Update:

I said earlier maybe the case fans and CPU fans should be set independently. I don't think so anymore. All of them on MAX still aren't enough to cool the U.2 drives.

(Actually, just yesterday I bought and installed 2 more U.2 drives, labeled "Intel1" and "Intel2" in the NVME Temperature plot below. I didn't include them in the plots above because they weren't installed until late in the day on Aug 8th and I didn't want to add confusion. I'm including them now for completeness.)

You can see just before 11:00 AM I modified my fan override script to also set the CPU fan to MAX:

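#!/bin/bash
# Force the case fans and the CPU fan to full speed, re-applying the value
# every 250 ms because system76-power keeps setting its own duty cycle.

# hwmon numbering can change between boots; adjust for your system.
FAN1=hwmon5
FAN2=hwmon6

set_fan_speed() {
	# pwm2 on each device drives a case fan; 255 is the maximum duty cycle
	echo 255 > "/sys/class/hwmon/$FAN1/pwm2"
	echo 255 > "/sys/class/hwmon/$FAN2/pwm2"
	# pwm1 on the first device is the CPU fan
	echo 255 > "/sys/class/hwmon/$FAN1/pwm1"
}

while true; do
	set_fan_speed
	sleep 0.25s
done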
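
The override script hardcodes hwmon5 and hwmon6, and those numbers are not guaranteed to stay the same across boots. A possible refinement is to look the devices up by label instead. The sketch below assumes the INTF label is exposed through a `*_label` file inside the hwmon directory; the exact file name is not stated above, so treat that as a guess. The helper name `find_hwmon_by_label` is hypothetical.

```shell
#!/bin/bash
# Print each hwmon directory that contains a *_label file whose content is
# exactly the given label. The optional second argument overrides the sysfs
# root (useful for testing).
find_hwmon_by_label() {
	local wanted=$1 root=${2:-/sys/class/hwmon}
	local file
	for file in "$root"/hwmon*/*_label; do
		[ -r "$file" ] || continue
		if [ "$(cat "$file")" = "$wanted" ]; then
			dirname "$file"
		fi
	done | sort -u   # one line per directory, even if several labels match
}

# Example use (as root), mirroring the pwm2 writes in the script above:
#   for dir in $(find_hwmon_by_label INTF); do echo 255 > "$dir/pwm2"; done
```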

Looks like the temps on the SSDs are stabilizing... above 70° C for 2 out of the 3 U.2 SSDs. At least the GPU fan doesn't have to work as hard...

sensors_fan_cmd_20230808c

sensors_fan_20230808c

sensors_nvme_20230808c

sensors_fan_cpu_20230808c

sensors_gpu_20230808c

I figured out how to get the HDD temperature from smartctl. The lifetime max temperature for /dev/sda is 72° C! (Even within the last few days it has hit 71° C.) The maximum recommended temperature (according to smartctl) is 55° C:

# smartctl -l scttemp /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.4.9-etr] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
Device State:                        Active (0)
Current Temperature:                    38 Celsius
Power Cycle Min/Max Temperature:     35/38 Celsius
Lifetime    Min/Max Temperature:     17/72 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         3 minutes
Temperature Logging Interval:        59 minutes
Min/Max recommended Temperature:     14/55 Celsius
Min/Max Temperature Limit:           10/60 Celsius
Temperature History Size (Index):    128 (46)

Index    Estimated Time   Temperature Celsius
  47    2023-08-04 07:53    48  *****************************
  48    2023-08-04 08:52     ?  -
  49    2023-08-04 09:51    33  **************
  50    2023-08-04 10:50     ?  -
  51    2023-08-04 11:49    39  ********************
  52    2023-08-04 12:48     ?  -
  53    2023-08-04 13:47    42  ***********************
  54    2023-08-04 14:46    50  *******************************
  55    2023-08-04 15:45    51  ********************************
  56    2023-08-04 16:44    52  *********************************
  57    2023-08-04 17:43    51  ********************************
  58    2023-08-04 18:42    51  ********************************
  59    2023-08-04 19:41     ?  -
  60    2023-08-04 20:40    47  ****************************
  61    2023-08-04 21:39    49  ******************************
  62    2023-08-04 22:38    50  *******************************
  63    2023-08-04 23:37    50  *******************************
  64    2023-08-05 00:36    48  *****************************
  65    2023-08-05 01:35    46  ***************************
  66    2023-08-05 02:34    49  ******************************
  67    2023-08-05 03:33    51  ********************************
  68    2023-08-05 04:32    52  *********************************
  69    2023-08-05 05:31    49  ******************************
  70    2023-08-05 06:30    48  *****************************
  71    2023-08-05 07:29    48  *****************************
  72    2023-08-05 08:28    49  ******************************
  73    2023-08-05 09:27    50  *******************************
  74    2023-08-05 10:26    49  ******************************
  75    2023-08-05 11:25    48  *****************************
  76    2023-08-05 12:24    48  *****************************
  77    2023-08-05 13:23    48  *****************************
  78    2023-08-05 14:22    49  ******************************
  79    2023-08-05 15:21    49  ******************************
  80    2023-08-05 16:20    51  ********************************
  81    2023-08-05 17:19    50  *******************************
  82    2023-08-05 18:18    50  *******************************
  83    2023-08-05 19:17    50  *******************************
  84    2023-08-05 20:16    51  ********************************
  85    2023-08-05 21:15    51  ********************************
  86    2023-08-05 22:14    51  ********************************
  87    2023-08-05 23:13    52  *********************************
  88    2023-08-06 00:12    52  *********************************
  89    2023-08-06 01:11    51  ********************************
  90    2023-08-06 02:10    51  ********************************
  91    2023-08-06 03:09    51  ********************************
  92    2023-08-06 04:08     ?  -
  93    2023-08-06 05:07    50  *******************************
  94    2023-08-06 06:06    62  ***************************************+
  95    2023-08-06 07:05    65  ***************************************+
  96    2023-08-06 08:04    66  ***************************************+
  97    2023-08-06 09:03    63  ***************************************+
 ...    ..(  3 skipped).    ..  ***************************************+
 101    2023-08-06 12:59    63  ***************************************+
 102    2023-08-06 13:58    62  ***************************************+
 103    2023-08-06 14:57    63  ***************************************+
 104    2023-08-06 15:56    62  ***************************************+
 105    2023-08-06 16:55    58  ***************************************
 106    2023-08-06 17:54    59  ****************************************
 107    2023-08-06 18:53    50  *******************************
 108    2023-08-06 19:52     ?  -
 109    2023-08-06 20:51    44  *************************
 110    2023-08-06 21:50    47  ****************************
 111    2023-08-06 22:49    47  ****************************
 112    2023-08-06 23:48     ?  -
 113    2023-08-07 00:47    24  *****
 114    2023-08-07 01:46     ?  -
 115    2023-08-07 02:45    38  *******************
 116    2023-08-07 03:44    46  ***************************
 117    2023-08-07 04:43    46  ***************************
 118    2023-08-07 05:42    54  ***********************************
 119    2023-08-07 06:41    54  ***********************************
 120    2023-08-07 07:40    57  **************************************
 121    2023-08-07 08:39     ?  -
 122    2023-08-07 09:38    48  *****************************
 123    2023-08-07 10:37     ?  -
 124    2023-08-07 11:36    32  *************
 125    2023-08-07 12:35    51  ********************************
 126    2023-08-07 13:34    65  ***************************************+
 127    2023-08-07 14:33    66  ***************************************+
   0    2023-08-07 15:32    68  ***************************************+
   1    2023-08-07 16:31    66  ***************************************+
   2    2023-08-07 17:30    51  ********************************
   3    2023-08-07 18:29    51  ********************************
   4    2023-08-07 19:28    69  ***************************************+
   5    2023-08-07 20:27    71  ***************************************+
   6    2023-08-07 21:26    71  ***************************************+
   7    2023-08-07 22:25    66  ***************************************+
   8    2023-08-07 23:24    64  ***************************************+
   9    2023-08-08 00:23    64  ***************************************+
  10    2023-08-08 01:22    63  ***************************************+
  11    2023-08-08 02:21    62  ***************************************+
  12    2023-08-08 03:20    62  ***************************************+
  13    2023-08-08 04:19    62  ***************************************+
  14    2023-08-08 05:18    58  ***************************************
  15    2023-08-08 06:17    50  *******************************
  16    2023-08-08 07:16    49  ******************************
  17    2023-08-08 08:15    46  ***************************
  18    2023-08-08 09:14    46  ***************************
  19    2023-08-08 10:13    45  **************************
  20    2023-08-08 11:12    45  **************************
  21    2023-08-08 12:11    47  ****************************
  22    2023-08-08 13:10    48  *****************************
  23    2023-08-08 14:09    51  ********************************
  24    2023-08-08 15:08    50  *******************************
  25    2023-08-08 16:07    48  *****************************
  26    2023-08-08 17:06    49  ******************************
  27    2023-08-08 18:05    51  ********************************
  28    2023-08-08 19:04     ?  -
  29    2023-08-08 20:03    42  ***********************
  30    2023-08-08 21:02    46  ***************************
  31    2023-08-08 22:01    46  ***************************
  32    2023-08-08 23:00    46  ***************************
  33    2023-08-08 23:59    47  ****************************
  34    2023-08-09 00:58    51  ********************************
  35    2023-08-09 01:57    54  ***********************************
  36    2023-08-09 02:56    54  ***********************************
  37    2023-08-09 03:55    52  *********************************
  38    2023-08-09 04:54    53  **********************************
  39    2023-08-09 05:53    55  ************************************
  40    2023-08-09 06:52    52  *********************************
  41    2023-08-09 07:51    54  ***********************************
  42    2023-08-09 08:50    54  ***********************************
  43    2023-08-09 09:49     ?  -
  44    2023-08-09 10:48    41  **********************
  45    2023-08-09 11:47     ?  -
  46    2023-08-09 12:46    35  ****************

/dev/sda is the "Seagate Barracuda 2.5 5400" drive (ST5000LM000-2AN170) that came with the machine.
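
For logging the HDD alongside the NVMe drives, the relevant numbers can be pulled out of the `smartctl -l scttemp` text shown above. This is a minimal sketch that assumes the output keeps the same line layout as in this report. The function names are made up for the example, and both read the smartctl output from standard input.

```shell
#!/bin/bash
# Extract the current temperature (degrees C) from `smartctl -l scttemp` text.
sct_current_temp() {
	awk '/^Current Temperature:/ { print $3; exit }'
}

# Extract the maximum recommended temperature, i.e. the second number in the
# "Min/Max recommended Temperature:  14/55 Celsius" line.
sct_max_recommended() {
	awk '/^Min\/Max recommended Temperature:/ { split($4, r, "/"); print r[2]; exit }'
}

# Example use (as root):
#   smartctl -l scttemp /dev/sda | sct_current_temp
```

Comparing the two values in a monitoring script would flag when the drive exceeds its recommended range, as it did repeatedly in the history above.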