NVME U.2 drive temperature not considered for fan duty cycle
ErichRitz opened this issue · 2 comments
Distribution (run cat /etc/os-release):
# cat /etc/os-release
NAME=Slackware
VERSION="15.0"
ID=slackware
VERSION_ID=15.0
PRETTY_NAME="Slackware 15.0 x86_64"
ANSI_COLOR="0;34"
CPE_NAME="cpe:/o:slackware:slackware_linux:15.0"
HOME_URL="http://slackware.com/"
SUPPORT_URL="http://www.linuxquestions.org/questions/slackware-14/"
BUG_REPORT_URL="http://www.linuxquestions.org/questions/slackware-14/"
# cat /sys/class/dmi/id/product_version
thelio-massive-b1
Related Application and/or Package Version (run apt policy $PACKAGE NAME):
# ls /var/lib/pkgtools/packages/system76-power*
/var/lib/pkgtools/packages/system76-power-1.1.24_c504ff6-x86_64-4_SBo
I'm using this patch as well, as my computer is not stable without it: #321
Issue/Bug Description:
I happened to notice that my NVME U.2 drive is running really hot (80° C). According to https://superuser.com/questions/1592187/should-i-worry-about-high-ssd-temperature, 70° C should be the maximum operating temperature.
Here is the current smartctl output (note the FAILED status due to temperature):
# smartctl -a /dev/nvme3n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.4.9-etr] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Micron_9200_MTFDHAL7T6TCT
Serial Number: 18281E59F6D6
Firmware Version: 101008P0
PCI Vendor/Subsystem ID: 0x1344
IEEE OUI Identifier: 0x00e0cf
Total NVM Capacity: 7,681,501,126,656 [7.68 TB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 7,681,501,126,656 [7.68 TB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 75a000 d6f6591e01
Local Time is: Wed Aug 9 10:02:51 2023 CDT
Firmware Updates (0x07): 3 Slots, Slot 1 R/O
Optional Admin Commands (0x000e): Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x0014): DS_Mngmt Sav/Sel_Feat
Log Page Attributes (0x02): Cmd_Eff_Lg
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 75 Celsius
Critical Comp. Temp. Threshold: 80 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W - - 0 0 0 0 100 100
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 0
2 - 512 0 2
3 - 4096 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- temperature is above or below threshold
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x02
Temperature: 79 Celsius
Available Spare: 99%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 658,397,004 [337 TB]
Data Units Written: 234,496,862 [120 TB]
Host Read Commands: 6,196,234,824
Host Write Commands: 4,012,847,534
Controller Busy Time: 38,703
Power Cycles: 638
Power On Hours: 28,409
Unsafe Shutdowns: 118
Media and Data Integrity Errors: 617,438,105
Error Information Log Entries: 617,438,548
Warning Comp. Temperature Time: 11125
Critical Comp. Temperature Time: 896
Temperature Sensor 1: 87 Celsius
Temperature Sensor 2: 79 Celsius
Temperature Sensor 3: 72 Celsius
Temperature Sensor 4: 72 Celsius
Error Information (NVMe Log 0x01, 16 of 63 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 617289268 0 0x0000 0x1001 0x028 - 0 -
1 617289267 0 0x0000 0x1001 0x028 - 0 -
2 617289266 0 0x0000 0x1001 0x028 - 0 -
3 617289265 0 0x0000 0x1001 0x028 - 0 -
4 617289264 0 0x0000 0x1001 0x028 - 0 -
5 617289263 0 0x0000 0x1001 0x028 - 0 -
6 617289262 0 0x0000 0x1001 0x028 - 0 -
7 617289261 0 0x0000 0x1001 0x028 - 0 -
8 617289260 0 0x0000 0x1001 0x028 - 0 -
9 617289259 0 0x0000 0x1001 0x028 - 0 -
10 617289258 0 0x0000 0x1001 0x028 - 0 -
11 617289257 0 0x0000 0x1001 0x028 - 0 -
12 617289256 0 0x0000 0x1001 0x028 - 0 -
13 617289255 0 0x0000 0x1001 0x028 - 0 -
14 617289254 0 0x0000 0x1001 0x028 - 0 -
15 617289253 0 0x0000 0x1001 0x028 - 0 -
... (47 entries not read)
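The output above reports a composite temperature of 79° C against a warning threshold of 75° C. As a rough sketch of how a script could make the same comparison, the function below parses that text. It assumes the line layout printed by smartctl 7.2 as shown above; the function name nvme_temp_status is made up for illustration.

```shell
#!/bin/bash
# Sketch only: parse `smartctl -a` text for an NVMe drive from stdin.
# Prints "<temperature> <warning threshold> <ok|over>".
# nvme_temp_status is a hypothetical helper name, not part of smartmontools.
nvme_temp_status() {
    awk -F': *' '
        # Threshold line, e.g. "Warning Comp. Temp. Threshold: 75 Celsius"
        /^Warning +Comp\. Temp\. Threshold:/ { warn = $2 + 0 }
        # Composite temperature; "Temperature Sensor N:" lines do not match
        /^Temperature:/                      { temp = $2 + 0 }
        END {
            if (temp == "" || warn == "") exit 1
            print temp, warn, (temp >= warn ? "over" : "ok")
        }'
}
```

Usage would be something like `smartctl -a /dev/nvme3n1 | nvme_temp_status`; for the output above this prints `79 75 over`.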
Steps to reproduce (if you know):
Run some heavy I/O tasks and watch the drive temperature increase. As long as the CPU load isn't too high, the fan speed will NOT increase.
Expected behavior:
Case fan speed should increase when the temperature of the U.2 drive (and/or of the 2.5" HDDs!) increases.
Other Notes:
Here are some pictures generated from my logs (I have a custom script to record temperatures every 5 seconds):
The NVME drive labeled "Samsung" is the boot drive (stick on the motherboard). The drive labeled "Micron" is the 2.5" U.2 drive. I wish I were recording my 2.5" HDDs, but I'm not. (Daily cron jobs run at 4:40 AM every day and restart the monitor script (and other cron jobs), hence the spike in temps and the sudden end of the log.)
Note this is the CPU fan command, not the case fan command. But all fans are set to the same duty cycle so it's a fair proxy.
Note the 2 sets of FAN RPMs. The higher one is the CPU fan. The lower one is the 2 case fans.
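For anyone who wants to log similar data: on kernels built with NVMe hwmon support, each NVMe drive shows up under /sys/class/hwmon with the name "nvme" and a temp1_input file in millidegrees Celsius. The sketch below is not my actual logging script, just a minimal illustration of reading those files; nvme_temps and HWMON_ROOT are names chosen for this example.

```shell
#!/bin/bash
# Sketch only: print "<hwmon dir> <temp in deg C>" for every NVMe hwmon device.
# Assumes the kernel exposes NVMe drives via hwmon (name file contains "nvme").
# HWMON_ROOT is a hypothetical override so the function can run on a fake tree.
HWMON_ROOT=${HWMON_ROOT:-/sys/class/hwmon}

nvme_temps() {
    local dir name milli
    for dir in "$HWMON_ROOT"/hwmon*; do
        [ -r "$dir/name" ] || continue
        name=$(cat "$dir/name")
        [ "$name" = "nvme" ] || continue
        [ -r "$dir/temp1_input" ] || continue
        milli=$(cat "$dir/temp1_input")          # millidegrees Celsius
        echo "$(basename "$dir") $(( milli / 1000 ))"
    done
}
```

Calling `nvme_temps` in a loop with `sleep 5` between calls would give one sample per drive every 5 seconds.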
In my opinion the temperature of the U.2 drive (and of the 2.5" HDDs) should also be considered when calculating the case fan speed. Right now the logic is to use the CPU temperature (and NVIDIA temperature, if available) to set both the CPU fans and case fans.
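To make the proposal concrete, here is a rough sketch of the idea, not the actual system76-power logic: compute a duty cycle per component from its own temperature range and drive the fans with the highest result. All thresholds, the minimum duty value, and the function names are assumptions picked for illustration.

```shell
#!/bin/bash
# Sketch only: the thresholds below are illustrative assumptions,
# not values taken from system76-power.

# max_of V1 V2 ... : print the largest integer argument.
max_of() {
    local max="$1" value
    for value in "$@"; do
        if [ "$value" -gt "$max" ]; then max="$value"; fi
    done
    echo "$max"
}

# duty_for_temp TEMP LOW HIGH : map a temperature (deg C) to a PWM value,
# linear from an assumed floor of 77 (about 30%) at LOW to 255 at HIGH.
duty_for_temp() {
    local temp="$1" low="$2" high="$3" min_duty=77
    if [ "$temp" -le "$low" ]; then
        echo "$min_duty"
    elif [ "$temp" -ge "$high" ]; then
        echo 255
    else
        echo $(( min_duty + (255 - min_duty) * (temp - low) / (high - low) ))
    fi
}

# case_fan_duty CPU_TEMP DRIVE_TEMP : each component gets its own range,
# and the hottest one (relative to its range) decides the duty cycle.
case_fan_duty() {
    local cpu_duty drive_duty
    cpu_duty=$(duty_for_temp "$1" 50 90)     # assumed CPU range
    drive_duty=$(duty_for_temp "$2" 40 70)   # assumed drive range
    max_of "$cpu_duty" "$drive_duty"
}
```

With these assumed ranges, `case_fan_duty 45 55` prints 166: the CPU is idle, but the drive at 55° C still raises the fans to roughly 65%.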
My plan is to write a patch to do so, as right now my U.2 drive is slowing to a crawl and btrfs is throwing errors in syslog due to the high temps.
For now I wrote a script to force the 2 case fans (label INTF in /sys/class/hwmon) to max speed (trying to override what system76-power is doing):
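#!/bin/bash
# hwmon devices that expose the two case fans (label INTF)
FAN1=hwmon5
FAN2=hwmon6

# Force both case fans to full duty cycle (255 = 100%)
set_fan_speed() {
    echo 255 > /sys/class/hwmon/$FAN1/pwm2
    echo 255 > /sys/class/hwmon/$FAN2/pwm2
}

# Rewrite the value every 0.25 s to override what system76-power sets
while true; do
    set_fan_speed
    sleep 0.25s
done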
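One caveat with the script above: it hard-codes hwmon5 and hwmon6, and hwmon numbering can change between boots or kernel versions. A possible variation is to look the fans up by label instead. This is only a sketch; it assumes the INTF label lives in a fanN_label file and that the matching control file is pwmN with the same index, as in the standard hwmon naming. pwm_paths_for_label and HWMON_ROOT are made-up names.

```shell
#!/bin/bash
# Sketch only: find PWM files by fan label instead of hard-coding hwmonN.
# Assumption: fanN_label pairs with pwmN in the same hwmon directory.
# HWMON_ROOT is a hypothetical override so the function can run on a fake tree.
HWMON_ROOT=${HWMON_ROOT:-/sys/class/hwmon}

# pwm_paths_for_label LABEL : print the pwmN path for every fan with that label.
pwm_paths_for_label() {
    local want="$1" label_file dir index
    for label_file in "$HWMON_ROOT"/hwmon*/fan*_label; do
        [ -r "$label_file" ] || continue
        [ "$(cat "$label_file")" = "$want" ] || continue
        dir=$(dirname "$label_file")
        index=$(basename "$label_file" _label)   # e.g. "fan2"
        index=${index#fan}                        # e.g. "2"
        if [ -e "$dir/pwm$index" ]; then
            echo "$dir/pwm$index"
        fi
    done
}
```

The loop body of the override script could then become `for p in $(pwm_paths_for_label INTF); do echo 255 > "$p"; done`.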
The override script does seem to help (you can see I started it at about 10:15), but the drive is still running hot, though it is slowly coming down:
Update:
I said earlier maybe the case fans and CPU fans should be set independently. I don't think so anymore. All of them on MAX still aren't enough to cool the U.2 drives.
(Actually, just yesterday I bought and installed 2 more U.2 drives, labeled "Intel1" and "Intel2" in the NVME Temperature plot below. I didn't include them in the plots above because they weren't installed until late in the day on Aug 8th and I didn't want to add confusion. I'm including them now for completeness.)
You can see just before 11:00 AM I modified my fan override script to also set the CPU fan to MAX:
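#!/bin/bash
# hwmon devices that expose the two case fans (label INTF)
FAN1=hwmon5
FAN2=hwmon6

# Force the case fans (pwm2) and the CPU fan (pwm1) to full duty cycle
set_fan_speed() {
    echo 255 > /sys/class/hwmon/$FAN1/pwm2
    echo 255 > /sys/class/hwmon/$FAN2/pwm2
    echo 255 > /sys/class/hwmon/$FAN1/pwm1
}

# Rewrite the values every 0.25 s
while true; do
    set_fan_speed
    sleep 0.25s
done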
Looks like the temps on the SSDs are stabilizing... above 70° for 2 out of the 3 U.2 SSD drives. At least the GPU fan doesn't have to work as hard...
I figured out how to get the HDD temperature from smartctl. The lifetime max temperature for /dev/sda is 72° C! (Even within the last few days it has hit 71° C.) The maximum recommended temperature (according to smartctl) is 55° C:
# smartctl -l scttemp /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.4.9-etr] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
Device State: Active (0)
Current Temperature: 38 Celsius
Power Cycle Min/Max Temperature: 35/38 Celsius
Lifetime Min/Max Temperature: 17/72 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 3 minutes
Temperature Logging Interval: 59 minutes
Min/Max recommended Temperature: 14/55 Celsius
Min/Max Temperature Limit: 10/60 Celsius
Temperature History Size (Index): 128 (46)
Index Estimated Time Temperature Celsius
47 2023-08-04 07:53 48 *****************************
48 2023-08-04 08:52 ? -
49 2023-08-04 09:51 33 **************
50 2023-08-04 10:50 ? -
51 2023-08-04 11:49 39 ********************
52 2023-08-04 12:48 ? -
53 2023-08-04 13:47 42 ***********************
54 2023-08-04 14:46 50 *******************************
55 2023-08-04 15:45 51 ********************************
56 2023-08-04 16:44 52 *********************************
57 2023-08-04 17:43 51 ********************************
58 2023-08-04 18:42 51 ********************************
59 2023-08-04 19:41 ? -
60 2023-08-04 20:40 47 ****************************
61 2023-08-04 21:39 49 ******************************
62 2023-08-04 22:38 50 *******************************
63 2023-08-04 23:37 50 *******************************
64 2023-08-05 00:36 48 *****************************
65 2023-08-05 01:35 46 ***************************
66 2023-08-05 02:34 49 ******************************
67 2023-08-05 03:33 51 ********************************
68 2023-08-05 04:32 52 *********************************
69 2023-08-05 05:31 49 ******************************
70 2023-08-05 06:30 48 *****************************
71 2023-08-05 07:29 48 *****************************
72 2023-08-05 08:28 49 ******************************
73 2023-08-05 09:27 50 *******************************
74 2023-08-05 10:26 49 ******************************
75 2023-08-05 11:25 48 *****************************
76 2023-08-05 12:24 48 *****************************
77 2023-08-05 13:23 48 *****************************
78 2023-08-05 14:22 49 ******************************
79 2023-08-05 15:21 49 ******************************
80 2023-08-05 16:20 51 ********************************
81 2023-08-05 17:19 50 *******************************
82 2023-08-05 18:18 50 *******************************
83 2023-08-05 19:17 50 *******************************
84 2023-08-05 20:16 51 ********************************
85 2023-08-05 21:15 51 ********************************
86 2023-08-05 22:14 51 ********************************
87 2023-08-05 23:13 52 *********************************
88 2023-08-06 00:12 52 *********************************
89 2023-08-06 01:11 51 ********************************
90 2023-08-06 02:10 51 ********************************
91 2023-08-06 03:09 51 ********************************
92 2023-08-06 04:08 ? -
93 2023-08-06 05:07 50 *******************************
94 2023-08-06 06:06 62 ***************************************+
95 2023-08-06 07:05 65 ***************************************+
96 2023-08-06 08:04 66 ***************************************+
97 2023-08-06 09:03 63 ***************************************+
... ..( 3 skipped). .. ***************************************+
101 2023-08-06 12:59 63 ***************************************+
102 2023-08-06 13:58 62 ***************************************+
103 2023-08-06 14:57 63 ***************************************+
104 2023-08-06 15:56 62 ***************************************+
105 2023-08-06 16:55 58 ***************************************
106 2023-08-06 17:54 59 ****************************************
107 2023-08-06 18:53 50 *******************************
108 2023-08-06 19:52 ? -
109 2023-08-06 20:51 44 *************************
110 2023-08-06 21:50 47 ****************************
111 2023-08-06 22:49 47 ****************************
112 2023-08-06 23:48 ? -
113 2023-08-07 00:47 24 *****
114 2023-08-07 01:46 ? -
115 2023-08-07 02:45 38 *******************
116 2023-08-07 03:44 46 ***************************
117 2023-08-07 04:43 46 ***************************
118 2023-08-07 05:42 54 ***********************************
119 2023-08-07 06:41 54 ***********************************
120 2023-08-07 07:40 57 **************************************
121 2023-08-07 08:39 ? -
122 2023-08-07 09:38 48 *****************************
123 2023-08-07 10:37 ? -
124 2023-08-07 11:36 32 *************
125 2023-08-07 12:35 51 ********************************
126 2023-08-07 13:34 65 ***************************************+
127 2023-08-07 14:33 66 ***************************************+
0 2023-08-07 15:32 68 ***************************************+
1 2023-08-07 16:31 66 ***************************************+
2 2023-08-07 17:30 51 ********************************
3 2023-08-07 18:29 51 ********************************
4 2023-08-07 19:28 69 ***************************************+
5 2023-08-07 20:27 71 ***************************************+
6 2023-08-07 21:26 71 ***************************************+
7 2023-08-07 22:25 66 ***************************************+
8 2023-08-07 23:24 64 ***************************************+
9 2023-08-08 00:23 64 ***************************************+
10 2023-08-08 01:22 63 ***************************************+
11 2023-08-08 02:21 62 ***************************************+
12 2023-08-08 03:20 62 ***************************************+
13 2023-08-08 04:19 62 ***************************************+
14 2023-08-08 05:18 58 ***************************************
15 2023-08-08 06:17 50 *******************************
16 2023-08-08 07:16 49 ******************************
17 2023-08-08 08:15 46 ***************************
18 2023-08-08 09:14 46 ***************************
19 2023-08-08 10:13 45 **************************
20 2023-08-08 11:12 45 **************************
21 2023-08-08 12:11 47 ****************************
22 2023-08-08 13:10 48 *****************************
23 2023-08-08 14:09 51 ********************************
24 2023-08-08 15:08 50 *******************************
25 2023-08-08 16:07 48 *****************************
26 2023-08-08 17:06 49 ******************************
27 2023-08-08 18:05 51 ********************************
28 2023-08-08 19:04 ? -
29 2023-08-08 20:03 42 ***********************
30 2023-08-08 21:02 46 ***************************
31 2023-08-08 22:01 46 ***************************
32 2023-08-08 23:00 46 ***************************
33 2023-08-08 23:59 47 ****************************
34 2023-08-09 00:58 51 ********************************
35 2023-08-09 01:57 54 ***********************************
36 2023-08-09 02:56 54 ***********************************
37 2023-08-09 03:55 52 *********************************
38 2023-08-09 04:54 53 **********************************
39 2023-08-09 05:53 55 ************************************
40 2023-08-09 06:52 52 *********************************
41 2023-08-09 07:51 54 ***********************************
42 2023-08-09 08:50 54 ***********************************
43 2023-08-09 09:49 ? -
44 2023-08-09 10:48 41 **********************
45 2023-08-09 11:47 ? -
46 2023-08-09 12:46 35 ****************
/dev/sda is the "Seagate Barracuda 2.5 5400" drive (ST5000LM000-2AN170) that came with the machine.
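To put a number on how often the HDD runs above its recommended maximum, the history table can be summarized with a short awk filter. This is a sketch that assumes the `smartctl -l scttemp` layout shown above, with the "Min/Max recommended Temperature" line appearing before the history rows; sct_over_limit is a made-up name.

```shell
#!/bin/bash
# Sketch only: summarize `smartctl -l scttemp` text from stdin.
# Prints "<samples over limit> <samples with a reading> <recommended max>".
# sct_over_limit is a hypothetical helper name.
sct_over_limit() {
    awk '
        # e.g. "Min/Max recommended Temperature: 14/55 Celsius" -> limit = 55
        /^Min\/Max recommended Temperature:/ { split($4, a, "/"); limit = a[2] + 0 }
        # History rows: index, date, time, temperature; "?" rows are skipped
        $2 ~ /^[0-9]+-[0-9]+-[0-9]+$/ && $4 ~ /^[0-9]+$/ {
            total++
            if ($4 + 0 > limit) over++
        }
        END { print over + 0, total + 0, limit + 0 }'
}
```

Usage would be something like `smartctl -l scttemp /dev/sda | sct_over_limit`. Rows collapsed by smartctl into a "skipped" line are not counted, so the result is a lower bound.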