Linux openSeaChest reports drive in `standby_z` state regardless of the actual state.

Question

Closed this issue a year ago · 13 comments

Issue opened as a request to split off #113 (comment) from the original issue.

Attaching output of:

# openSeaChest_PassthroughTest -d /dev/sg2 --runPTTest --ptDriveHint ata --ptTypeHint sat > passthroughTest.txt

passthroughTest.txt

Note: some issues may also be caused by a misbehaving USB bridge. For example, the drive doesn't spin down automatically, and when it is forced to spin down using --spinDown and --transitionPower standby_z, it stays powered down for ~15 minutes before spinning up again.

Answer 1 · 2023-11-13T21:54:47.000Z

@mplzik,
Thanks for sharing that output!
It looks like there should be a way to get the command completion based on what I see in the results.

I'll work on adding this into our code and push it as soon as I can.
Once I get it in and the CI builds it, I will share a build for you to test out and let me know if it seems to be working properly.

Answer 2 · 2023-11-14T07:28:31.000Z

Huge thanks. Also, since I'm not familiar with this problem domain: is this something that should be implemented as some kind of quirk in the Linux kernel (and should a kernel bug be filed as well)?

Also, out of curiosity, since I noticed this issue while trying to power the disk down when it is not in use: is there a chance that the power management parameters are not being configured properly, preventing the drive from staying in its deepest sleep state for longer, or is it more likely that the enclosure itself emits commands that wake the drive from its deepest sleep levels?

Answer 3 · 2023-11-14T18:23:02.000Z

is this something that should be implemented as some kind of quirk in the Linux kernel (and should a kernel bug be filed as well)?

Not necessarily. This software (like hdparm and smartmontools) attempts to talk through the adapter to the drive itself, rather than letting the adapter talk to the drive, by using a special command called SAT ATA passthrough.
While there is a part of the Linux kernel that does some ATA passthrough, for the most part it just uses standard SCSI commands to read, write, flush, and identify the drive. ATA passthrough is better thought of as an optional feature rather than a requirement. It is super helpful for diagnostics and data collection, among other things, though.
In an ideal world the adapter would be capable of all the translations defined in the SAT (SCSI to ATA Translation) specification, which would include the ability to configure power mode timers (at least from SAT-3 onward; I can't remember if earlier versions supported this).

There may be a part of the Linux kernel that issues SAT ATA passthrough commands, but you would need to provide the kernel developers with more detail about how to work with this adapter, which they may or may not take up. I doubt the kernel would be using ATA passthrough to spin a drive down; it's more likely that the SCSI START STOP UNIT command is being used to enter standby, or that the drive's timer is being left to expire.
The changes I have made were to adjust how the command results are returned to the software.
This adapter accepts a bit in the command called "check condition", which is meant to return the ATA drive's command results, but it returns zeroes instead.
So the workaround in this case is to use SAT ATA passthrough with the protocol field set to "Return response information". A secondary workaround is also needed to "ignore the value of the extend bit" in that response, because the adapter does not set it correctly.
With both of these changes in place, it should correct the problem with openSeaChest.
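
As a rough illustration of the two command forms described above, the sketch below builds ATA PASS-THROUGH (16) CDBs using the field layout from the SAT specification: a CHECK POWER MODE command with the check condition bit set, and a request that uses the "Return response information" protocol. The helper names are hypothetical and this is not openSeaChest's actual code; it only shows where the relevant bits sit in the command.

```python
ATA_PASS_THROUGH_16 = 0x85         # SCSI operation code for ATA PASS-THROUGH (16)
ATA_CHECK_POWER_MODE = 0xE5        # ATA command opcode
PROTO_NON_DATA = 0x3               # SAT protocol value: non-data command
PROTO_RETURN_RESPONSE_INFO = 0xF   # SAT protocol value: return response information


def ata_pt16_cdb(protocol, command=0, ck_cond=False, extend=False):
    """Build a 16-byte CDB for an ATA command that transfers no data."""
    cdb = bytearray(16)
    cdb[0] = ATA_PASS_THROUGH_16
    cdb[1] = ((protocol & 0xF) << 1) | int(extend)  # bits 4:1 protocol, bit 0 extend
    cdb[2] = 0x20 if ck_cond else 0x00              # bit 5 is the check condition bit
    cdb[14] = command                               # ATA command register
    return bytes(cdb)


def check_power_mode_cdb():
    """CHECK POWER MODE, asking the adapter to return the ATA registers."""
    return ata_pt16_cdb(PROTO_NON_DATA, ATA_CHECK_POWER_MODE, ck_cond=True)


def return_response_info_cdb():
    """Follow-up request for the register contents of the previous command."""
    return ata_pt16_cdb(PROTO_RETURN_RESPONSE_INFO)


if __name__ == "__main__":
    print(check_power_mode_cdb().hex(" "))
    print(return_response_info_cdb().hex(" "))
```

On an adapter that returns zeroes for the first form, the second form is the alternative route to the drive's response registers that the workaround relies on.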

The part of the log I asked for that shows all of this is at the end:

TURF:11
SCSI Hacks: RW6, RW10, RW16, NLP, NMSP, SUPSOP, REPALLOP, MXFER:1048576
ATA Hacks: SAT, A1, RS, RSTD, RSIE, TPSIU, CHKE, MPTXFER:130560

These are all short-hand ways to describe the workarounds necessary to get the maximum capability from the adapter.
I wrote this test after years of trial and error and manually debugging these issues.
It finds most issues and eliminates most manual debugging at this point, but every now and then I do come across a new adapter that this automated test just will not work correctly with.
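
For anyone collecting these results from several adapters, a minimal sketch of splitting the two "Hacks" summary lines into plain flags and numeric limits might look like this. The function name is hypothetical, and the meaning of each abbreviation is not defined in this thread, so the sketch only separates the tokens and does not interpret them.

```python
def parse_hacks(line):
    """Split a 'SCSI Hacks:' or 'ATA Hacks:' line into (label, flags, values)."""
    label, _, rest = line.partition(":")
    flags, values = [], {}
    for token in (t.strip() for t in rest.split(",")):
        if not token:
            continue
        name, sep, value = token.partition(":")
        if sep:
            values[name] = int(value)   # e.g. MPTXFER:130560
        else:
            flags.append(name)          # e.g. RS, CHKE
    return label.strip(), flags, values


if __name__ == "__main__":
    line = "ATA Hacks: SAT, A1, RS, RSTD, RSIE, TPSIU, CHKE, MPTXFER:130560"
    print(parse_hacks(line))
```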

Is there a chance that the power management parameters are not being configured properly, preventing the drive from staying in its deepest sleep state for longer, or is it more likely that the enclosure itself emits commands that wake the drive from its deepest sleep levels?

There are likely multiple factors that could be coming into play here.
If the ATA passthrough commands are going through, then the changes to the timers should be taking effect.
If these are EPC timers (I think so from what you had in the other issue), then the drive should be able to save them and keep following them. The pre-EPC changes to standby timers and APM are volatile, so the device does not retain them when it is power cycled.
I have seen and tested some adapters that query the drive from time to time, which causes it to spin back up in some cases.
Some parts of the OS kernel may also ping the drive from time to time. This can be for a variety of reasons, but a common one is to check SMART status. I don't think SMART status is the cause here, though, since most OSes generally stick to the SCSI command set and I didn't see a complete SMART translation in the log you shared. So the Linux kernel itself may try a passthrough request, but even doing this to check SMART is unlikely to spin up the drive in most cases.
The last thing that can happen is that, if the USB adapter (or another one on the same USB controller) is causing some kind of trouble, the OS could reset it. This reset gets passed to the adapter, and most likely once it completes the OS starts device discovery again. Discovery always seems to include reading the MBR/GPT partition table to see if the device has anything to mount. This read will definitely cause a drive to spin up.
In Linux, a reset like this would be logged in dmesg.
There could be other causes, such as daemons or other software checking the drive for a filesystem, that spin the drive up by trying to read it.
There is not enough information here to really know for sure what is causing it to spin back up right now.
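
The dmesg check mentioned above can be sketched as a small filter over kernel log text. This assumes the usual "reset ... USB device number N" wording that the Linux USB core prints; the exact text can vary between kernel versions, and the sample log below is made up for illustration rather than taken from the reporter's system.

```python
import re

# Assumed message shape: "usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd"
RESET_RE = re.compile(r"usb \S+: reset .*USB device number \d+")


def find_usb_resets(dmesg_text):
    """Return the kernel log lines that look like USB device resets."""
    return [line for line in dmesg_text.splitlines() if RESET_RE.search(line)]


# Illustrative sample only, not real output from this thread.
SAMPLE = """\
[  100.000000] usb 2-1: new SuperSpeed USB device number 2 using xhci_hcd
[  900.000000] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
[  901.000000] sd 2:0:0:0: [sdb] Attached SCSI disk
"""

if __name__ == "__main__":
    for line in find_usb_resets(SAMPLE):
        print(line)
```

To run it against a live system, the output of `dmesg` (which may require root) can be passed in as the text. An empty result suggests, but does not prove, that resets are not what is waking the drive.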

Answer 4 · 2023-11-14T23:58:25.000Z

@mplzik,

Can you test this attached build and let me know if it corrects the issue you were seeing with your adapter?
openSeaChest-release-Release-23.09-linux-x86_64-portable.tar.xz.zip

I had to zip it to attach it 🙄, so after unzipping you will have a tar.xz file to decompress, but this should be a portable build.
Test the --checkPowerMode option, and feel free to do additional testing of other things like SMART if you would like, to see if it reports what you would expect.
If something seems off, can you attach a verbose dump from the tool?
Example:

openSeaChest_Basics -d <handle> -i -v 4 > verboseIdent.txt

As long as the -v 4 is on the command line it will dump a verbose output of the raw commands and results so I can see if anything else looks off.

Answer 5 · 2023-11-15T07:29:11.000Z

@vonericsen this looks really good; the power state seems to change now:

# ./openSeaChest_PowerControl --checkPowerMode -d /dev/sg2
==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.4.0-6_1_1 X86_64
 Build Date: Nov 14 2023
 Today: Wed Nov 15 07:43:10 2023	User: root
==========================================================================================

/dev/sg2 - 004-2M2101 -  - 0125 - ATA
Device is in the PM1: Idle state and the device is in the Idle_c power condition

# find /mnt/disk >/dev/null
^C
# ./openSeaChest_PowerControl --checkPowerMode -d /dev/sg2
==========================================================================================
 openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
 Copyright (c) 2014-2023 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
 openSeaChest_PowerControl Version: 3.4.0-6_1_1 X86_64
 Build Date: Nov 14 2023
 Today: Wed Nov 15 07:44:10 2023	User: root
==========================================================================================

/dev/sg2 - 004-2M2101 -  - 0125 - ATA
Device is in the PM1: Idle state and the device is in the Idle_a power condition
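
For context on the states printed above: the ATA CHECK POWER MODE command reports the power condition as a code in the count field of its response. A minimal sketch of that mapping follows, using the values defined in the ATA command set (ACS) specifications. The table is partial and the function name is hypothetical. Note that a count of 00h decodes to Standby_z, which would be consistent with the original symptom if the adapter hands back all-zero results.

```python
# Partial table of CHECK POWER MODE count field values (per the ACS specifications).
POWER_MODES = {
    0x00: "PM2: Standby (Standby_z)",
    0x01: "PM2: Standby (Standby_y)",
    0x80: "PM1: Idle",
    0x81: "PM1: Idle (Idle_a)",
    0x82: "PM1: Idle (Idle_b)",
    0x83: "PM1: Idle (Idle_c)",
    0xFF: "PM0: Active or PM1: Idle",
}


def decode_power_mode(count):
    """Translate the low byte of the returned count field into a readable state."""
    code = count & 0xFF
    return POWER_MODES.get(code, f"unknown (0x{code:02X})")


if __name__ == "__main__":
    for value in (0x00, 0x81, 0x83, 0xFF):
        print(f"0x{value:02X}: {decode_power_mode(value)}")
```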

Huge thanks for fixing that! When looking at other things that might look off, I noticed that openSeaChest_PowerControl --SATInfo is not reporting back the model number, despite smartctl -a providing one; let me know if it makes sense to fix this and I'll be happy to do any kind of testing.

As for the power management issues, I don't see any kernel messages indicating a USB device reset, and since the problem also came up when the drive was not mounted and LVM was not set up, I'd suspect the enclosure is polling the drive in a way that prevents it from spinning down -- it's a multi-drive enclosure that supports some level of RAID, so I'd guess it might be doing some magic in JBOD mode as well.

Also, hats off for doing all this debugging using a test suite. Hardware is nasty stuff when it comes to various quirks, and a suite able to catch the majority of odd behavior is definitely an invaluable tool.

Answer 6 · 2023-11-15T18:25:30.000Z

@mplzik,
Glad to hear that helped!

Huge thanks for fixing that! When looking at other things that might look off, I noticed that openSeaChest_PowerControl --SATInfo is not reporting back the model number, despite smartctl -a providing one; let me know if it makes sense to fix this and I'll be happy to do any kind of testing.

Yeah, this should be fixed too. I might need to tweak a setting or two in what I did, due to a possible false positive in the passthrough test. It could be that the TPSIU workaround is not correct. That is the most likely cause, but I think the verbose output could help confirm it.
Can you run this and post the output here?

openSeaChest_PowerControl -d <handle> -i --SATInfo -v 4 > verboseInfo.txt

Answer 7 · 2023-11-15T18:40:32.000Z

Sure. I'm using the binaries you sent me:

# ./openSeaChest_PowerControl -d /dev/sg2 -i --SATInfo -v 4 > verboseInfo.txt

verboseInfo.txt

nit: w.r.t. my power management issue, looks like I had a service running that would periodically run smartctl -a on the drive, which made it spin up again. After disabling the service, it looks like the drive stays in standby_z until something accesses it.
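
For a periodic SMART check like the one described above, smartctl has a `-n standby` option that skips the query when the drive reports being in standby, so the poll itself does not wake the drive. A minimal sketch of building such a command is shown below. The function name is hypothetical, and whether the power mode is reported correctly to smartctl through this particular USB bridge is an assumption that would need testing.

```python
def smartctl_poll_cmd(device, device_type=None):
    """Build a smartctl command line that skips drives already in standby."""
    cmd = ["smartctl", "-n", "standby", "-a"]
    if device_type:
        cmd += ["-d", device_type]   # e.g. "sat" for SAT passthrough bridges
    cmd.append(device)
    return cmd


if __name__ == "__main__":
    print(" ".join(smartctl_poll_cmd("/dev/sg2", device_type="sat")))
```

The resulting list can be handed to `subprocess.run` from whatever service performs the polling.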

Answer 8 · 2023-11-15T19:43:57.000Z

Thanks for the log.
Some command is causing the adapter to stop responding.
The issue starts when the tool attempts to read the extended self-test results log (7h), for some odd reason.
The previous log was read without issues.

Here are two more things to try with that build I shared earlier to help troubleshoot before I make more changes. Unplug the adapter, then plug it back in between each of these runs since the end of the log is showing an error like it completely stopped responding.

openSeaChest_PowerControl -d <handle> -i --SATInfo -v 4 --forceATADMA > verboseInfoFDMA.txt

openSeaChest_PowerControl -d <handle> -i --SATInfo -v 4 --forceATAPIO > verboseInfoFPIO.txt

These force openSeaChest to use a different protocol in the command which may work around this or rule out some weird behavior I've seen on other adapters in the past.

Answer 9 · 2023-11-16T22:26:30.000Z

I've made one more tweak that may or may not help.
If you can give this a try and report the verboseInfo from the --SATInfo again, that would help me to mark this as done/resolved.
I don't like seeing the adapter stop responding like in your last log and want to get the tweaks just right to keep it functioning properly.

openSeaChest-release-Release-23.09-linux-x86_64-portable.tar.xz.zip

Answer 10 · 2023-11-17T15:13:11.000Z

Thanks a lot; with the change, I was able to read the model number correctly in the ATA section of --SATInfo output; attaching the output.

# ./openSeaChest_PowerControl -d /dev/sg2 --SATInfo >SATInfo.txt

SATInfo.txt

Answer 11 · 2023-11-17T16:00:10.000Z

Awesome! I'm glad that worked!
This output looks a lot better, more like what I would expect, and the change most likely resolved the adapter crash seen with the previous build.

I'll pull this into the release I'm working on currently.
Feel free to use that last build I shared with you for now since it includes all the fixes and features I've pulled in so far.
I'm hoping to get it wrapped up in a couple of weeks.
I'm not aware of any other major issues in it at this time.

If you run into any other issues, please report them and we'll do our best to resolve them!

Answer 12 · 2023-11-17T16:37:22.000Z

Definitely will report if something odd shows up; huge thanks for the help. :) Also, as for not getting responses from the disk -- it did not escalate to actually losing the connection to the disk itself (regular operations still worked), at least not in any way I would notice.

Answer 13 · 2023-12-07T17:55:54.000Z

Marking this as closed since this code is in the latest release (v23.12) and it has been merged to both master and develop branches.

If you have any other trouble, please reopen this issue or create a new one and we will do our best to resolve it!