1000001101000/Debian_on_Buffalo

LSCHLv2: ata driver is hard resetting the SATA-Link on boot

cYrAx157 opened this issue · 22 comments

hi there,
I have a linkstation live (ls-chlv2) and I installed debian bookworm using the debootstrap script.
When the device boots, I can hear that the harddisk is resetting a few times and after a few tries (sometimes it needs 20 or more retries) the boot succeeds.
In the dmesg, you can see that the ata driver is hard resetting the link.
The problem doesn't occur on older debian versions (e.g. squeeze).
The HDD is a 4TB WD-Red "WD40EFRX".

dmesg.log

thanks in advance
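
For anyone comparing their own log against the one attached, the symptom shows up as libata link messages. A minimal sketch for pulling them out of kernel log text; `sata_link_events` is a hypothetical helper name, and the pattern assumes the usual libata wording ("hard resetting link", "SATA link up/down"):

```shell
# Hypothetical helper: keep only libata link-reset and link-state lines.
# Reads kernel log text on stdin, so it works on live dmesg output
# or on a saved log file.
sata_link_events() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: (hard resetting link|SATA link (up|down))'
}
```

Usage would be `dmesg | sata_link_events` or `sata_link_events < dmesg.log`; piping the result through `grep -c 'hard resetting link'` gives a rough count of resets per boot.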

I have a theory, though I wouldn't have expected it to apply to this combination of device/drive.

Somewhere between Bullseye (5.10) and Bookworm (6.1) the EXT4 filesystem started enabling TRIM by default. This can cause problems for SATA controllers that don't support TRIM; the port expander on the TS-XEL is the main one I'm familiar with.

Normally you would only encounter this issue with SSDs, but it looks like some WD RED drives report TRIM support. I wouldn't think this device would have that problem since it doesn't have a port expander but I'm not certain it wouldn't.

To protect against this issue I added "nodiscard" to the mount parameters for the rootfs. It looks like you are using EXT4 for other partitions on the disk as well. Try adding "nodiscard" to the options for all EXT4 filesystems in your /etc/fstab and see if that helps.
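
A minimal sketch of that fstab change, assuming a standard six-column /etc/fstab. `add_nodiscard` is a hypothetical helper name, and the script only prints a modified copy rather than editing the file in place:

```shell
# Hypothetical helper: read fstab text on stdin and print a copy in which
# every ext4 entry has "nodiscard" appended to its mount options.
# Comment and blank lines pass through unchanged; entries that already
# carry the option are left alone. Rewritten lines come out single-spaced.
add_nodiscard() {
    awk '
        /^[ \t]*#/ || /^[ \t]*$/ { print; next }
        # Field 3 is the filesystem type, field 4 the option list.
        NF >= 4 && $3 == "ext4" && ("," $4 ",") !~ /,nodiscard,/ {
            $4 = $4 ",nodiscard"
        }
        { print }
    '
}
```

Running `add_nodiscard < /etc/fstab` shows the proposed result for review before anything is written back.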

good theory, but, no luck :-(
I tried a few things, and the only thing I can say is that the last working distro is "stretch" with the 4.9 kernel.
Bullseye doesn't work either, same issue.
Any other ideas?
Maybe it's time for the ls-chlv2 to rest in peace....

Could you install smartmontools and then post the output of smartctl -a /dev/sda?

here you go:

smartctl.log

I don't think it's a hardware issue on the HDD side, but I don't know if it's on the Linkstation side.
Yesterday, I tried a different HDD (Samsung), same problem.
The only thing I can say is that I get a stable Debian if I use "stretch" (4.9).
Because of that, I really think the problem is software-related.

Would it help if I installed debian-stretch and posted a dmesg or something?

I should probably start out by saying that I believe you when you say it’s likely not the hard drive. There are a bunch of similar issues out there for various sata controllers, a lot of those threads end in dismissive comments similar to “hard drives wear out bro”….. drives me crazy.

The SMART data helps me understand a little more about the drive. I might even have the same model here somewhere. It doesn’t look like the drive is reporting massive CRC errors/etc which could hint at some things.
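
A minimal sketch for checking that one figure, assuming the drive reports the common UDMA_CRC_Error_Count attribute and smartctl prints its usual ten-column attribute table. `crc_error_count` is a hypothetical helper name:

```shell
# Hypothetical helper: print the raw value (10th column) of the
# UDMA_CRC_Error_Count attribute from smartctl attribute output on stdin.
# Prints nothing if the drive does not report that attribute.
crc_error_count() {
    awk '$2 == "UDMA_CRC_Error_Count" { print $10 }'
}
```

Usage would be `smartctl -A /dev/sda | crc_error_count`. A count that keeps climbing across boots would point at the cable or link rather than the platters.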

I would have expected most of the kirkwood devices to have the same issues since I would have thought they all have the same sata controller …. but I would have expected to hear about this from a lot of people unless this is somehow specific to this model or possibly types of drives.

Are you able to confirm whether the issue happens/happened on buster or bullseye? Narrowing down the kernel version that first had the issue might point us at what the issue might be.

You could also try EXT3, and possibly not mounting the data partition to see if there’s a filesystem component to the issue. The TRIM thing was specific to EXT4 on pretty recent kernels but there could be other such things that recently became default.

Hello, for my LS-XHL with 256 MB of memory, the only way to keep the device usable is to stay on the 4.19 kernel from Debian buster (10). With all newer kernels, it hangs like yours.

@1000001101000 okay I will try a few things and report back.
@rems28 yeah, I remember getting stable behaviour with a 4.x kernel, so I will try buster first

I compiled a new kernel, 4.19.301, on this device with the Debian .config from buster. After reboot, it continues to work like a charm, and uname tells me that I am now on the new kernel.

compiled on the device? I bet that took a while! That’s a good step forward.

ideally you could now try 5.10 and confirm that is broken…. then start trying kernels in-between to narrow down what kernel the issue started with. Once you’ve narrowed it down sufficiently we can look at the changes to relevant sata/fs stuff and try to determine what caused the problem.

you can probably save a lot of time grabbing armel “marvell” kernel packages from debian’s archive instead of building each one.

https://snapshot.debian.org/

Yeah, it took approximately 30 hours on the device, but that's not important for me.
I built the kernel from the kernel.org source and did not apply any Debian patches. Maybe one in the long list causes a problem on Kirkwood CPUs.
This one? https://sources.debian.org/src/linux/6.5.13-1/debian/patches/bugfix/arm/arm-dts-kirkwood-fix-sata-pinmux-ing-for-ts419.patch/
As far as I can see, the hard drive has not reset even once at boot since I built the new kernel, which is much better behaviour for me.
For testing, I will try with another hard drive.

That patch is in a device tree for a different device, it wouldn’t have any effect on yours.

If you wanted to determine if the problem was with my kernel or Debian’s specifically you’d need to build the same kernel version to compare.

I've tried with 6.1 kernel today and it hangs at boot.

Excellent, it sounds like you’ve now confirmed 4.19 works and 6.1 doesn’t. If bullseye didn't work, 5.10 probably doesn’t either, but I’d check that next. From that point there are relatively few versions to check between 4.19 and 5.10.

if you can narrow it down to that point we might be able to figure out what changes to the sata driver, filesystems, etc might be the cause and start working on a fix

I think I might have found the cause, though it took me a few days of "research".

As above, my hard drive would hard reset several times during the boot process. Examining dmesg output showed an error involving MPP pin 10 being assigned to power-hdd when already assigned to serial 0 (aka UART0). Or words to that effect. The dmesg log also showed that the hard drive was often being connected at lower than expected speeds.

After checking through the kirkwood-88f6281 hardware documentation, kirkwood.dtsi, and kirkwood-6281.dtsi, I edited the Bookworm device tree kirkwood-lschlv2.dts file, changing "serial@12000" to "serial@12100" so that UART1 using MPP pins 13 and 14 would be active, instead of UART0 which previously used MPP pins 10 and 11. (Though serial output might be useful to help debug what's going on, the connection points on the PCB aren't known to me.)

After making this change, then generating and installing a new "debian_bookworm_armel.img", the boot process no longer has the hard drive resetting, and dmesg output no longer contains any warnings. The NAS reliably boots from cold in about 65 seconds. The dmesg logs also show the hard drive consistently using UDMA/133.

The above is probably not the best way to resolve the problem, but I'm happy with it so far. I've learnt a lot about the Linux boot process, device trees, and Marvell Kirkwood processors, which has kept me entertained.

Anyway, thanks for Debian_on_Buffalo.

(I would do a pull request, but don't really know how!)
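
The device tree edit described above can be sketched as a one-line substitution. `switch_to_uart1` is a hypothetical helper name, and the sketch assumes the .dts enables the serial port through a node literally named `serial@12000`, as described in the comment:

```shell
# Hypothetical helper: print a copy of the given .dts with the enabled
# serial node switched from UART0 (serial@12000, MPP 10/11) to
# UART1 (serial@12100, MPP 13/14).
# The input file is left untouched; redirect stdout to save the result.
switch_to_uart1() {
    sed 's/serial@12000/serial@12100/g' "$1"
}
```

For example, `switch_to_uart1 kirkwood-lschlv2.dts > kirkwood-lschlv2-uart1.dts` (the output name is only an illustration). The edited source still has to be compiled into a .dtb before it can be installed.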

Sounds like solid work to me.

Might explain why I’ve not seen that with mine since I typically test with really low power ssds.

I’ll see if I can repeat your findings.

I was able to confirm making that change for the UART made the error about MPP10 go away. I went ahead and updated the repo version right away.

the new dtb can be installed by:

  • copy it to /etc/flash-kernel/dtbs/
  • run flash-kernel to generate new boot files
  • reboot
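
The three steps above can be sketched as a small wrapper. `install_dtb` and the `DRY_RUN` variable are hypothetical names, and the real steps need root. By default the function only prints what it would run:

```shell
# Hypothetical wrapper around the three steps. With DRY_RUN unset, each
# command is only printed (the expansion falls back to "echo").
# Run as root with DRY_RUN set to an empty string to execute for real.
install_dtb() {
    dtb=$1
    ${DRY_RUN-echo} cp "$dtb" /etc/flash-kernel/dtbs/ &&
    ${DRY_RUN-echo} flash-kernel &&
    ${DRY_RUN-echo} reboot
}
```

`install_dtb kirkwood-lschlv2.dtb` previews the commands, and `DRY_RUN= install_dtb kirkwood-lschlv2.dtb` runs them. The file name is an assumption based on the .dts name mentioned earlier in the thread.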

I haven't verified the serial console works, trying that will be a task for another day.
https://web.archive.org/web/20160829014742/http://buffalo.nas-central.org/wiki/Serial_and_JTAG_port_LS-XHL

Hello, do you think this is suitable for the ls-xhl too?

Almost certainly.

Could you verify if you're getting that same MPP10 message in dmesg?
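
A rough way to check, as a sketch: the exact wording of the message is not quoted in this thread, so the pattern below is an assumption that matches any line mentioning MPP10 or a pin that was "already requested". `pinmux_conflicts` is a hypothetical helper name:

```shell
# Hypothetical helper: filter kernel log text on stdin for lines that look
# like a pin-mux conflict. The pattern is a guess at the wording, not a
# quote of the actual kernel message.
pinmux_conflicts() {
    grep -iE 'mpp ?10|already requested'
}
```

Usage would be `dmesg | pinmux_conflicts`; no output suggests the conflict is not being reported on that device.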

I do not have a cable for serial debugging, but on a blank hard drive I've tested the change in the ls-xhl.dts file from the kernel.org source, and the drive works like a charm on bullseye now. Before, it was impossible to use that Debian version after an upgrade, and it was impossible to install from scratch with the Debian installer. So I think that pjt-15e has certainly found the solution to the issue.
I will now try an upgrade to bookworm and report the status tomorrow.

Wow, nice find! It works like a charm on my lschl-v2 too! Thanks for that. That patch should be pushed to Debian's repo too!

The update to Debian 12 went well. I do not see any problems at the moment.
Would it be a better idea to send a patch directly to the kernel team?

I’ve updated the ls-xhl dtb with the same change and generated new installer images.