ibm-s390-linux/s390-tools

inconsistent/incomplete /dev/disk/by-id links

sharkcz opened this issue · 14 comments

We are experiencing a situation where the /dev/disk/by-id/... symlinks are inconsistent across reboots. Sometimes links for all disks/DASDs are present, sometimes only a (different) subset is.

[root@openshift-8 ~]# ll /dev/disk/by-id/
total 0
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-0X5422-part1 -> ../../dasda1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-0X5422-part2 -> ../../dasda2
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-0X5622 -> ../../dasdc
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-0X5622-part1 -> ../../dasdc1
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-0X5722 -> ../../dasdd
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-0X5722-part1 -> ../../dasdd1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0230.22.00000000000027200000000000000000-part1 -> ../../dasda1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0230.22.00000000000027200000000000000000-part2 -> ../../dasda2
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0230.22-part1 -> ../../dasda1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0230.22-part2 -> ../../dasda2
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-IBM.750000000FRB71.0232.22 -> ../../dasdc
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-IBM.750000000FRB71.0232.22.00000000000027200000000000000000 -> ../../dasdc
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0232.22.00000000000027200000000000000000-part1 -> ../../dasdc1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0232.22-part1 -> ../../dasdc1
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-IBM.750000000FRB71.0233.22 -> ../../dasdd
lrwxrwxrwx. 1 root root 11 Dec  7 10:19 ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000 -> ../../dasdd
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000-part1 -> ../../dasdd1
lrwxrwxrwx. 1 root root 12 Dec  7 10:19 ccw-IBM.750000000FRB71.0233.22-part1 -> ../../dasdd1
[root@openshift-8 ~]# ll /dev/disk/by-id/
total 0
lrwxrwxrwx. 1 root root 11 Dec  7 10:48 ccw-0X5722 -> ../../dasdd
lrwxrwxrwx. 1 root root 12 Dec  7 10:48 ccw-0X5722-part1 -> ../../dasdd1
lrwxrwxrwx. 1 root root 11 Dec  7 10:48 ccw-IBM.750000000FRB71.0233.22 -> ../../dasdd
lrwxrwxrwx. 1 root root 11 Dec  7 10:48 ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000 -> ../../dasdd
lrwxrwxrwx. 1 root root 12 Dec  7 10:48 ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000-part1 -> ../../dasdd1
lrwxrwxrwx. 1 root root 12 Dec  7 10:48 ccw-IBM.750000000FRB71.0233.22-part1 -> ../../dasdd1
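To tell which devices lost their links in a given boot (as in the two listings above), a minimal sketch can check whether any symlink in /dev/disk/by-id resolves to each DASD block device. `missing_byid` is a hypothetical helper, not part of s390-tools. It assumes the links are relative paths ending in the kernel device name, as they are in the listings.

```shell
# missing_byid DIR DEV... (hypothetical helper, not part of s390-tools)
# Prints each DEV (e.g. dasdd, dasdd1) for which no symlink in DIR points to it.
missing_byid() {
    local dir=$1
    shift
    local dev link found
    for dev in "$@"; do
        found=no
        for link in "$dir"/*; do
            # Skip non-symlinks; this also covers an empty DIR, where the
            # glob stays unexpanded.
            [ -L "$link" ] || continue
            # Links are relative (e.g. ../../dasdd1), so compare only the
            # last path component of the link target.
            if [ "$(basename "$(readlink "$link")")" = "$dev" ]; then
                found=yes
                break
            fi
        done
        if [ "$found" = no ]; then
            echo "$dev"
        fi
    done
}
```

For example, `missing_byid /dev/disk/by-id $(lsblk -rno NAME | grep '^dasd')` lists every DASD or DASD partition that has no by-id link at all. An empty result only means each device has at least one link; it does not verify that the full set of names is present.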

The environment is Fedora 35 with kernel-5.14.18-300.fc35.s390x and s390utils-core-2.17.0-2.fc35.s390x (the version shouldn't matter much, as etc/udev/rules.d/59-dasd.rules hasn't changed in a long time, apart from the scheduler setting).

Fedora 35 with kernel-5.15.6-200.fc35.s390x doesn't seem to have the /dev/disk/by-id directory at all; looking further ...

Related: https://bugzilla.redhat.com/show_bug.cgi?id=1963192

Hmm, now I understand it even less: booting kernel 5.14.18 with rd.udev.debug, the journal is full of LINK messages for the by-id symlinks from the 59-dasd rules file, yet nothing is there, not even the /dev/disk/by-id directory ...

I wonder if there is a race condition between creating the actual symlinks and creating the /dev/disk/by-id/ directory ...
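One way to test this hypothesis is to compare the symlinks udev has recorded for each device with what actually exists under /dev. If the udev database lists by-id names that are absent from the filesystem, the links were computed but never created, or were removed again afterwards. A sketch, assuming `udevadm info --query=symlink` prints the names relative to /dev, separated by spaces; `check_db_links` is a hypothetical helper name.

```shell
# check_db_links ROOT (hypothetical helper)
# Reads space-separated symlink names relative to ROOT on stdin (as printed by
# "udevadm info --query=symlink") and prints each one missing under ROOT.
# Only the existence of the symlink itself is checked, not its target.
check_db_links() {
    local root=$1 name
    # Put one name per line, then test each path under ROOT.
    tr -s ' ' '\n' | while read -r name; do
        [ -n "$name" ] || continue
        if [ ! -L "$root/$name" ]; then
            echo "$root/$name"
        fi
    done
}
```

Usage sketch on a live system:

    for dev in $(lsblk -rno NAME | grep '^dasd'); do
        udevadm info --query=symlink --name="$dev" | check_db_links /dev
    done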

I think messages like these explain the missing symlinks:

...
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-0X5722', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-0X5722', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22.00000000000027200000000000000000', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-0X5722', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-0X5722', removing
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: Updating old device symlink '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22', which is no longer belonging to this device.
Dec 10 10:48:30 openshift-8.s390.bos.redhat.com systemd-udevd[420]: dasdd: No reference left for '/dev/disk/by-id/ccw-IBM.750000000FRB71.0233.22', removing
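To see at a glance which links udev removed during a boot, the "No reference left" messages can be pulled out of the journal. A minimal sketch that relies only on the message text shown in the excerpt; the helper name `removed_links` is hypothetical.

```shell
# removed_links (hypothetical helper)
# Reads systemd-udevd journal lines on stdin and prints, sorted and
# de-duplicated, every path reported as "No reference left for '...', removing".
removed_links() {
    sed -n "s/.*No reference left for '\([^']*\)', removing.*/\1/p" | sort -u
}
```

For example: `journalctl -b | grep systemd-udev | removed_links`. Comparing this list against the current content of /dev/disk/by-id shows whether every missing link is one that udev explicitly removed.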

Ping me for a full log.

I'm currently unable to access the Red Hat BZ; I need to get an account first.
Can you share more details about your setup? How are the DASDs configured? Did you use chzdev -e to get a persistent configuration? Is it always the same DASDs that are affected, or is it random?

I've rebooted several times on an LPAR with 10 DASDs persistently configured using chzdev -e on a freshly installed F35 (tried both 5.15.6-200.fc35.s390x and 5.14.10-300.fc35.s390x + s390utils-2:2.17.0-2.fc35.s390x), but haven't been able to reproduce the issue so far.

My environment is:

  • z13 with z/VM 6.4.0
  • the guest is Fedora 35 with kernel-5.14.18-300.fc35.s390x and systemd-udev-249.7-2.fc35.s390x
  • DASDs configured with rd.dasd= and /etc/dasd.conf
  • which links are present/missing is purely random

The original report is from OCP/RHEL-8.x with z/VM 7.2.0 on z13 and z15.

I suspect something is wrong in how udev or the kernel handles the devices, rather than in the udev rules in s390utils, which are pretty straightforward.

20220105-1028-udev.log.zip
created with:

journalctl -b | grep systemd-udev > 20220105-1028-udev.log
ls -l /dev/disk/by-id/ >> 20220105-1028-udev.log
lsdasd

I was able to reproduce the issue myself now with:

  • LPAR z14
  • F35 5.14.10-300.fc35.s390x and systemd-udev-0:249.7-2.fc35.s390x
  • DASDs configured with /etc/dasd.conf

I didn't see the problem when using chzdev to configure the devices persistently. Maybe you can give it a try to see how it behaves on your setup. Make sure to remove all DASDs from /etc/dasd.conf and then enable them via chzdev -e <devices> (you can specify a range here as well, e.g. 9300-930f).
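When migrating an existing setup, the bus IDs to pass to chzdev can be taken from /etc/dasd.conf before it is emptied. A sketch, assuming the usual dasd.conf layout of one bus ID per line optionally followed by attributes, with `#` comments; `dasd_conf_ids` is a hypothetical helper, and per-device attributes such as use_diag=0 are not carried over by it.

```shell
# dasd_conf_ids FILE (hypothetical helper)
# Prints the device bus IDs listed in a dasd.conf-style file: the first field
# of every line, skipping blank lines and '#' comments. Attributes after the
# bus ID are ignored and would have to be set separately.
dasd_conf_ids() {
    awk '{ sub(/#.*/, "") } NF { print $1 }' "$1"
}
```

The IDs can then be enabled one by one, e.g. `for id in $(dasd_conf_ids /etc/dasd.conf); do chzdev -e dasd "$id"; done`, or as a single range where the devices are contiguous.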

I'll dig a bit deeper to see where the problem might be.

Have you also removed the rd.dasd= definitions from the kernel command line and relied purely on the zdev "rootfs mode"?

Right now I am testing with /etc/dasd.conf completely removed (both from the system and from the initrd) and still don't get 100% success (4x all links created, 1x no links at all, 1x links for dasdc1 only).

I believe using the zdev persistent config doesn't matter. I have fully converted my system to zdev for the DASDs and am still getting random results with the by-id links. I suspect the problem is deeper in udev or the kernel.