intel/ledmon

ledmon stopped working after upgrade from debian 10 (buster) to debian 11 (bullseye)

casparsmit opened this issue · 22 comments

Hi,

I have an issue with ledmon after upgrading my debian installation from debian 10 to debian 11.

Debian 10 uses ledmon version 0.90-0.1
Debian 11 uses ledmon version 0.95-1

My system has a Supermicro X9DRD-7LN4F motherboard and a BPN-SAS2-846EL1 24 slots backplane with an LSI expander.

On version 0.90-0.1 ledctl works fine and i can identify my disks without issue.
However after upgrading to debian 11 (and ledmon to 0.95-1) it always throws the following error:

ledctl: /dev/sdX: device not supported
ledctl: IBPI LOCATE: missing block device(s)... pattern ignored.

When i then manually downgrade the ledmon package on debian 11 to version 0.90-0.1 it starts working again so it doesn't seem to be a kernel/driver issue but an issue in the ledmon package itself.

If i need to provide more information or do any testing, please let me know.

Kind regards,
Caspar

Hi @casparsmit,
First, I would like to determine for which controller the bug was introduced. Please provide the output of:

# ls -l /sys/block
# ledctl --list-controllers
# ledctl failure=/dev/sdX --all

The last command should produce more logs. Please check your /var/log/ledmon.log too.

Hi,

Here are the outputs:

ls -l /sys/block

total 0
lrwxrwxrwx 1 root root 0 Apr 11 12:55 dm-0 -> ../devices/virtual/block/dm-0
lrwxrwxrwx 1 root root 0 Apr 11 12:55 dm-1 -> ../devices/virtual/block/dm-1
lrwxrwxrwx 1 root root 0 Apr 11 12:55 loop0 -> ../devices/virtual/block/loop0
lrwxrwxrwx 1 root root 0 Apr 11 12:55 loop1 -> ../devices/virtual/block/loop1
lrwxrwxrwx 1 root root 0 Apr 11 12:55 loop2 -> ../devices/virtual/block/loop2
lrwxrwxrwx 1 root root 0 Apr 11 12:55 loop3 -> ../devices/virtual/block/loop3
lrwxrwxrwx 1 root root 0 Apr 11 12:55 loop4 -> ../devices/virtual/block/loop4
lrwxrwxrwx 1 root root 0 Apr 11 12:55 loop5 -> ../devices/virtual/block/loop5
lrwxrwxrwx 1 root root 0 Apr 11 12:55 loop6 -> ../devices/virtual/block/loop6
lrwxrwxrwx 1 root root 0 Apr 11 12:55 loop7 -> ../devices/virtual/block/loop7
lrwxrwxrwx 1 root root 0 Apr 11 12:55 sda -> ../devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Apr 11 12:55 sdb -> ../devices/pci0000:00/0000:00:02.2/0000:03:00.0/host6/port-6:0/expander-6:0/port-6:0:0/end_device-6:0:0/target6:0:0/6:0:0:0/block/sdb

Note: /dev/sda is connected to the onboard SATA port of the motherboard. /dev/sdb is the only device connected atm to the backplane / LSI SATA controller.

btw, the SAS controller is a LSI 2308:
#lspci
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

ledctl --list-controllers

/sys/devices/pci0000:00/0000:00:02.2/0000:03:00.0 (SCSI)
/sys/devices/pci0000:00/0000:00:1f.2 (AHCI)

ledctl failure=/dev/sdb --all

ledctl: /dev/sdb: device not supported
ledctl: IBPI FAILURE: missing block device(s)... pattern ignored.

/var/log/ledmon.log doesn't give more info unfortunately

Apr 11 13:19:41 ERROR: /dev/sdb: device not supported
Apr 11 13:19:41 WARNING: IBPI FAILURE: missing block device(s)... pattern ignored.

Thanks in advance,
Caspar
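
For readers following along, the `ls -l /sys/block` and `ledctl --list-controllers` outputs above can be matched by hand: a block device belongs to the controller whose sysfs path is a prefix of the device's resolved sysfs path. The sketch below illustrates that matching only; `controller_for` is a hypothetical helper, and it is an assumption that this mirrors what ledctl does internally.

```shell
#!/bin/sh
# Hypothetical helper (not part of ledmon): print the first controller
# sysfs path that is a directory prefix of the given device sysfs path.
controller_for() {
    dev_path="$1"; shift
    for ctrl in "$@"; do
        case "$dev_path" in
            # The trailing slash avoids matching 1f.2 against 1f.20
            "$ctrl"/*) echo "$ctrl"; return 0 ;;
        esac
    done
    return 1
}

# Example use on a live system (paths taken from the outputs above):
#   controller_for "$(readlink -f /sys/block/sdb)" \
#       /sys/devices/pci0000:00/0000:00:02.2/0000:03:00.0 \
#       /sys/devices/pci0000:00/0000:00:1f.2
```

With the paths from this report, /dev/sdb resolves under the SCSI controller at 0000:03:00.0, so the controller itself is detected and the "device not supported" message has to come from somewhere else.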

Hi,
Thanks for sharing it. I need your support because I don't have the hardware available.
First, could you check the 0.93 release (the first release with automake support)?
https://github.com/intel/ledmon/releases/tag/0.93
Please follow the README, it is helpful.
If the issue is not reproducible there, then please try to use git bisect to determine when the regression was added.
For testing, you can compile and run <ledmon_dir>/src/ledctl locate=/dev/sdX directly, without installation.
Let me know if you need our support.

Thanks,
Mariusz
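
The git bisect suggestion above could be automated roughly as follows. This is a minimal sketch, not a tested recipe for this repository: the function names and the script name are hypothetical, the build steps assume that older revisions build with plain make and newer ones with autogen.sh plus configure, and the binary is assumed to land in ./src/ledctl. It relies on the documented `git bisect run` convention that exit code 0 means good, 125 means skip, and other codes from 1 to 127 mean bad.

```shell
#!/bin/sh
# Hypothetical helpers for `git bisect run` (names are illustrative).

# Map ledctl output to a bisect verdict: 1 (bad) if the error from this
# report appears, 0 (good) otherwise.
classify_ledctl_output() {
    case "$1" in
        *"device not supported"*) return 1 ;;
        *) return 0 ;;
    esac
}

# Build the current revision. Assumption: revisions with autogen.sh use
# automake, earlier ones build with plain make.
build_ledmon() {
    if [ -x ./autogen.sh ]; then
        ./autogen.sh && ./configure || return 1
    fi
    make
}

# One bisect step: build, run ledctl against the given device, classify.
# Revisions that fail to build are skipped (exit code 125).
bisect_step() {
    build_ledmon >/dev/null 2>&1 || return 125
    [ -x ./src/ledctl ] || return 125
    out=$(./src/ledctl "locate=$1" 2>&1)
    classify_ledctl_output "$out"
}

# Usage (run as root; <bad-rev> and <good-rev> are placeholders):
#   git bisect start <bad-rev> <good-rev>
#   git bisect run sh -c '. /path/to/bisect-ledctl.sh; bisect_step /dev/sdb'
```

Stale build artifacts can carry over between revisions, so adding a clean step before each build may be necessary.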

Hi Mariusz,

Thanks for the reply, i tried to compile v0.93 but i seem to be hitting a compile bug:

https://pastebin.com/XCe1PRzE

I searched for the error and found the following thread (looks the same):

https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1711683.html

Kind regards,
Caspar

Please start with this one then:
d468759

Ok i did the following, please correct me if i did anything wrong:

git checkout tags/0.93
git cherry-pick d468759
./autogen.sh
./configure --enable-systemd
make
./src/ledctl locate=/dev/sdb

Still the same error: device not supported

Thanks, looks good.
So, please use cherry-pick to analyze commits between 0.90 and 0.93.
Please note that there is no automake support so you will need to compile it via make.

Ok, i compiled and tested v0.90 and that one works.

Then i compiled v0.91-fixed and that version already doesn't work anymore with the following output (maybe this gives a hint?):

LOTS of repeated messages saying:

IPMI Error: c1
ledctl: Unable to determine Dell Server type

and finally:

ledctl: /dev/sdb: device not supported
ledctl: IBPI LOCATE: missing block device(s)... pattern ignored.

So somewhere between 0.90 and 0.91 something broke.

I started cherry-picking all commits between 0.90 and 0.91 and it broke at commit f498582

That is surprising.
Could you start ledmon.service for a while, stop it and then check ledctl?
I want to generate shared memory conf.

It would be best to kill the ledmon process, to prevent it from removing the shared memory file.
Could you try?

I let ledmon run for a while, did some ledctl commands but nothing seems to happen.

Looking at the code a shared conf file should be created in the root (/ledmon.conf) but monitoring the root directory it doesn't seem to get created at all. (maybe / isn't the best place for it either).

Also after killing ledmon it doesn't show up.

It is created in /dev/shm/:

snprintf(share_conf_path, sizeof(share_conf_path), "/dev/shm%s",
              LEDMON_SHARE_MEM_FILE);

please look into doc:
https://man7.org/linux/man-pages/man3/shm_open.3.html

For portable use, a shared memory object should be identified by
a name of the form /somename; that is, a null-terminated string
of up to NAME_MAX (i.e., 255) characters consisting of an initial
slash, followed by one or more characters, none of which are
slashes.
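
To make the naming rule concrete, here is a small sketch that validates a shared memory object name against the portable form quoted above and prints where it appears on Linux, where such objects are exposed under /dev/shm. `shm_path` is a hypothetical helper, and "/ledmon.conf" as the value of LEDMON_SHARE_MEM_FILE is inferred from this thread rather than checked against the source.

```shell
#!/bin/sh
# Hypothetical helper: check that NAME has the portable shm_open form
# (one leading slash, at least one more character, no further slashes,
# at most 255 characters) and print the matching /dev/shm path.
shm_path() {
    name="$1"
    rest="${name#/}"
    [ "$name" != "$rest" ] || return 1   # must start with a slash
    [ -n "$rest" ] || return 1           # needs characters after the slash
    case "$rest" in
        */*) return 1 ;;                 # no further slashes allowed
    esac
    [ "${#name}" -le 255 ] || return 1   # NAME_MAX limit from the man page
    printf '/dev/shm%s\n' "$name"
}

# Example:
#   shm_path /ledmon.conf
```

So a name that looks like a file in the root directory is only an object name; the file to look for is the one under /dev/shm.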

Please kill the process with

kill -9 $(pidof ledmon)

This prevents the ledmon clean-up from being executed.

Could you also retest with this commit reverted on 0.95 for confirmation?

Okay, sorry about that, i'm not a developer.

It may be nothing, but "cat" gives a different output than "more" (trailing newlines)

cat /dev/shm/ledmon.conf

BLINK_ON_INIT=1
BLINK_ON_MIGR=1
LOG_LEVEL=3
LOG_PATH=/var/log/ledmon.log
RAID_MEMBERS_ONLY=1
REBUILD_BLINK_ON_ALL=1
INTERVAL=10
WHITELIST=
BLACKLIST=

more /dev/shm/ledmon.conf

BLINK_ON_INIT=1
BLINK_ON_MIGR=1
LOG_LEVEL=3
LOG_PATH=/var/log/ledmon.log
RAID_MEMBERS_ONLY=1
REBUILD_BLINK_ON_ALL=1
INTERVAL=10
WHITELIST=
BLACKLIST=
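
Since the two listings look the same as pasted, a byte-level view is one way to see what actually differs. The sketch below counts NUL bytes in a file; that the difference comes from NUL padding in the shared memory file is only a guess, not something confirmed in this thread, and `count_nul_bytes` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: count NUL bytes in a file. tr -cd deletes every
# byte except NUL, and wc -c counts what is left.
count_nul_bytes() {
    n=$(tr -cd '\000' < "$1" | wc -c)
    echo $((n))
}

# Example use on the file from this thread:
#   count_nul_bytes /dev/shm/ledmon.conf
#   od -c /dev/shm/ledmon.conf | tail    # character-level dump of the end
```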

I shall now test 0.95 with this commit reverted

On the latest git:

git revert f498582

Auto-merging src/ledmon.c
CONFLICT (content): Merge conflict in src/ledmon.c
Auto-merging src/ledctl.c
CONFLICT (content): Merge conflict in src/ledctl.c
Auto-merging src/config_file.h
Auto-merging src/config_file.c
CONFLICT (content): Merge conflict in src/config_file.c
CONFLICT (modify/delete): src/Makefile deleted in HEAD and modified in parent of f498582 (Use shared memory to share configuration between ledmon and ledctl.). Version parent of f498582 (Use shared memory to share configuration between ledmon and ledctl.) of src/Makefile left in tree.
error: could not revert f498582... Use shared memory to share configuration between ledmon and ledctl.
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add ' or 'git rm '
hint: and commit the result with 'git commit'

Ok, i edited the 0.95 src and manually reverted the f498582 commit, compiled, and it works again. So definitely that commit breaks something.

Thank you for detailed analysis.
We will try to determine the root cause and fix it. I will create an internal task for that.
I won't close this report; we will keep it open.
You can try to determine root cause yourself and submit the patch :)

Ok i found the problem i think.

I noticed that the shared config file had:

RAID_MEMBERS_ONLY=1

So when i created a MD RAID on one of my disks it did work!

So since ledctl and ledmon are now sharing the same config with this commit, non RAID drives are now "not supported" by ledctl if this option is set in /etc/ledmon.conf

So all in all this is not a bug but by design.

It would be nice though if ledctl could "ignore" that setting because i would like to be able to identify ANY disk in the system (not just RAID members) but while monitoring the system (using the ledmon daemon) i'm only interested in status of actual raid disks.
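
For anyone hitting the same message, a quick check along these lines may help. It is a rough sketch under stated assumptions: the helper names are hypothetical, the MD membership test is a sysfs heuristic (a whole disk that is an MD member normally has an mdN entry under its holders directory) and not ledmon's own logic, and the sysfs root is a parameter only so the functions can be tried against a fake tree.

```shell
#!/bin/sh
# Print the value of KEY from a KEY=VALUE file (last assignment wins).
conf_get() {
    sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Heuristic, NOT ledmon's logic: succeed if <sysfs>/block/<disk>/holders
# contains an md* entry. $2 is the sysfs root, defaulting to /sys.
is_md_member() {
    for h in "${2:-/sys}/block/$1/holders"/md*; do
        [ -e "$h" ] && return 0
    done
    return 1
}

# Report whether RAID_MEMBERS_ONLY=1 is likely to hide DISK from ledctl.
check_disk() {
    conf="$1" disk="$2" sysfs="${3:-/sys}"
    if [ "$(conf_get "$conf" RAID_MEMBERS_ONLY)" = "1" ] &&
        ! is_md_member "$disk" "$sysfs"; then
        echo "$disk: filtered (RAID_MEMBERS_ONLY=1, not an MD member)"
    else
        echo "$disk: visible"
    fi
}

# Example use:
#   check_disk /etc/ledmon.conf sdb
```

If the disk is reported as filtered, setting RAID_MEMBERS_ONLY=0 in /etc/ledmon.conf is the obvious workaround, at the cost of ledmon also monitoring non-RAID drives.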

Yeah, agreed. At least a warning is expected.
We will discuss it to choose best solution.

Thanks for your debug. Really appreciate that!

Hello,
The problem is complex. By default, ledctl runs one ledmon loop and sets the states of drives accordingly, even if that was not requested. This behavior is configurable via the "listed-only" flag.
We decided to drop this functionality. As a result, there will be no need to pass many of those parameters; however, it could result in different behavior in some non-standard cases. We accept the regression risk here.

After fixing it, we will release a new ledctl version shortly.

Hello @casparsmit,
Please take a look into #114 and retest on your side.