SCSI device scan fails on some kernels
errordeveloper opened this issue · 11 comments
While testing custom kernels, I have often seen the following error in the pod status:
write /sys/class/scsi_host/host1/scan invalid argument
This is caused by:
Lines 321 to 348 in c1d24f9
I wonder whether this provides some critical functionality in the Kubernetes context, or whether a configuration parameter could be added to turn it off without a major compromise. From the original PR #163 it sounds as if this was not a critical feature as such, but I'm not sure if it has become one since then.
I guess aside from adding a toggle for this, there could be a way to fix it, e.g. by turning this error into a warning or by being less opportunistic about how the SCSI scan is performed?
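To make the "toggle or warning" idea concrete, here is a minimal sketch in Python (not the agent's own language, and not its actual code). The function name `scan_scsi_hosts`, the `enabled` switch, and the `- - -` wildcard payload are all assumptions made for illustration; the payload the agent really writes may differ.

```python
import glob
import logging
import os

log = logging.getLogger("scsi-scan")


def scan_scsi_hosts(sysfs_root="/sys/class/scsi_host", payload="- - -", enabled=True):
    """Ask every SCSI host to rescan; return the hosts whose scan failed.

    Hypothetical helper: a failed write is logged as a warning instead of
    being raised, and enabled=False skips the scan entirely.
    """
    failed = []
    if not enabled:
        return failed
    for host in sorted(glob.glob(os.path.join(sysfs_root, "host*"))):
        scan_file = os.path.join(host, "scan")
        try:
            with open(scan_file, "w") as f:
                f.write(payload)
        except OSError as err:
            # Covers EINVAL ("invalid argument") as well as a missing scan file.
            log.warning("SCSI scan of %s failed: %s", scan_file, err)
            failed.append(os.path.basename(host))
    return failed
```

With this shape, a host that rejects the write only shows up in the returned list and in the log, so the caller can decide whether a later device lookup failure should be fatal.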
@jodh-intel @devimc I have enabled debug logs, but I have not been able to tell much, as what comes out in the journal is very verbose and mostly unreadable because most messages get doubly quoted etc. Is there a tool that people use to parse it? I can see some references to SCSI, but the context looks like XML that was escaped a bunch of times, and I don't know whether I'm supposed to find any clues in that XML-looking message or whether it's just random noise, hence I didn't bother 😸
To be clear, my pod doesn't have any volumes. It would help to know whether this arises from kata passing the default secret to the VM as a SCSI device; I have no idea how to tell whether that is the case or it's something else 🤔
I've double-checked the kernel config that I used, and there are indeed a few differences, although SCSI is generally enabled.
#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=m
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_MQ_DEFAULT is not set
# CONFIG_SCSI_PROC_FS is not set
#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set
And kata's config has the following:
#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_MQ_DEFAULT=y
CONFIG_SCSI_PROC_FS=y
#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
# CONFIG_BLK_DEV_SR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set
But given that, I don't see anything that would affect the sysfs interface, or am I missing something?
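For anyone comparing configs the same way, a small Python sketch like the following can list the differing symbols mechanically. The helper names `parse_kconfig` and `diff_kconfig` are made up for this example, and treating a symbol that is absent from one fragment as "not set" is an assumption.

```python
def parse_kconfig(text):
    """Map each CONFIG_* symbol to its value; 'is not set' becomes 'n'."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # Drop the leading "# " and the trailing " is not set".
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(mine, reference):
    """Return {symbol: (mine, reference)} for symbols whose values differ.

    Assumption: a symbol missing from one side counts as 'n' (not set).
    """
    a, b = parse_kconfig(mine), parse_kconfig(reference)
    return {
        name: (a.get(name, "n"), b.get(name, "n"))
        for name in sorted(set(a) | set(b))
        if a.get(name, "n") != b.get(name, "n")
    }
```

Fed the two fragments above, this reports four symbols: `CONFIG_RAID_ATTRS` (m vs. not set), `CONFIG_SCSI_MQ_DEFAULT` (not set vs. y), `CONFIG_SCSI_PROC_FS` (not set vs. y) and `CONFIG_BLK_DEV_SR` (y vs. not set).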
@errordeveloper - https://github.com/kata-containers/tests/tree/master/cmd/log-parser might be able to help you as it can convert the logs to various formats.
@errordeveloper any update on this (and/or the related #768)? Is this still an issue/blocker?
@devimc - any thoughts on those config differences for SCSI above?
It is still an issue; I am using a fork from my PR. I think this could really be to do with many disks being passed from the host; isn't that what it is attempting to do?
@errordeveloper A bit late here. But looking at the kernel configs you posted, those should be sufficient.
Can you add debug logs to check if the path /sys/class/scsi_host/host1/scan actually exists in your case?
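As a rough illustration of that check, the Python sketch below lists every host directory and whether it exposes a `scan` file. It would have to run inside the guest, where that sysfs tree lives; the function name `list_scan_files` is hypothetical, and the real debug logging would live in the agent itself.

```python
import os


def list_scan_files(sysfs_root="/sys/class/scsi_host"):
    """Return {host_name: True/False} telling whether <host>/scan exists.

    An empty dict means the scsi_host class directory itself is missing.
    """
    report = {}
    if not os.path.isdir(sysfs_root):
        return report
    for name in sorted(os.listdir(sysfs_root)):
        scan_file = os.path.join(sysfs_root, name, "scan")
        report[name] = os.path.isfile(scan_file)
    return report
```

Printing the result next to the failing write would show whether `host1` is missing its `scan` file or whether the file exists and the kernel is rejecting the written value.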
I would also like to understand what storage driver you are using and the pod yaml you are using.
@errordeveloper any updates on this issue?
I am not working on this project right now, but the PoC can be found in https://github.com/errordeveloper/kube-test-cluster-operator and if someone wants to dig into this, I am happy to guide them.
We still have this functionality in the new agent here:
https://github.com/kata-containers/kata-containers/blob/2.0-dev/src/agent/src/device.rs#L167
Since we don't see this error on the kernel we provide, we are closing this issue.
Feel free to reopen this issue if needed.
If this is causing a problem for users with custom kernels, the agent could be modified (PRs welcome! ;) to check for the existence of /sys/bus/scsi/ maybe?
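The guard suggested here could look roughly like this, sketched in Python purely to show the control flow (the agent is not written in Python). `maybe_scan` and its `scan` callback are hypothetical names, not part of the agent.

```python
import os


def maybe_scan(scan, sysfs_bus="/sys/bus/scsi"):
    """Run scan() only when the SCSI bus directory exists.

    Returns True if the scan was attempted, False if it was skipped
    because the kernel does not expose sysfs_bus.
    """
    if not os.path.isdir(sysfs_bus):
        return False
    scan()
    return True
```

A caller would pass whatever function performs the write to the `scan` files, so a kernel without the SCSI bus skips the step and every other kernel behaves as before.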