iqlusioninc/tmkms

Protocol error: and USB error

matilote opened this issue · 3 comments

Hi team,

We've recently noticed a series of errors and 5-10min downtime of our 2 validator nodes. We saw a series of logs without an error explanation such as:

Nov 13 15:19:09 localhost tmkms[414223]: 2023-11-13T15:19:09.804901Z ERROR tmkms::client: [cosmoshub-4@tcp://127.0.0.1:26658] signing operation failed: signature error: protocol error: protocol error: USB error: No such device (it may have been disconnected)
Nov 13 15:19:20 ts-db4-srv-010 tmkms[414223]: 2023-11-13T15:19:20.995922Z ERROR tmkms::client: [cosmoshub-4@tcp://127.0.0.1:26658] protocol error:
Nov 13 15:19:32 ts-db4-srv-010 tmkms[414223]: 2023-11-13T15:19:32.003878Z ERROR tmkms::client: [cosmoshub-4@tcp://127.0.0.1:26658] protocol error:
Nov 13 15:19:43 ts-db4-srv-010 tmkms[414223]: 2023-11-13T15:19:43.011879Z ERROR tmkms::client: [cosmoshub-4@tcp://127.0.0.1:26658] protocol error:

The tmkms is running with 0.12.2 version. There tmkms service was restarted and we have observed these logs shortly after the restart. There are 5 networks in total configured in tmkms.toml.

Would you please advise what could be an issue there? Note that 3/5 networks were signing without issues and the 2 validators were affected seqeuntially (first one was down and then second network began to struggle with above errors).

We noticed similar error when trying listing down keys:

error: couldn't get object info for asymmetric key #4: protocol error: protocol error: USB error: No such device (it may have been disconnected)

Any leads on how to avoid such errors and not encounter larger downtimes when making changes to the config and restarting the service? (Going to upgrade to 0.13.0 soon, but wanted to first check if you've seen such errors before and if you know potential root causes).

The main error is: USB error: No such device (it may have been disconnected)

You can also check dmesg to see if there are any relevant USB events occurring at the same time.

If you see the YubiHSM being disconnected as registered by the kernel, perhaps try reseating it in the USB port?

[Mon Nov 13 15:20:42 2023] usb 2-8: reset full-speed USB device number 3 using xhci_hcd
[Mon Nov 13 15:20:43 2023] usb 2-13: reset full-speed USB device number 4 using xhci_hcd
[Mon Nov 13 15:20:43 2023] usb 2-8: reset full-speed USB device number 3 using xhci_hcd
[Mon Nov 13 15:20:43 2023] usb 2-8: reset full-speed USB device number 3 using xhci_hcd
[Mon Nov 13 15:20:44 2023] usb 2-13: reset full-speed USB device number 4 using xhci_hcd
[Mon Nov 13 15:20:44 2023] usb 2-8: reset full-speed USB device number 3 using xhci_hcd
[Mon Nov 13 15:20:53 2023] usb 2-8: reset full-speed USB device number 3 using xhci_hcd
[Mon Nov 13 15:20:54 2023] usb 2-13: reset full-speed USB device number 4 using xhci_hcd
[Mon Nov 13 15:20:54 2023] usb 2-8: reset full-speed USB device number 3 using xhci_hcd
[Mon Nov 13 15:20:54 2023] usb 2-8: usbfs: process 414227 (stride-1@tcp://) did not claim interface 0 before use

There are indeed reset events occurring around that time. I am not able to physically access the device for the time being. Not sure if there is anything that can be potentially done remotely? I think we can close this issue as it seems to be related more to the YubiHSM itself.

Looks like the reset is done on each service restart. We have enabled connector_server in order to interact with keys while not touch KMS. This removed the protocol errors and improved overall operations.