tillitis/tkey-ssh-agent

[windows] tk-sign app hangs during long term stress testing

Closed this issue · 2 comments

When running the test-loop.sh script from the communication test branch, the loop eventually hangs after some number of signatures, for example:

loop count: 77286, seconds passed: 50256
Connecting to device on serial port COM8 ...
Public Key from device (UID 00010:20:3:04050607): ebb95cac6cb6fb662a231363cfeb6d5a132866ed2fee55dc8efc7bc8c7726ffe
Sending a 128 bytes message for signing.
Signature over message by device: 1a0f93609b37952e7a4018930ecd5aeb97883dac32e6115e10ecefc07b93db4ae3b83f4bc3a923e6237e3b1dbdfd612634392b8fcc1b6c1f8379aface3a0fe0d
Signature verified.

Inspecting the process monitor, it does appear that tk-sign is the part that has hung:

$ ps -e |grep tk-sign
    59211   63534   63534       6240  pty4      197609 08:18:50 /c/Users/matt/Other-Repos/tillitis-key1-apps/tk-sign
     3629   63545   63545      19564  pty2      197609 20:48:03 /c/Users/matt/Other-Repos/tillitis-key1-apps/tk-sign
    57328   63531   63531      16824  pty0      197609 19:14:59 /c/Users/matt/Other-Repos/tillitis-key1-apps/tk-sign
    37230   63573   63573      19276  pty3      197609 10:55:21 /c/Users/matt/Other-Repos/tillitis-key1-apps/tk-sign

This testing was performed on a desktop computer running Windows 10 with Go version "go1.19.3 windows/amd64", installed from the official package. The Go apps were compiled in a MinGW shell, using (I believe):

go build ./cmd/runapp
go build ./cmd/tk-sign

The RISC-V signerapp app.bin firmware was compiled in an Ubuntu Docker container and copied into the ./apps/signerapp directory. Similarly, the application FPGA gateware was compiled in the same Ubuntu Docker container, using the version in the communication_test_automation branch (which should be unchanged from the current main branch). This gateware was then flashed onto 4 mta1-usb-v1 boards (two prototypes and two from the OSFC production batch, all with similar components), and the USB sticks were plugged into USB ports on the desktop motherboard. A test loop script was then started for each device, targeting its specific serial port like this:

USB_DEVICE=COM99 ./test-loop.sh

The devices were left to run overnight. Two hung early, one at 4928 loops and one at 13498 loops, and a third hung much later at 77286 loops. The fourth is still running and is up to 93500 signatures:

image

On Windows, all of the hangs so far have occurred after the 'Signature verified.' message is displayed. This test was also repeated on Linux with a different failure mode, which will be posted as a second issue.

I suspect that there are at least two issues here: a surface issue, that the tk-sign app isn't able to time out on failed communications, and a root-cause issue somewhere lower than that. I'd be happy to run these tests again if there's a debug output that can be enabled that would be more useful.

dehanj commented

I have been investigating USB-related issues lately and have been able to come to a conclusion.
You are right that there are two issues.

The major one is that the CH552 can drop bytes in certain circumstances in the direction UART -> USB, i.e., TKey to client.
The second one is that the client programs are implemented using readfull(), which has no timeout and reads until the buffer is full. If a byte is dropped, it never arrives, and the client program hangs.

I have a fix for it in this PR, which more or less just protects the input/output buffers and makes sure interrupts are not fiddling with the indices of said buffers.

I'm not able to reproduce the issue after the fix.
So, if you want to try it, feel free; otherwise I will close this issue now.