Rahix/tbot

Test case with picocom channel fails with "some subprocess(es) did not stop"

Closed this issue · 5 comments

Very simple test case fails with error:

$ newbot ts.hello_world.hi -c config.temp_config
tbot starting ...
├─Calling hi ...
│ ├─[local] picocom -b 115200 /dev/ttyUSB1
│ ├─[dm-board-linux] echo Hello World
│ │ ## Hello World
│ ├─test case finished
│ └─Fail. (1.717s)
├─Warning: Found dangling <class 'config.temp_config.DmBoardLinux'> instance in this context
├─Warning: Found dangling <class 'config.temp_config.DmBoard'> instance in this context
├─Warning: Teardown went wrong! A <class 'config.temp_config.DmBoardLinux'> instance is still alive.
│ Please report this to https://github.com/rahix/tbot/issues!
├─Warning: Teardown went wrong! A <class 'config.temp_config.DmBoard'> instance is still alive.
│ Please report this to https://github.com/rahix/tbot/issues!
├─Exception:
│ Traceback (most recent call last):
│ File "/usr/lib/python3.8/contextlib.py", line 510, in exit
│ if cb(*exc_details):
│ File "/usr/lib/python3.8/contextlib.py", line 120, in exit
│ next(self.gen)
│ File "/home/vemelin/.local/lib/python3.8/site-packages/tbot/machine/connector/common.py", line 132, in _connect
│ yield ch
│ File "/home/vemelin/.local/lib/python3.8/site-packages/tbot/machine/channel/channel.py", line 561, in exit
│ self.close()
│ File "/home/vemelin/.local/lib/python3.8/site-packages/tbot/machine/channel/channel.py", line 538, in close
│ self._c.close()
│ File "/home/vemelin/.local/lib/python3.8/site-packages/tbot/machine/channel/subprocess.py", line 130, in close
│ raise tbot.error.TbotException("some subprocess(es) did not stop")
│ tbot.error.TbotException: some subprocess(es) did not stop
├─────────────────────────────────────────
└─FAILURE (1.789s)

code hello_world.py:

import tbot


@tbot.testcase
def hi():
    with tbot.ctx.request(tbot.role.BoardLinux) as lnx:
        lnx.exec0("echo",
                  "Hello",
                  "World")
        tbot.log.message(f"test case finished")

code temp_config.py:

import tbot
from tbot.machine import connector, board, linux


class DmBoard(connector.ConsoleConnector, board.Board):
    baudrate = 115200
    serial_port = "/dev/ttyUSB1"

    def connect(self, mach):
        return mach.open_channel("picocom", "-b", str(self.baudrate), self.serial_port)


class DmBoardLinux(board.Connector, linux.Ash):
    pass


def register_machines(ctx: tbot.Context):
    ctx.register(DmBoard, tbot.role.Board)
    ctx.register(DmBoardLinux, tbot.role.BoardLinux)

You need to fix it by increasing wait time in
tbot/tbot/machine/channel/subprocess.py
line 118
Why did you limit exponential backoff to 1 sec? It could be greater without affecting anything else
For me for t in range(11): works perfect. range(7) not enough.

Hi,

the backoff limit was intentionally chosen to be quite low because usually, when subprocesses don't stop, the correct solution is to add code that cleanly terminates the offending processes. In your case, the solution is probably to send a termination escape sequence to picocom:

    def connect(self, mach):
        ch = mach.open_channel("picocom", "-b", str(self.baudrate), self.serial_port)
        try:
          yield ch
        finally:
          ch.sendcontrol("A")
          ch.sendcontrol("Q")

That said, I'd be open to extending the backoff time a bit but displaying a warning after one second that something isn't right.

@Rahix thanks for your suggestion (just contextlib.contextmanager decorator is needed to add). It works, but it works even this way:

    @contextlib.contextmanager
    def connect(self, mach):
        ch = mach.open_channel("picocom", "-b", str(self.baudrate), self.serial_port)
        yield ch

Didn't dive deep to it.
And in both cases it can raise tbot.error.ChannelClosedError if spam this test very often. Maybe it's not important but increasing backoff time doesn't have this problem (for t in range(8): was enough for me)

So maybe worth it to extend backoff time or to add the trick above to documentation. Up to you.

but it works even this way:

Yeah, it is a timing race between tbot's timeout and your picocom process stopping. It will work some days and some days it will fail.

To avoid ugly situations like this, I have taken great care to not make use of timeouts in tbot whereever possible. You ran into one of the few remaining ones...

I have now created #120 to hopefully improve on this. Please let me know if this works for you.

or to add the trick above to documentation

So I think the real solution here is to add a dedicated picocom connector to tbot, which implements this behavior instead of just documenting what people need to do.

And actually, this picocom connector already exists - in a way: The AutoConsoleConnector from tbot_contrib includes support for picocom, using the exact code I showed above. But as I just noticed, this connector never made it into the documentation 🫣 Need to fix that asap...