LNST-project/lnst

lnst-agent crashes on controller KeyboardInterrupt during Iperf

Closed this issue · 5 comments

Running a standard SimpleNetworkRecipe, waiting until IperfClient and IperfServer are both running, then sending SIGINT to the controller (or just hitting ^C) crashes both agents with the following exception:

Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/lnst-MdZhQDa5-py3.11/bin/lnst-agent", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/lnst/lnst/Agent/__main__.py", line 45, in main
    agent.run()
  File "/root/lnst/lnst/Agent/Agent.py", line 1000, in run
    self._process_msg(msg[1])
  File "/root/lnst/lnst/Agent/Agent.py", line 1065, in _process_msg
    job.join()
    ^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'join'

The controller calls machine.cleanup(), which calls Agent._job_context.cleanup() through RPC. That in turn calls _kill_all_jobs(), which calls kill() on every job stored in _dict. Each killed job then sends a "job_finished" message through the child pipe. Once all jobs have been killed this way, _dict is emptied.

Then _process_msg() is called on the "job_finished" message and tries to .join() the job, which is None because it is no longer found in JobContext's _dict. Not sure how to fix this. @olichtne @jtluka
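A minimal sketch of the sequence described above, to make the ordering easier to follow. It does not use the real LNST classes: FakeJob, the list standing in for the child pipe, and process_msg() are hypothetical simplifications, and only the names JobContext, get_job(), cleanup() and _dict come from the description.

```python
class FakeJob:
    """Hypothetical stand-in for an agent job; not the real LNST class."""

    def __init__(self, job_id, pipe):
        self.id = job_id
        self._pipe = pipe

    def kill(self):
        # A killed job reports back with a "job_finished" message.
        self._pipe.append({"type": "job_finished", "job_id": self.id})

    def join(self):
        pass


class JobContext:
    """Simplified model of the behaviour described in this issue."""

    def __init__(self):
        self._dict = {}

    def add_job(self, job):
        self._dict[job.id] = job

    def get_job(self, job_id):
        return self._dict.get(job_id)

    def cleanup(self):
        for job in self._dict.values():
            job.kill()
        # The dict is emptied before the queued messages are processed.
        self._dict = {}


def process_msg(ctx, msg):
    job = ctx.get_job(msg["job_id"])  # None: the job is already gone
    job.join()


pipe = []  # stands in for the child pipe
ctx = JobContext()
ctx.add_job(FakeJob(1, pipe))
ctx.cleanup()

error = None
try:
    process_msg(ctx, pipe.pop(0))
except AttributeError as exc:
    error = str(exc)
print(error)
```

Under these assumptions the handler runs only after cleanup() has dropped the job, so the lookup returns None and the join() call raises the same AttributeError as in the traceback.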

I think this could be solved by moving the join() and set_finished() calls to the JobContext class, and imo it makes sense to have all job-related tasks within the JobContext. The Agent would only forward the message to the Controller.
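A rough sketch of what that split could look like, assuming simplified stand-in classes. FakeJob, finish_job() and the forward callback are hypothetical names used only for illustration; the real Agent and JobContext code may differ.

```python
class FakeJob:
    """Hypothetical stand-in for an agent job; not the real LNST class."""

    def __init__(self, job_id):
        self.id = job_id
        self.joined = False
        self.finished = False

    def kill(self):
        pass

    def join(self):
        self.joined = True

    def set_finished(self):
        self.finished = True


class JobContext:
    def __init__(self):
        self._dict = {}

    def add_job(self, job):
        self._dict[job.id] = job

    def finish_job(self, job_id):
        """Hypothetical helper: reap a job if the context still tracks it."""
        job = self._dict.get(job_id)
        if job is None:
            return False  # already reaped, e.g. by cleanup()
        job.join()
        job.set_finished()
        return True

    def cleanup(self):
        # Kill and reap every job here, then forget them all.
        for job in self._dict.values():
            job.kill()
            job.join()
            job.set_finished()
        self._dict = {}


def process_msg(ctx, msg, forward):
    # The agent no longer touches the job object; it only forwards the message.
    ctx.finish_job(msg["job_id"])
    forward(msg)


ctx = JobContext()
job = FakeJob(1)
ctx.add_job(job)
ctx.cleanup()

sent = []  # stands in for the connection to the controller
msg = {"type": "job_finished", "job_id": 1}
process_msg(ctx, msg, sent.append)
```

With this layout a late "job_finished" message for an already cleaned-up job is still forwarded, and the missing job is handled inside JobContext instead of crashing the agent loop.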

That would mean we'd also have to call job.join() explicitly in kill_job(), and probably in other places too. Not sure.

So any time someone would call job.kill(), they'd also have to call job.join() and job.set_finished()?

What if we just didn't clear the job dict in JobContext.cleanup() and maybe instead introduced JobContext.pop_job(), which would be something similar to JobContext.get_job(), but would also delete the job from the dict? We could then call this method in the "job_finished" handler instead of JobContext.get_job().
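A minimal sketch of the pop_job() idea, again with simplified stand-ins. FakeJob, the list used as the pipe, and process_msg() are hypothetical; pop_job() is the proposed method, not something that exists in the code base yet.

```python
class FakeJob:
    """Hypothetical stand-in for an agent job; not the real LNST class."""

    def __init__(self, job_id, pipe):
        self.id = job_id
        self._pipe = pipe
        self.joined = False
        self.finished = False

    def kill(self):
        # A killed job reports back with a "job_finished" message.
        self._pipe.append({"type": "job_finished", "job_id": self.id})

    def join(self):
        self.joined = True

    def set_finished(self):
        self.finished = True


class JobContext:
    def __init__(self):
        self._dict = {}

    def add_job(self, job):
        self._dict[job.id] = job

    def get_job(self, job_id):
        return self._dict.get(job_id)

    def pop_job(self, job_id):
        # Like get_job(), but also removes the job from the dict.
        return self._dict.pop(job_id, None)

    def cleanup(self):
        # Kill the jobs but keep them in the dict until they report back.
        for job in self._dict.values():
            job.kill()


def process_msg(ctx, msg):
    job = ctx.pop_job(msg["job_id"])
    if job is None:
        return  # unknown or already handled job
    job.join()
    job.set_finished()


pipe = []  # stands in for the child pipe
ctx = JobContext()
jobs = [FakeJob(i, pipe) for i in (1, 2)]
for job in jobs:
    ctx.add_job(job)

ctx.cleanup()
while pipe:
    process_msg(ctx, pipe.pop(0))
```

In this sketch the dict empties itself as the "job_finished" messages are handled. One possible downside is that an entry would stay in the dict if its message never arrived, so that case may need separate handling.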

WDYT? @jtluka @olichtne

I'll have to try to reproduce this and inspect the state before I can be sure how to solve it. I'll try to do that this week.