simonsobs/socs

Failing tests on HWP Emulation Branch

Closed this issue · 6 comments

I'm trying to figure out why tests are failing in the testing github action, while they all succeed when I run locally. These are tests that are running on on the HWP Emulation / PID restructure PR (#606), however the failing tests are in the PMX common and integration test files, which are not touched at all by #606.

For example, this testing run has two failing tests with logs below... We haven't seen these before, so its possible it is some race condition that is occurring in the testing docker, however neither Brian nor I can reproduce on our local machines. This also only happens when running tests all at once, and not when running the PID and PMX tests by themselves, as brian mentions here

__________________________ test_pmx_cmd_check_current __________________________

tcp_emulator = <socs.testing.device_emulator.DeviceEmulator object at 0x7f7a7f15b670>

    @pytest.mark.integtest
    def test_pmx_cmd_check_current(tcp_emulator):
        cmd = create_command()
        tcp_emulator.define_responses({'MEAS:CURR?': '1'})
    
        msg, val = cmd.user_input('C?')
>       assert val == 1
E       assert -999.0 == 1

common/test_pmx.py:125: AssertionError
---------------------------- Captured stderr setup -----------------------------
DeviceEmulator: Device emulator waiting for tcp client connection
------------------------------ Captured log setup ------------------------------
INFO     DeviceEmulator:device_emulator.py:245 Device emulator waiting for tcp client connection
----------------------------- Captured stdout call -----------------------------
Connected to TCP IP 127.0.0.1 at port 19002
WARNING! Could not get correct current value! | Response = "-999.0"
----------------------------- Captured stderr call -----------------------------
DeviceEmulator: Client connection made from ('127.0.0.1', 33034)
DeviceEmulator: msg='SYST:REM'
DeviceEmulator: response='test'
DeviceEmulator: responses set to {'MEAS:CURR?': '1'}
DeviceEmulator: default response set to 'None'
------------------------------ Captured log call -------------------------------
INFO     DeviceEmulator:device_emulator.py:247 Client connection made from ('127.0.0.1', 33034)
DEBUG    DeviceEmulator:device_emulator.py:259 msg='SYST:REM'
DEBUG    DeviceEmulator:device_emulator.py:270 response='test'
INFO     DeviceEmulator:device_emulator.py:336 responses set to {'MEAS:CURR?': '1'}
INFO     DeviceEmulator:device_emulator.py:338 default response set to 'None'
___________________________ test_hwp_rotation_set_on ___________________________

wait_for_crossbar = None
kikusui_emu = <socs.testing.device_emulator.DeviceEmulator object at 0x7f7a7efd93a0>
run_agent = None, client = OCSClient('hwp-pmx')

    @pytest.mark.integtest
    def test_hwp_rotation_set_on(wait_for_crossbar, kikusui_emu, run_agent, client):
        responses = default_responses.copy()
        responses.update({
            'output 1': '',
            'output?': '1'
        })
        kikusui_emu.define_responses(responses)
        resp = client.set_on()
        print(resp)
        print(resp.session)
        assert resp.status == ocs.OK
>       assert resp.session['success']
E       assert False

integration/test_hwp_pmx_agent_integration.py:56: AssertionError
---------------------------- Captured stdout setup -----------------------------
/home/runner/work/socs/socs/tests
---------------------------- Captured stderr setup -----------------------------
DeviceEmulator: Device emulator waiting for tcp client connection
DeviceEmulator: Client connection made from ('127.0.0.1', 50824)
DeviceEmulator: msg='meas:curr?'
DeviceEmulator: encountered error 'meas:curr?'
DeviceEmulator: msg='meas:curr?'
DeviceEmulator: encountered error 'meas:curr?'
------------------------------ Captured log setup ------------------------------
INFO     DeviceEmulator:device_emulator.py:245 Device emulator waiting for tcp client connection
INFO     DeviceEmulator:device_emulator.py:247 Client connection made from ('127.0.0.1', 50824)
DEBUG    DeviceEmulator:device_emulator.py:259 msg='meas:curr?'
INFO     DeviceEmulator:device_emulator.py:167 encountered error 'meas:curr?'
DEBUG    DeviceEmulator:device_emulator.py:259 msg='meas:curr?'
INFO     DeviceEmulator:device_emulator.py:167 encountered error 'meas:curr?'
----------------------------- Captured stdout call -----------------------------
OCSReply: OK : Operation "set_on" is currently not running (FAILED).
  set_on[session=2]; status=done with ERROR 0.003792 s ago, took 15.5 s
  messages (3 of 3):
    1706803956.113 Status is now "starting".
    1706803971.602 CRASH: [Failure instance: Traceback: <class 'RuntimeError'>: Could not connect to PID controller
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/internet/defer.py:913:errback
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/internet/defer.py:981:_startRunCallbacks
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/internet/defer.py:1075:_runCallbacks
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/internet/defer.py:1946:_gotResultInlineCallbacks
--- <exception caught here> ---
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/internet/defer.py:1996:_inlineCallbacks
/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/python/failure.py:519:throwExceptionIntoGenerator
../socs/agents/hwp_pmx/agent.py:140:set_on
../socs/agents/hwp_pmx/agent.py:328:_process_actions
../socs/agents/hwp_pmx/agent.py:30:process
/home/runner/work/socs/socs/socs/agents/hwp_pmx/drivers/PMX_ethernet.py:93:turn_on
/home/runner/work/socs/socs/socs/agents/hwp_pmx/drivers/PMX_ethernet.py:62:send_message
/home/runner/work/socs/socs/socs/agents/hwp_pmx/drivers/PMX_ethernet.py:41:_establish_connection
]
    1706803971.603 Status is now "done".
  other keys in .session: op_code, data
{'session_id': 2, 'op_name': 'set_on', 'op_code': 6, 'status': 'done', 'success': False, 'start_time': 1706803956.1130335, 'end_time': 1706803971.6025774, 'data': {}, 'messages': [[1706803956.1130335, 'Status is now "starting".'], [1706803971.6016514, "CRASH: [Failure instance: Traceback: <class 'RuntimeError'>: Could not connect to PID controller\n/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/internet/defer.py:913:errback\n/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/internet/defer.py:981:_startRunCallbacks\n/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/internet/defer.py:1075:_runCallbacks\n/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/internet/defer.py:1946:_gotResultInlineCallbacks\n--- <exception caught here> ---\n/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/internet/defer.py:1996:_inlineCallbacks\n/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/twisted/python/failure.py:519:throwExceptionIntoGenerator\n../socs/agents/hwp_pmx/agent.py:140:set_on\n../socs/agents/hwp_pmx/agent.py:328:_process_actions\n../socs/agents/hwp_pmx/agent.py:30:process\n/home/runner/work/socs/socs/socs/agents/hwp_pmx/drivers/PMX_ethernet.py:93:turn_on\n/home/runner/work/socs/socs/socs/agents/hwp_pmx/drivers/PMX_ethernet.py:62:send_message\n/home/runner/work/socs/socs/socs/agents/hwp_pmx/drivers/PMX_ethernet.py:41:_establish_connection\n]"], [1706803971.6025774, 'Status is now "done".']]}
----------------------------- Captured stderr call -----------------------------
DeviceEmulator: responses set to {'meas:volt?': '2', 'meas:curr?': '1', ':system:error?': '+0,"No error"\n', 'stat:ques?': '0', 'volt:ext:sour?': 'source_name', 'output 1': '', 'output?': '1'}
DeviceEmulator: default response set to 'None'
------------------------------ Captured log call -------------------------------
INFO     DeviceEmulator:device_emulator.py:336 responses set to {'meas:volt?': '2', 'meas:curr?': '1', ':system:error?': '+0,"No error"\n', 'stat:ques?': '0', 'volt:ext:sour?': 'source_name', 'output 1': '', 'output?': '1'}
INFO     DeviceEmulator:device_emulator.py:338 default response set to 'None'

I think I understand a little bit about what's happening here... enough to make the tests pass on the Emulation branch, but not the full story.

I think this is related to the following race condition in tests like the one below:

@pytest.mark.integtest
def test_hwp_rotation_set_on(wait_for_crossbar, kikusui_emu, run_agent, client):
responses = default_responses.copy()
responses.update({
'output 1': '',
'output?': '1'
})
kikusui_emu.define_responses(responses)
resp = client.set_on()
print(resp)
print(resp.session)
assert resp.status == ocs.OK
assert resp.session['success']

Here the run_agent and kikusui_emu fixtures make sure we have an initialized agent and emulator on startup, however, the emulator responses are not defined until inside the test.

With the newly restructured agents, the OCS main task begins on startup, before the device-emulator responses are defined, so it will fail to get a response from the emulator the initial iterations of the main process`. This may have been causing some strange behavior.

To fix these tests, I made it so that the device-emulator fixture always started the emulator with a set of responses that satisfied the "main" operation, and then in each test, the emulator responses were updated to be whatever they needed to be for that particular test, which makes it so that the tests pass.

A few things that still confuse me:

  • I don't know how exactly this causes the error above, since from the error message it looks like the main task has made it to the "process" fn of the SetOn action.
  • Why is this happening now in this branch, and not on the main branch?
  • Why is this only happening in github actions?

This could all be caused by subtleties of the race condition and error handling of the main task, but that is not very satisfying.

I think one thing we should try to look into is bringing OCS logs from integration into stdout. I think part of what makes debugging this so difficult is that it is currently impossible to see agent logs in the failing tests.

I notice that DeviceEmulator starts a thread. But it doesn't explicitly wait for that thread to terminate, when shutdown is called. I think it would be wise to make sure there is no outstanding activity there.

Also, both shutdown and the thread (_read_socket) call _conn.close() and _sock.close(). Probably at roughly the same time. I don't know if that's a problem but it's definitely suspicious to be doing those, at about the same time, from two different threads.

Those are good suggestions... I'm not exactly sure if that explains this problem, but worth doing. I think we should keep the connection local to the _read_socket fn and wrap it in a try/finally block, and then just wait for that thread to end in shutdown.

Yeah, agreed. The symptom when the connection didn't close cleanly (and the agent didn't handle dropped connections well) was typically the agent hanging and preventing further tests from running, if I recall correctly. So I'm also not sure this is the problem we're seeing here, but I'd be happy to implement better handling there.

I think one thing we should try to look into is bringing OCS logs from integration into stdout. I think part of what makes debugging this so difficult is that it is currently impossible to see agent logs in the failing tests.

The agents get run as a subprocess, and do raise their stdout and stderr when they run into problems shutting down. Perhaps we could also raise this if requested for debugging.

We were able to track down the problem causing Brian's tests to fail (from this comment) as well.

To debug, brian was able to recreate in a vagrant VM, and sent me the configuration so I could test. On my computer I was not able to recreate, and brian found that his laptop was running with a slower CPU than my desktop. I was able to recreate after throttling the cpu in the VM.

The problem was easy to find when checking the agent logs, which was that the tests were submitting actions before the main process had reached the clear_queue here, so those actions were being cancelled before the main process entered its loop.

This was fixed by having tests wait until the main process had some session.data before running any tasks.