simonsobs/socs

Timeout crashes HWP PID main process

BrianJKoopman opened this issue · 0 comments

Saw this this morning on satp1:

2024-10-28 11:50:39.622 2024-10-28T11:50:39+0000 DecodedResponse(msg_type='measure', msg='Current frequency = 1.999', measure=1.999)
2024-10-28 11:50:39.622 2024-10-28T11:50:39+0000 Finding target CHWP Frequency
2024-10-28 11:50:40.123 2024-10-28T11:50:40+0000 ['R014007D0']
2024-10-28 11:50:40.123 2024-10-28T11:50:40+0000 DecodedResponse(msg_type='read', msg='Setpoint = 2.0', measure=2.0)
2024-10-28 11:50:40.123 2024-10-28T11:50:40+0000 Finding CHWP Direction
2024-10-28 11:50:42.626 2024-10-28T11:50:42+0000 Caught timeout waiting for response from PID controller. Trying again...
2024-10-28 11:50:44.290 2024-10-28T11:50:44+0000 ['R02400000']
2024-10-28 11:50:44.290 2024-10-28T11:50:44+0000 DecodedResponse(msg_type='read', msg='Direction = Forward', measure=0)
2024-10-28 11:50:44.491 2024-10-28T11:50:44+0000 Finding CHWP Frequency
2024-10-28 11:50:44.991 2024-10-28T11:50:44+0000 ['\x00\x00\x00\x00\x00\x00\x00\x00\x00X012.001']
2024-10-28 11:50:44.991 2024-10-28T11:50:44+0000 DecodedResponse(msg_type='error', msg='Unrecognized response', measure=None)
2024-10-28 11:50:44.991 2024-10-28T11:50:44+0000 Error reading freq: Unrecognized response
2024-10-28 11:50:44.991 2024-10-28T11:50:44+0000 ['\x00\x00\x00\x00\x00\x00\x00\x00\x00X012.001']
2024-10-28 11:50:44.992 2024-10-28T11:50:44+0000 DecodedResponse(msg_type='error', msg='Unrecognized response', measure=None)
2024-10-28 11:50:44.992 2024-10-28T11:50:44+0000 Error reading freq: Unrecognized response
2024-10-28 11:50:44.992 2024-10-28T11:50:44+0000 ['\x00\x00\x00\x00\x00\x00\x00\x00\x00X012.001']
2024-10-28 11:50:44.992 2024-10-28T11:50:44+0000 DecodedResponse(msg_type='error', msg='Unrecognized response', measure=None)
2024-10-28 11:50:44.992 2024-10-28T11:50:44+0000 Error reading freq: Unrecognized response
2024-10-28 11:50:44.993 2024-10-28T11:50:44+0000 main:104 CRASH: [Failure instance: Traceback: <class 'ValueError'>: Could not get current frequency
2024-10-28 11:50:44.993 /usr/lib/python3.10/threading.py:1016:_bootstrap_inner
2024-10-28 11:50:44.993 /usr/lib/python3.10/threading.py:953:run
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/twisted/_threads/_threadworker.py:49:work
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/twisted/_threads/_team.py:192:doWork
2024-10-28 11:50:44.993 --- <exception caught here> ---
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/twisted/python/threadpool.py:269:inContext
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/twisted/python/threadpool.py:285:<lambda>
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/twisted/python/context.py:117:callWithContext
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/twisted/python/context.py:82:callWithContext
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/ocs/ocs_agent.py:984:_running_wrapper
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/socs/agents/hwp_pid/agent.py:174:main
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/socs/agents/hwp_pid/agent.py:121:_get_data_and_publish
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/socs/agents/hwp_pid/agent.py:33:get_pid_state
2024-10-28 11:50:44.993 /opt/venv/lib/python3.10/site-packages/socs/agents/hwp_pid/drivers/pid_controller.py:248:get_freq
2024-10-28 11:50:44.993 ]
2024-10-28 11:50:44.993 2024-10-28T11:50:44+0000 main:104 Status is now "done".
2024-10-28 13:28:49.827 2024-10-28T13:28:49+0000 start called for main
2024-10-28 13:28:49.827 2024-10-28T13:28:49+0000 main:105 Status is now "starting".
2024-10-28 13:28:49.829 2024-10-28T13:28:49+0000 main:105 Status is now "running".
2024-10-28 13:28:49.830 2024-10-28T13:28:49+0000 Connected to PID controller
2024-10-28 13:28:49.831 2024-10-28T13:28:49+0000 Finding CHWP Frequency
2024-10-28 13:28:50.332 2024-10-28T13:28:50+0000 ['X011.999']
2024-10-28 13:28:50.332 2024-10-28T13:28:50+0000 DecodedResponse(msg_type='measure', msg='Current frequency = 1.999', measure=1.999)
2024-10-28 13:28:50.332 2024-10-28T13:28:50+0000 Finding target CHWP Frequency
2024-10-28 13:28:50.833 2024-10-28T13:28:50+0000 ['R014007D0']
2024-10-28 13:28:50.833 2024-10-28T13:28:50+0000 DecodedResponse(msg_type='read', msg='Setpoint = 2.0', measure=2.0)

Recovery was just restarting the main process, which you can see in the logs. Resolving this will contribute to the effort in #721.

socs image tag: v0.5.1-22-g7d2f158-dev