adobe-apiplatform/user-sync.py

Veterans Affairs experiencing sporadic UST sync hanging but not exiting

mmiddlet opened this issue · 2 comments

Sporadic UST syncs are hanging without exiting, as evidenced in the log files. Because the batch job was hung, from the OS perspective it appeared to still be running, so subsequent scheduled jobs did not run.

Support ticket E-001245190 was opened with Adobe technical support. It was closed on 5/23.

Steve Cordero stated that additional data on the hang is unavailable without a debug log, which was not being written.

Looking at the logs, I see that the last entry we received was from the run on 17/05/2024 at 18:26:16.
The pattern of runs shows that the tool is triggered 4 times per day, at around 00:00, 06:00, 12:00, and 18:00.
I do see the last run on the 17th for the 6pm run, but then nothing until the 20th at 09:11:21, which is the first log entry incoming from your side, and there was no other traffic that day. The next traffic recorded was on the 21st, for the midnight run, and then continuously until today.
The tool's logic is to log the starting command-line args to the file as soon as it starts, and then to check for the presence of the 'lockfile' inside its folder. All the functions called during this stage are inside try... except... blocks, so if any error had been picked up, it would have been logged to file. Something must have 'helped' the tool to stop or freeze.
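
To make the startup order described above concrete, here is a minimal sketch in Python. It is not the UST source: the function name `start_run`, its parameters, the shortened banner text, and the choice to refuse to continue when the lockfile exists are all assumptions made for illustration.

```python
import logging
import os


def start_run(argv, lock_path, logger):
    """Hypothetical helper: log the start banner first, then check the lockfile."""
    # The banner is written before anything else, so every launched
    # process leaves a trace in the log even if it later hangs.
    logger.info("========== Start Run: %s", " ".join(argv))
    try:
        # O_CREAT | O_EXCL fails if the lockfile already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Assumption: an existing lockfile means another run is in progress.
        logger.error("Lockfile %s exists; another run may be in progress", lock_path)
        return False
    except OSError as exc:
        # Any other failure is logged rather than swallowed.
        logger.error("Could not create lockfile %s: %s", lock_path, exc)
        return False
    os.close(fd)
    return True
```

With this ordering, a process that starts always produces a start line, whatever happens to the lockfile check afterwards. That is the property the analysis relies on.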

With that in mind, one very obvious and critical fact takes a UST failure out of the equation: the tool is supposed to be triggered by the task scheduler 4 times per day, but for the 18th you have only one entry, at midnight. That run did not reach our servers, but the line was logged as a UST started event. So where are the log entries for the subsequent runs at 6am, 12pm, etc.? Even if a process is still running, the task would start a new process at the specified time and the tool would again log to file that it started: "========== Start Run (User Sync version:" is ALWAYS logged when it starts, if logging to file is on.
The above tells me that the tool was not triggered by the task scheduler for the rest of the runs, and that the one run that did start hung for some reason local to that machine. If there were security events logged, I'd start the analysis from there.
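
The same check for missing start lines can be scripted. The sketch below scans log lines for the start banner quoted above and reports gaps longer than the 6-hour schedule. The timestamp format (`YYYY-MM-DD HH:MM:SS` in the first 19 characters of each line), the function name `find_missed_runs`, and the 30-minute tolerance are assumptions; adjust them to the actual log layout.

```python
from datetime import datetime, timedelta

START_MARKER = "========== Start Run"  # banner text quoted in the analysis above
TS_FORMAT = "%Y-%m-%d %H:%M:%S"        # assumption: adjust to the real log's timestamp format


def find_missed_runs(lines,
                     expected_interval=timedelta(hours=6),
                     tolerance=timedelta(minutes=30)):
    """Return (previous_start, next_start) pairs whose gap exceeds the schedule."""
    starts = []
    for line in lines:
        if START_MARKER not in line:
            continue
        try:
            starts.append(datetime.strptime(line[:19], TS_FORMAT))
        except ValueError:
            continue  # line does not begin with a timestamp in the assumed format
    gaps = []
    for prev, cur in zip(starts, starts[1:]):
        if cur - prev > expected_interval + tolerance:
            gaps.append((prev, cur))
    return gaps
```

Passing an open log file (an iterable of lines) to `find_missed_runs` returns the pairs of start times between which at least one scheduled run left no start line. Those windows can then be compared against Task Scheduler history and security events on the machine.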

Since this appears to be environmental and not something we can necessarily address in the UST, I am closing the issue.