[Bug]: Job error due to race condition in persisting vs reading workspace cache
saurabh-prakash commented
Is there an existing issue for this?
- I have searched the existing issues
What is the current behavior?
Some jobs fail because of a race condition between persisting the workspace in the discovery phase and reading it in the execution phase.
After discovery completes, the discovery phase calls the /test-list API, which schedules the execution pods. At that point the workspace has not yet been persisted. If an execution pod spins up early, it does not find the workspace cache and exits with a non-zero exit code, which fails the job and triggers cleanup. The cleanup can in turn fail the discovery pod, which may still be writing into the area that was just cleaned up.
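The race can be illustrated with a minimal Python sketch. It is not the TAS implementation: `persist_workspace`, `wait_for_workspace`, the `.tmp` staging suffix, and the timeout values are all hypothetical names and numbers chosen for illustration. It assumes the writer and reader share a filesystem, and shows one possible mitigation: the writer publishes the cache with a single rename, and the reader polls for a bounded time instead of exiting on the first miss.

```python
import os
import shutil
import tempfile
import threading
import time


def persist_workspace(cache_dir, files):
    """Discovery side (hypothetical): write into a staging directory,
    then publish it with one rename so readers never see a partial cache."""
    staging = cache_dir + ".tmp"
    os.makedirs(staging)
    for name, content in files.items():
        with open(os.path.join(staging, name), "w") as handle:
            handle.write(content)
    # Renaming a directory to a non-existent target on the same filesystem
    # makes the whole cache appear at once.
    os.rename(staging, cache_dir)


def wait_for_workspace(cache_dir, timeout=30.0, interval=0.5):
    """Execution side (hypothetical): poll until the cache exists or the
    deadline passes, rather than failing immediately when it is missing."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isdir(cache_dir):
            return True
        time.sleep(interval)
    return os.path.isdir(cache_dir)


def demo():
    """Simulate an execution pod that starts before the cache is persisted."""
    root = tempfile.mkdtemp()
    cache_dir = os.path.join(root, "workspace")
    try:
        # Nothing has been persisted yet: a short wait times out.
        early = wait_for_workspace(cache_dir, timeout=0.2, interval=0.05)
        # The writer finishes shortly after the reader starts waiting.
        writer = threading.Timer(
            0.2, persist_workspace, args=(cache_dir, {"tests.json": "[]"})
        )
        writer.start()
        late = wait_for_workspace(cache_dir, timeout=5.0, interval=0.05)
        writer.join()
        return early, late
    finally:
        shutil.rmtree(root, ignore_errors=True)
```

An alternative that avoids polling altogether would be to call /test-list only after the workspace has been persisted; which approach fits better depends on how the discovery pod and the scheduler are wired together.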
What is the expected behavior?
Jobs should not fail because of this race; an execution pod should find the workspace cache when it starts.
Steps To Reproduce
- Load test TAS Cloud on the cloudsploit repository; job failures occur intermittently, in roughly 1 of 5 runs.
Version
Test-at-scale Cloud