LambdaTest/test-at-scale

[Bug]: Job error due to race condition in persisting vs reading workspace cache


Is there an existing issue for this?

  • I have searched the existing issues

What is the current behavior?

Some jobs are failing because of a race condition between when the workspace is persisted in the discovery phase and when it is read in the execution phase.

The discovery phase calls the /test-list API after discovery completes, which schedules the execution pods. However, the workspace has not yet been persisted at that point, so if an execution pod spins up early it does not find the workspace cache and exits with a non-zero exitCode, eventually failing the job and triggering cleanups. That cleanup can also fail the discovery pod, which may still be writing into the area that got cleaned up.
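For illustration, a minimal sketch of how the execution pod could tolerate this ordering by polling for the workspace cache with a timeout instead of exiting as soon as it is missing. The function name, cache path, and timeout values below are assumptions for the sketch, not the project's actual code.

```go
// Hypothetical sketch: wait for the workspace cache to appear instead of
// exiting with a non-zero code the moment it is missing.
package main

import (
	"fmt"
	"os"
	"time"
)

// waitForWorkspaceCache polls for the cache path until it exists or the
// timeout expires. This gives the discovery pod time to finish persisting
// the workspace before the execution pod gives up.
func waitForWorkspaceCache(path string, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return nil // cache is present; safe to start execution
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("workspace cache %q not found within %s", path, timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Illustrative path and durations only.
	if err := waitForWorkspaceCache("/workspace/cache", 2*time.Minute, 5*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("workspace cache found; starting test execution")
}
```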

What is the expected behavior?

The jobs shouldn't fail.

Steps To Reproduce

  1. Load test TAS Cloud on the cloudsploit repository; job failures occur in roughly 1 out of 5 runs.

Version

Test-at-scale Cloud