CGRU/cgru

exception in service.py causes tasks to be skipped

ultra-sonic opened this issue · 7 comments

Hi Timur,

we are currently running into an issue where exceptions (like IOError [Errno 28] No space left on device) cause a task to be skipped and want to ask if this would be the right place to fix this:

m_update_status = af::TaskExec::UPSkip;

or should this rather be done here:
m_progress->state = m_progress->state | AFJOB::STATE_SKIPPED_MASK;

I cant find any other reference to af::TaskExec::UPSkip so I would propose setting the status to AFJOB::STATE_ERROR_MASK instead of AFJOB::STATE_SKIPPED_MASK in this case.
I also cannot imagine a case where an empty command would be sent on purpose.

do you think this is safe to do?

@sebastianelsner

Hi, Oliver!
Zero command size (an empty string) is designed to allow service to decide to skip a task for some reason.
When service check something and want to ask server to skip the task, it can set command = "".
So if your service clears command for any reason, task will be skipped.
If you want make service to set ERROR on en empty command, you can change UPSkip to UPFailedToStart in this line:

m_update_status = af::TaskExec::UPSkip;

May be i should make service to skip task in some other way, not by an empty command.
Also it will be good to have some service initialization check, an on its fail send some UPFailedToInitializeService.

Thanks for the quick reply Timur!!! Much appreciated!

May be i should make service to skip task in some other way, not by an empty command. Also it will be good to have some service initialization check, an on its fail send some UPFailedToInitializeService.

This would be great to have for sure!

If you want make service to set ERROR on an empty command, you can change UPSkip to UPFailedToStart

will af::TaskExec::UPFailedToStart really be treated as an ERROR and avoid the failing rendernode? i dont see this case setting m_progress->state = m_progress->state | AFJOB::STATE_ERROR_MASK;

Not on server, on render in this line:
https://github.com/CGRU/cgru/blob/master/afanasy/src/render/taskprocess.cpp#L121

I already added sevice initialization check. May it will solve your problem.
The next step (today-tomorrow) i will create a special function for service to skip (not by zero command).
And on zero command it will be UPFailedToStart (error for server).

You can already test it in production.
There was not any changes on server side.
All changes just on render side, an not in critical parts.
Tomorrow i will update our farm.

Great news Timur! Thanks for the quick fix! You're the man!