keras-team/keras-tuner

Distributed Keras Tuner fails when chief oracle server startup takes more than 5 minutes


We observed that distributed Keras Tuner fails in some cases.

The failure is a side effect of the fix that stops Keras Tuner from hanging forever, which introduced 5-minute RPC timeouts (see #957).
Now, if the Keras Tuner chief takes longer than 5 minutes to start, the client gives up and the whole tuning process fails.

Normally RPC server startup is quick, but in some cases it might take slightly longer.

The planned solution is to increase the client timeout to 1 hour. A timeout is still needed to prevent tuner clients from hanging forever, but it must be high enough that the chief oracle server always has enough time to start.
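
To illustrate the trade-off, here is a minimal sketch of a bounded wait for a server that may start late. It is not Keras Tuner's client code: it uses a plain TCP connection from the Python standard library instead of the real RPC layer, and the names `wait_for_server`, `timeout_s`, and `poll_interval_s` are hypothetical. The point is only that the deadline has to be longer than the slowest expected server startup, while still being finite so the client cannot hang forever.

```python
import socket
import time


def wait_for_server(host, port, timeout_s, poll_interval_s=0.5):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True if the server became reachable before the deadline,
    False if timeout_s seconds passed without a successful connection.
    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Deadline reached: give up instead of waiting forever.
            return False
        try:
            # Never block on a single attempt longer than the time left.
            attempt_timeout = min(remaining, poll_interval_s)
            with socket.create_connection((host, port), timeout=attempt_timeout):
                return True
        except OSError:
            # Server is not up yet (refused or timed out): pause, then retry.
            pause = min(poll_interval_s, max(0.0, deadline - time.monotonic()))
            time.sleep(pause)
```

With a helper like this, a client that calls `wait_for_server(host, port, timeout_s=5 * 60)` fails whenever the server needs more than 5 minutes to start, while `timeout_s=60 * 60` tolerates a slow start and still returns eventually if the server never appears.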