Failed quick restart attempts

Question

Closed this issue 3 years ago · 0 comments

Right now the starter will try to start an instance up to 100 times in a row if the instance fails quickly after restart.
100 times in quick succession is way too much. Imagine each start generates a core file or extra data on disk while restarting. Restarting that often in a row will quickly eat up disk space.
The number "100" is currently hard-coded in the starter code, but it should be configurable. It's fine that the default value is hard-coded, but it should be changeable via a command-line option. A much lower default value would be sensible.

Once the starter has accumulated 100 failed restarts, it exits with exit code 0. I am not sure if this is expected. When running the starter from a script and it fails, I would rather expect the starter to return a non-zero exit code.

Just try:

rm -rf temp
arangodb --mode single --starter.data-dir temp --all.does-not-exist=1
echo $?

This will print 0. My preference would be that this returns a non-zero exit code. It needs to be checked what common tools such as systemd or supervisord do upon different exit codes. My preference here would be that after so many failures in a row, the starter would be declared failed and not restarted.
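
To make the requested behavior concrete, here is a rough shell sketch of a bounded restart loop. It is not the starter's actual code, only an illustration of the idea: the limit is configurable, and the loop returns a non-zero exit code once too many quick failures happen in a row. The names run_with_restart_limit, MAX_QUICK_FAILURES and QUICK_FAILURE_SECS, and the default values 5 and 10, are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of the starter: run a command, restart it
# when it fails, and give up with a non-zero exit code after too many
# quick failures in a row.

# Both values are assumptions for this sketch and can be overridden
# through the environment.
MAX_QUICK_FAILURES="${MAX_QUICK_FAILURES:-5}"
QUICK_FAILURE_SECS="${QUICK_FAILURE_SECS:-10}"

run_with_restart_limit() {
    failures=0
    while :; do
        # date +%s is available in GNU and BSD date.
        start=$(date +%s)
        "$@"
        status=$?

        # A clean exit ends the loop successfully.
        if [ "$status" -eq 0 ]; then
            return 0
        fi

        elapsed=$(( $(date +%s) - start ))
        if [ "$elapsed" -lt "$QUICK_FAILURE_SECS" ]; then
            # The command died shortly after starting: extend the streak.
            failures=$((failures + 1))
        else
            # The command ran for a while before failing: reset the streak.
            failures=0
        fi

        if [ "$failures" -ge "$MAX_QUICK_FAILURES" ]; then
            echo "giving up after $failures quick failures in a row" >&2
            # Non-zero, so calling scripts and supervisors can see the failure.
            return 1
        fi
    done
}
```

With something like this, a calling script could distinguish "gave up after repeated quick failures" from a normal shutdown by checking $? after run_with_restart_limit returns.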