jhuckaby/Cronicle

Support performing an action when a job is aborted

Opened this issue · 5 comments

Summary

Support performing an action when a job is aborted.

Steps to reproduce the problem

When a job is aborted, there are cases where additional actions need to be taken beyond just terminating tasks. I ran into this with a job that runs a ZFS scrub with the -w option, which waits for the asynchronous scrub to finish before returning. When I abort the job, the zpool scrub command terminates, but the scrub itself is still running. To actually stop the scrub, one must issue an additional command (zpool scrub -s) just for this purpose.

There may also be jobs that need additional cleanup or recovery actions when they are voluntarily aborted.

Your Setup

Just a single server.

Operating system and version?

Linux Mint 22

Node.js version?

v20.17.0

Cronicle software version?

Version 0.9.59

Are you using a multi-server setup, or just a single server?

Single

Are you using the filesystem as back-end storage, or S3/Couchbase?

Filesystem

Aborting a job will send a SIGTERM to the outermost process. Can you use the Shell Plugin and provide a shell wrapper that traps SIGTERM and acts on it? Example:

#!/bin/bash

# Define a function to handle the SIGTERM signal
cleanup() {
    echo "Caught SIGTERM signal. Running cleanup..."
    # Add your custom cleanup commands here
    # For example, stop services, clean temporary files, etc.
    echo "Cleanup done."
    exit 0
}

# Trap SIGTERM signal
trap cleanup SIGTERM

# Run your job here
/path/to/my/script.sh

That works! Thanks for the advice.

EDIT: Oops. See below.

Sorry. I spoke too soon. It doesn't actually cancel the ZFS scrub. Here is the job log after I aborted the job:

# Job ID: jm4vyxval0b
# Event Title: Scrub fivebays
# Hostname: foghorn
# Date/Time: 2024/12/19 18:44:38 (GMT-5)

+ trap cleanup SIGTERM
+ zpool scrub -w fivebays
Caught SIGTERM, killing child: 511024
Child did not exit, killing harder: 511024

# Job failed at 2024/12/19 18:45:12 (GMT-5).
# Error: Job Aborted: Manually aborted by user: admin
# End of log.

Here is the script:

#!/bin/bash
set -x

# Function to handle the SIGTERM signal (when job is aborted)
cleanup() {
    echo "Caught SIGTERM signal. Cancelling scrub of fivebays."
    zpool scrub -s fivebays
    exit $?
}

trap cleanup SIGTERM

zpool scrub -w fivebays

I notice that the SIGTERM message in the log is not the same as the one in the script. Is something preempting it?
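A bash signal-handling detail may also be relevant here (this is an assumption about the script above, not something confirmed in the thread): bash defers running a trap until the current foreground command finishes, so a SIGTERM trap will not fire while `zpool scrub -w` is still running in the foreground. Running the long command in the background and waiting on it with the interruptible `wait` builtin lets the trap run promptly. A minimal sketch, with `sleep` standing in for the real scrub command and a self-sent SIGTERM simulating the abort:

```shell
#!/bin/bash

# Demo of the trap-plus-wait pattern. Bash defers trap execution while a
# foreground command runs, but the `wait` builtin is interruptible: a
# trapped signal makes `wait` return right away so the handler can act.
aborted=0
cleanup() {
    echo "Caught SIGTERM signal. Cancelling scrub."
    # In the real job, `zpool scrub -s fivebays` would go here.
    aborted=1
}
trap cleanup SIGTERM

# Simulate Cronicle's abort: send ourselves SIGTERM after one second.
( sleep 1; kill -TERM $$ ) &

# `sleep 30` stands in for the long-running `zpool scrub -w fivebays`.
sleep 30 &
child=$!
wait "$child"               # returns as soon as SIGTERM is trapped
kill "$child" 2>/dev/null   # tidy up the stand-in command
echo "aborted=$aborted"
```

In a real job script the handler would issue the cleanup command and `exit`; the flag variable here is only so the demo can show the handler ran before the stand-in command finished.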

Ah, I think I see the problem:

Caught SIGTERM, killing child: 511024
Child did not exit, killing harder: 511024

So, Cronicle gives the child 10 seconds to shut down after sending the SIGTERM. If the child does not exit within that window, Cronicle sends a SIGKILL (which cannot be trapped).

You can increase the timeout in the configuration here: https://github.com/jhuckaby/Cronicle/blob/master/docs/Configuration.md#child_kill_timeout
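For reference, a sketch of what that might look like in Cronicle's config.json (only the relevant key shown; the real file contains many other settings, and the value of 60 seconds is just an illustrative choice, not a recommendation):

```json
{
  "child_kill_timeout": 60
}
```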