Stouts/Stouts.wale

Problematic use of locking

Closed this issue · 1 comment

I'd like to bring up an issue raised by Ansible Galaxy user jbgo, in their Galaxy rating of Stouts.wale.

The relevant part of their comment:

The reason is the file locking mechanism in the wale.sh wrapper script causes WAL segments to be skipped when performing a base backup. The missing WAL segments will prevent the standby server from recovering properly. Here is the problematic lock mechanism:

[ -f $LOCKFILE ] && return 0 || return 1
If you're lucky enough to have configured a large enough wal_keep_segments, you can manually wal-push the missing segments and complete a successful restore before postgresql deletes them. I was able to fix this by using wal-push directly for the archive_command instead of the wal-e.sh wrapper script.
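To make the failure mode concrete, here is a minimal, hypothetical reduction of the guard (the lock path and the wal-push call are stand-ins for the real wrapper, not its actual code). PostgreSQL treats a zero exit status from archive_command as "segment archived", so exiting 0 while the lock is held means the segment is recycled without ever being pushed:

```shell
#!/bin/sh
# Hypothetical sketch: an archive_command wrapper that exits 0
# while a lock file exists, without pushing the segment.
LOCKFILE=/tmp/wale.lock
touch "$LOCKFILE"    # simulate a concurrent base backup holding the lock

archive() {
    [ -f "$LOCKFILE" ] && return 0 || return 1   # the problematic guard
    wal-e wal-push "$1"                          # never reached while locked
}

archive 000000010000000000000001 \
    && echo "PostgreSQL thinks this segment was archived"
```

With the lock file present, the function returns 0 before the push ever runs, so PostgreSQL marks the segment as archived and is free to recycle it.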

I can understand why it is useful to guard the base backup push command with a lock, as this command is typically executed via cron, and this behaviour prevents simultaneous long-running base backups executed by cron (e.g. base backup every day, but the previous day's backup is still running). However, I can't understand why guarding the wal-push/wal-fetch commands with a lock is necessary. And, as pointed out by jbgo, this even seems like it could lead to backups not being executed correctly.

I would also suggest that the current locking mechanism of creating and deleting a lock file is inadequate. If the process dies, the lock file remains, and backups stop. Using flock to take a lock on a lockfile would resolve this, since the kernel releases the lock automatically when the process exits, however it exits. e.g.

# Re-exec the script under flock unless we already hold the lock;
# the $flock environment variable prevents re-executing forever.
if [ -z "$flock" ] ; then
    lockopts="-w 0 $LOCKFILE"
    exec env flock=1 flock $lockopts "$0" "$@"
fi

The -w 0 option causes flock to fail immediately, with a non-zero exit status, if the lock is already held, rather than blocking. This also moves the lock guard to a single place in the script.
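As a quick sanity check, the -w 0 behaviour can be demonstrated directly (hypothetical lock path; assumes flock from util-linux is installed):

```shell
#!/bin/sh
# Hold the lock in a background process, then try to take it again.
LOCK=/tmp/flock-demo.lock
flock -w 0 "$LOCK" sleep 2 &
sleep 1                        # give the background flock time to acquire
if flock -w 0 "$LOCK" true; then
    echo "acquired lock"
else
    echo "lock held, exiting"  # what the wrapper would do under contention
fi
wait
```

The second flock invocation exits immediately with a failure status while the background process holds the lock, and succeeds again once that process has finished.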

I'd be happy to submit a pull request with changes to this effect, if there's agreement on this in principle.

@klen any feedback on this?