zertrin/duplicity-backup.sh

Detect and handle stale locks


As a system administrator, I want to be able to trust my backup mechanism to run even if it failed once, without having to check it manually every time.

Observed behavior

  • Backup script runs
  • Script gets killed for whatever reason and is unable to remove its lock
  • Subsequent backups fail with "lock held by XXXX"
  • No notification is sent to the Slack channel, although it is configured

Expected behavior

  • The stale lock is detected because the registered PID is no longer running
  • Stale lock is removed, backup process is continued

OR

  • Lock is detected
  • Deadlock (PID no longer running) is detected
  • Slack channel gets notification about deadlock
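
A minimal sketch of the first option (detect the stale lock, remove it, continue), assuming the lock file contains nothing but the owner's PID, as the "lock held by 3124" line in the log below suggests. The function names `lock_is_stale` and `acquire_lock` are made up for illustration and are not taken from the script.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: the lock file holds a single line with the
# PID of the process that created it.

# Return 0 if the lock file exists but its recorded PID is not running.
lock_is_stale() {
    local lockfile="$1" pid
    [ -f "$lockfile" ] || return 1          # no lock file, nothing is stale
    pid="$(head -n 1 "$lockfile" 2>/dev/null)"
    case "$pid" in
        ''|*[!0-9]*) return 0 ;;            # empty or non-numeric: treat as stale
    esac
    # kill -0 delivers no signal; it only reports whether the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        return 1                            # owner is still alive
    fi
    return 0
}

# Remove a stale lock if there is one, then try to take the lock.
# Returns non-zero if a live process still holds it.
acquire_lock() {
    local lockfile="$1"
    if lock_is_stale "$lockfile"; then
        echo "removing stale lock ${lockfile}" >&2
        rm -f "$lockfile"
    fi
    # With noclobber set, the redirection fails if the file already exists.
    ( set -o noclobber; echo "$$" > "$lockfile" ) 2>/dev/null
}
```

Two caveats: `kill -0` also fails when the process exists but belongs to another user, so this only works reliably when every backup runs as the same user (root in the log below), and a PID can be reused by an unrelated process after a reboot. There is also a small window between the staleness check and the `rm` in which two concurrent runs could race.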

Logs

# cat /var/log/duplicity/duplicity-2016-05-15_01-12.txt
--------    START DUPLICITY-BACKUP SCRIPT for docker01   --------

Attempting to acquire lock /var/log/duplicity/backup.lock
lock failed, could not acquire /var/log/duplicity/backup.lock
lock held by 3124
# ps aux | grep 3124
root      7661  0.0  0.0  11712   668 pts/3    S+   02:10   0:00 grep --color=auto 3124
#

You're right, this would be a very useful enhancement. Not sure I will be able to look into implementing it soon. Anyone feel free to propose a pull request before I do 😉

Regarding the second expected behavior, the script does send an email but not the other notification methods (e.g. slack). @zertrin, maybe we should simply add send_notification next to email_logfile at https://github.com/zertrin/duplicity-backup/blob/b92d60f028dffb94dc3aff2cd674dce4d5a9f48c/duplicity-backup.sh#L436?
Actually, there are 10 occurrences of exit in the script; maybe they should be replaced by some notification-sending function (at least if the configuration was correct enough to set it up)?
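
Roughly what I have in mind, as a sketch rather than a patch. `notify_and_exit` is a hypothetical name, and `email_logfile` and `send_notification` are stubbed here only so the snippet runs on its own; the real ones are in duplicity-backup.sh.

```shell
#!/usr/bin/env bash
# Stand-ins for the script's own functions (assumption: both can be
# called without arguments, as they are today).
email_logfile()     { echo "email_logfile called"; }
send_notification() { echo "send_notification called"; }

# Hypothetical helper: run every notifier, then exit with the given code.
notify_and_exit() {
    local code="${1:-1}"
    email_logfile     || true   # a failing notifier must not change the exit code
    send_notification || true
    exit "$code"
}
```

An error path such as the failed lock acquisition would then call `notify_and_exit 2` instead of `email_logfile` followed by `exit 2`.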

I fully agree. I'll look into this sometime soon, since that's easier.

@zertrin

I did just what @jarondl suggested above and nothing more. I have two enhancements in mind:

  1. Figure out a way for the notification to carry a message that identifies the error (in this case, the stale lock)
  2. Handling the stale lock, as suggested by @Luzifer, would be nice

I left those two for later. However, regarding item 1, I haven't figured out the best way to do this. I think it may require refactoring send_notification to accept an optional parameter. Any thoughts?
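
One possible shape for that optional parameter, purely as a sketch: the helper `build_notification_message`, the variable `HOST_LABEL`, and the message wording are all invented here, and the `echo` stands in for whatever delivery code send_notification really contains.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: only the optional-argument handling is shown.
HOST_LABEL="docker01"   # placeholder, borrowed from the log in this issue

build_notification_message() {
    local detail="${1:-}"                   # empty when no argument is given
    local msg="duplicity-backup on ${HOST_LABEL}"
    if [ -n "$detail" ]; then
        msg="${msg}: ${detail}"
    fi
    printf '%s\n' "$msg"
}

send_notification() {
    # Existing callers pass nothing and get the generic text;
    # new callers can pass a reason as the first argument.
    local message
    message="$(build_notification_message "$@")"
    echo "$message"     # stand-in for the real Slack or e-mail delivery
}
```

With this, current call sites keep working unchanged, while the lock failure path could call `send_notification "stale lock held by 3124"`.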

How do you deal with rebooting the server you're backing up? Each time I do, it's halfway through a backup, so the backup never starts again because the lockfile still exists.

It doesn't happen to me since my backup doesn't last that long and I'm never rebooting around the time where my backup is running.

Locking mechanisms are hard to get right and can be annoying. I still haven't found the time to implement a solution, but I welcome contributions that aim at doing locking "the right way" (probably with a PID check somewhere).