khenderick/zfs-snap-manager

Run without daemonizing and ignoring the "time" settings in the config

mmeidlinger opened this issue · 10 comments

Hello.

First of all thank you for sharing your scripts!

I would like to have more fine grained control over the temporal control flow of the backup process. In my particular use case, a backup server needs to be brought online before the push based replication can take place. I think you could solve this with pull-based backups, but the backup server is located at a remote place and less trustworthy than my main server, so I'm hesitant to allow the backup server to access the main server.

I have a working script for booting and unlocking the backup server, and initially I thought I could run it as a "preexec" command. However, I then realized that preexec runs once per dataset, whereas my script only needs to run once, before all datasets are replicated.

The only option I see with the current implementation is to trigger the backup manually or try to align times. However, this way, I end up with 2 control flows: The one of zfs-snap-manager and the one of my script that brings up and shuts down the server. Since those two things have to be in sync, I do not really like the idea of that.

I'd much rather handle timing in my own script / systemd service and just run a command to snapshot and replicate everything as specified in /etc/zfssnapmanager.cfg, ignoring the timing and daemon aspects of zfs-snap-manager. I've had a look at your scripts and thought that maybe defining an execute function around the if statement starting at

if execute is True:

would allow access to this functionality without all the daemonization aspects and restrictions?
@khenderick What is your opinion on that?

You could use preexec with a script that first checks whether your backup server is online before starting it.

@SlothOfAnarchy: I considered the approach you proposed and I agree that booting the server could be done that way. The boot process is, however, quite lengthy (about 5 min; it's also a little involved since I use expect scripts to unlock encrypted disks during startup, etc., I'll spare you the details), so there is the possibility of race conditions: one dataset triggers the boot script, then the next dataset checks whether the server is online and tries to run the script again, even though a boot has already been triggered but has not finished yet.

There is also the problem of defining a "clean" and lightweight way to determine that all backups have been replicated and the server can be brought down again.

With the feature I requested above, you could do all of that sequentially by scripting the temporal logic yourself and using zfs-snap-manager only for the actual replication. This way, very complex use cases can be accommodated and one can be sure that things happen in the intended order. Since all the functionality is there, I guess it would only amount to some refactoring of the code and directly exposing this functionality to the user (single run, ignore time setting).

I'm also willing to implement those changes myself once we have a clear idea on how to do it.

Unfortunately, I didn't receive feedback from the maintainer, so I went ahead and implemented the change as proposed in #32 (comment) in mmeidlinger@69f8b58
I'm currently testing the changes to make sure nothing got broken.

From a user perspective, everything should be unchanged, except that in addition to manager.py start|stop|restart you can also instruct it to do a single run, ignoring the time settings, via manager.py single-run.
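A minimal sketch of the sequential control flow this enables: boot the target, replicate once, shut the target down. The two shell script paths and the location of manager.py are hypothetical placeholders, not part of zfs-snap-manager; only the single-run subcommand comes from the change described above.

```python
import subprocess

# Hypothetical paths: adjust to wherever your own scripts and manager.py live.
BOOT_CMD = ["/usr/local/bin/boot-backup-server.sh"]
SINGLE_RUN_CMD = ["python", "/opt/zfs-snap-manager/scripts/manager.py", "single-run"]
SHUTDOWN_CMD = ["/usr/local/bin/shutdown-backup-server.sh"]


def run_backup(run=subprocess.check_call):
    """Boot the target, replicate once, and always shut the target down again."""
    run(BOOT_CMD)  # assumed to block until the backup server is up and unlocked
    try:
        run(SINGLE_RUN_CMD)  # snapshot and replicate, ignoring the time settings
    finally:
        run(SHUTDOWN_CMD)  # runs even if replication raised an error
```

Calling run_backup() from a cron job or a systemd timer keeps all timing in one place. The run parameter only exists so the sequence can be exercised without executing real commands.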

Once I have tested sufficiently, I will create a pull request. I'd be glad if somebody else would also have a look at the code changes and test them beforehand.

Regards

One remaining issue is how to handle the time setting, since currently, it is a mandatory config option. I see two options here:

  1. Making time an optional argument. The daemonized process would then simply skip datasets that do not have time configured. The single-run execution would ignore time settings if present.
  2. Defining time = manual for datasets that are only to be considered by a single-run execution.

Other than style, I can only foresee one use case where the difference would matter: if one were interested in mixing both daemonized and single-run functionality, i.e. running the daemonized service while occasionally triggering a manual single-run. In that case, time = manual would conflict with any time setting for the daemonized process. I personally can't really imagine a use case where mixing would be useful, but maybe one of you can?

So option 1. is more general, while option 2. is more explicit and concise. Does anyone see other possibilities? Any opinions on that matter?

EDIT: 2. also might be useful if single-run should only apply to a subset of datasets (those marked with time = manual) and not to all of them. Another possibility would be to introduce an additional config item.

One last attempt to get the attention of @khenderick: If I do not receive any feedback from you within the next 2 weeks, I'm moving forward with a fork with whatever solution fits me best. You seem to be quite active on GitHub, so I do not really understand that you are unable to reply within 2 months. If you are not interested in the feature or a pull request, that's fine as well, it would just be nice to know.

Hi there,

I'm sorry to be quite unresponsive at this moment. My fulltime job (my main github activity) and personal life (kids and stuff) take up quite some time and don't leave much time for side projects.

Trying to answer your question; there's the option to configure time = trigger where the process will wait for a .trigger file on the filesystem that once present will trigger the backup. Do you have the possibility to have the backup server (or the script that turns on the backup server) touch that file?
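Touching the trigger from a control script could look roughly like this. It is only a sketch: it assumes the .trigger file is expected at the root of the dataset's mountpoint, and request_backup is a hypothetical helper name, not something zfs-snap-manager provides.

```python
from pathlib import Path


def request_backup(mountpoint):
    """Create the .trigger file for one dataset (location is an assumption)."""
    trigger = Path(mountpoint) / ".trigger"
    trigger.touch()  # creates an empty file, or updates its mtime if it exists
    return trigger
```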

In any case, feel free to open up a pull request, I'll try to merge them in as soon as possible. In any case, thanks for getting involved.

Hi, thanks for your response! I'm glad you still found the time; keeping things together is in everyone's best interest, I believe :)

I know about the trigger feature, but it's suboptimal for my use case for 2 reasons:

  1. You still have multiple control flows and a hard time figuring out when all replication finished to shut down the slave replication target.
  2. One essentially would need to maintain a second list of datasets and mountpoints for the control script so that it knows where to place triggers. This is not really a major issue and could potentially be inferred from zfssnapmanager.cfg, but inconsistencies might prevent datasets from being backed up.

Regarding the time setting I was wondering about in my previous comment: I believe the cleanest way is to introduce an additional config item single-run. If single-run = True, the dataset is considered for non-daemonized invocation, else it's ignored. This way, it's both explicit and does not conflict with any time setting. Also, I'd propose to make time optional and throw an error only if neither time nor single-run is configured.
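The proposed rules can be sketched as follows, assuming the config is an INI-style file with one section per dataset. The dataset names and the select_datasets helper are made up for illustration, and this is not the code in the fork.

```python
import configparser

# Illustrative config only; the dataset names are invented.
SAMPLE = """
[tank/data]
time = 21:00

[tank/media]
single-run = True

[tank/home]
time = trigger
single-run = True
"""


def select_datasets(parser, single_run):
    """Return the datasets a daemonized or a single-run invocation would handle."""
    selected = []
    for name in parser.sections():
        section = parser[name]
        has_time = "time" in section
        wants_single = section.getboolean("single-run", fallback=False)
        if not has_time and not wants_single:
            # Proposed rule: at least one of the two must be configured.
            raise ValueError("%s: neither time nor single-run is configured" % name)
        if (single_run and wants_single) or (not single_run and has_time):
            selected.append(name)
    return selected


parser = configparser.ConfigParser()
parser.read_string(SAMPLE)
daemon_datasets = select_datasets(parser, single_run=False)
single_run_datasets = select_datasets(parser, single_run=True)
```

With the sample above, the daemon would handle tank/data and tank/home, and a single run would handle tank/media and tank/home, so both modes can coexist on the same dataset.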

I see what you mean. The single-run approach makes sense, however, since the time parameter already can have "special" values like trigger, it might be a tad cleaner to e.g. allow time = single-run, time = external, time = execute-cli or similar.

I quickly looked at your fork, and this approach indeed makes a lot of sense. As a suggestion it might be interesting to use Manager.start(...) in both cases, but add an optional single_run=True (defaults to False) to the start method, and forward this to the run method. Then the looping through the datasets could be shared.
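Roughly, the suggestion amounts to something like the sketch below. This Manager is a simplified stand-in, not the real class in manager.py: the constructor, the process and is_due callables, and the interval parameter are assumptions made so the example is self-contained.

```python
import time


class Manager:
    """Simplified stand-in for illustration; the real class differs."""

    def __init__(self, datasets, process, is_due):
        self.datasets = datasets
        self.process = process  # callable: snapshot/replicate one dataset
        self.is_due = is_due    # callable: the usual "time" check for a dataset

    def run(self, single_run=False):
        # One shared loop; single_run simply bypasses the time check.
        for dataset in self.datasets:
            if single_run or self.is_due(dataset):
                self.process(dataset)

    def start(self, single_run=False, interval=60):
        if single_run:
            self.run(single_run=True)  # one pass, then return to the caller
            return
        while True:  # daemon-style loop
            self.run()
            time.sleep(interval)
```

Because the flag is only forwarded from start to run, the dataset loop exists once and both entry points stay in sync.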

PR #34 pending, closing as soon as merged.

I'm closing this issue as I unfortunately don't have the time anymore to maintain ZFS Snapshot Manager. I would like to point you towards the awesome tool zrepl which has nearly all of ZFS Snapshot Manager's features and much more.

In any case, thanks for being a part of this community.