An efficient unidirectional remote backup daemon for CephFS.

Primary language: C++. License: GNU Lesser General Public License v2.1 (LGPL-2.1).

cephgeorep

Ceph File System Remote Sync Daemon
For use with a distributed Ceph File System cluster to georeplicate files to a remote backup server.
This daemon takes advantage of Ceph's rctime (recursive ctime) directory attribute, which records the most recent change time among everything below a given directory tree node. Using this attribute, the daemon recurses only into directory tree branches that contain modified files, instead of wasting time accessing every branch.
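
The pruning idea can be illustrated with a minimal sketch over an in-memory tree. This is not taken from the cephgeorep source: the `Node` struct, the `collect_changed` function, and the integer timestamps are hypothetical stand-ins, and a real implementation would read the rctime attribute from the mounted file system instead.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a CephFS tree. For a directory,
// `time` plays the role of rctime (the newest change anywhere below it);
// for a file it is the file's own change time.
struct Node {
    std::string name;
    std::int64_t time;
    bool is_dir;
    std::vector<Node> children;
};

// Append to `out` the path of every file changed after `last_sync`,
// descending only into directories whose rctime is newer than `last_sync`.
void collect_changed(const Node& node, const std::string& prefix,
                     std::int64_t last_sync, std::vector<std::string>& out) {
    assert(node.is_dir || node.children.empty());  // files have no children
    if (node.time <= last_sync) return;            // prune: nothing new here
    const std::string path = prefix.empty() ? node.name : prefix + "/" + node.name;
    if (!node.is_dir) {
        out.push_back(path);
        return;
    }
    for (const Node& child : node.children)
        collect_changed(child, path, last_sync, out);
}
```

A directory whose timestamp is not newer than the last sync is skipped without visiting any of its children, which is where the time saving comes from on large trees.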

Prerequisites

You must have a Ceph file system. rsync, scp, or a similar transfer program must be installed on both the local system and the remote backup server. You must also set up passwordless SSH from the sender (local) to the receiver (remote backup) with a public/private key pair so that rsync can send files without prompting for a password. Compiling from source requires the Boost development libraries. The provided binary is statically linked, so the server does not need Boost installed to run the daemon.

Quick Start

  • Install
  • Initialize configuration file: cephfssyncd -d (This can be skipped if you installed from .rpm or .deb)
  • Edit according to Configuration: vim /etc/ceph/cephfssyncd.conf
  • Verify settings with dry run before seeding: cephfssyncd -s -d
  • Set up passwordless SSH from the sender to the receiver
  • Enable daemon: systemctl enable --now cephfssyncd

Installation

Current Release

CentOS 7

  • yum install https://github.com/45Drives/cephgeorep/releases/download/1.2.13/cephgeorep-1.2.13-3.el7.x86_64.rpm

CentOS 8

  • yum install https://github.com/45Drives/cephgeorep/releases/download/1.2.13/cephgeorep-1.2.13-3.el8.x86_64.rpm

Ubuntu 20.04

  • wget https://github.com/45Drives/cephgeorep/releases/download/1.2.13/cephgeorep_1.2.13-3focal_amd64.deb
  • apt install ./cephgeorep_1.2.13-3focal_amd64.deb

Ubuntu 18.04

  • wget https://github.com/45Drives/cephgeorep/releases/download/1.2.13/cephgeorep_1.2.13-3bionic_amd64.deb
  • apt install ./cephgeorep_1.2.13-3bionic_amd64.deb

Installing from Source

  • Install Boost (libboost-dev) and Threading Building Blocks (libtbb-dev) development libraries
  • git clone https://github.com/45drives/cephgeorep
  • cd cephgeorep
  • git checkout tags/1.2.13
  • make -j8 or make -j8 static to statically link libraries
  • sudo make install

Uninstalling from Source

  • In the same directory as the makefile: sudo make uninstall

Configuration

Default config file generated by the daemon (/etc/ceph/cephfssyncd.conf):

# local backup settings
Source Directory =            # path to the ceph directory you want backed up
Ignore Hidden = false         # ignore files beginning with "."
Ignore Windows Lock = true    # ignore files beginning with "~$"
Ignore Vim Swap = true        # ignore vim .swp files (.<filename>.swp)

# remote settings
Destination =                 # one or more backup targets (failover only)
# list of destinations can be space or comma separated and Destination can be
# defined multiple times to append more failover targets.
# Destination format: [[user@]host:][path]
# Destination = root@backup-gw1:/tank/backup,root@backup-gw2:/tank/backup

# daemon settings
Exec = rsync                  # program to use for syncing - rsync or scp
Flags = -a --relative         # execution flags for above program (space delim)
Metadata Directory = /var/lib/cephgeorep/ # put metadata on the ceph cluster if
                                          # you want to use pacemaker with
                                          # redundant gateways
Sync Period = 10              # time in seconds between checks for changes
Propagation Delay = 100       # time in milliseconds between snapshot and sync
Processes = 4                 # number of parallel sync processes to launch
Threads = 8                   # number of worker threads to search for files
Log Level = 1
# 0 = minimum logging
# 1 = basic logging
# 2 = debug logging
# Propagation Delay is to account for the limit that Ceph can
# propagate the modification time of a file all the way back to
# the root of the sync directory.
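
To make the three `Ignore ...` settings concrete, here is a minimal sketch of a file-name filter that follows the comments in the configuration file above. The `IgnoreRules` struct and the `is_ignored` function are hypothetical names chosen for illustration, and the daemon's actual matching logic may differ.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the three "Ignore ..." settings and their defaults.
struct IgnoreRules {
    bool hidden = false;        // Ignore Hidden
    bool windows_lock = true;   // Ignore Windows Lock
    bool vim_swap = true;       // Ignore Vim Swap
};

// True if `s` ends with `suffix`.
bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Decide whether a file name (without its directory part) should be skipped.
bool is_ignored(const std::string& name, const IgnoreRules& rules) {
    assert(name.find('/') == std::string::npos);  // expects a bare file name
    const bool starts_with_dot = !name.empty() && name[0] == '.';
    // Vim swap files look like .<filename>.swp, so at least 6 characters.
    if (rules.vim_swap && starts_with_dot && name.size() > 5 &&
        ends_with(name, ".swp"))
        return true;
    // Windows lock files begin with "~$".
    if (rules.windows_lock && name.compare(0, 2, "~$") == 0)
        return true;
    // Hidden files begin with ".".
    if (rules.hidden && starts_with_dot)
        return true;
    return false;
}
```

With the default rules, `.notes.txt.swp` and `~$report.docx` are skipped while `.bashrc` is still synced; setting `hidden` to true skips `.bashrc` as well.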

You can also specify a different config file with the command line argument -c or --config, e.g. cephfssyncd -c /alternate/path/to/config.conf. If you plan to run multiple instances of cephfssyncd with different config files, be sure to give each config a unique path for Metadata Directory.
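
The Destination syntax described in the configuration comments (a space or comma separated list of `[[user@]host:][path]` entries) can be sketched as follows. The `Destination` struct and both function names are hypothetical, and the sketch deliberately ignores corner cases such as paths that contain a colon or IPv6 addresses.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical parsed form of one [[user@]host:][path] entry.
struct Destination {
    std::string user;  // empty if not given
    std::string host;  // empty for a path-only entry
    std::string path;  // may be empty
};

// Split a Destination value on spaces, tabs and commas, skipping empty pieces.
std::vector<std::string> split_destinations(const std::string& value) {
    std::vector<std::string> parts;
    std::string current;
    for (char c : value) {
        if (c == ' ' || c == '\t' || c == ',') {
            if (!current.empty()) parts.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty()) parts.push_back(current);
    return parts;
}

// Parse one non-empty entry of the form [[user@]host:][path].
Destination parse_destination(const std::string& entry) {
    assert(!entry.empty());  // split_destinations never yields empty entries
    Destination d;
    const auto colon = entry.find(':');
    if (colon == std::string::npos) {  // no host part, path only
        d.path = entry;
        return d;
    }
    const std::string host_part = entry.substr(0, colon);
    d.path = entry.substr(colon + 1);
    const auto at = host_part.find('@');
    if (at == std::string::npos) {
        d.host = host_part;
    } else {
        d.user = host_part.substr(0, at);
        d.host = host_part.substr(at + 1);
    }
    return d;
}
```

Because additional Destination lines append failover targets, a caller could run `split_destinations` on each line and concatenate the results in file order.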

* The Ceph file system has a propagation delay for the recursive ctime (rctime) to make its way from a changed file up to the top-level directory that contains it. To account for this delay in deep directory trees, there is a user-defined delay to ensure no files are missed. Ceph's propagation delay was greatly reduced in the Nautilus release, so a user-defined delay of 100 ms is the new default. With this setting, the daemon synced 1000 files of 1 MB each, randomly placed within 3905 directories, without missing any. If you find that some files are being missed, try increasing this delay.
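
The role of Propagation Delay within one sync pass can be sketched as a pause between two steps. The function name and the two callbacks below are hypothetical placeholders for the snapshot and scan-and-sync steps; only the ordering and the wait are the point of the example.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical sequence for one sync pass: take the snapshot, wait for
// `delay` so rctime can reach the top-level directory, then scan and sync.
void snapshot_then_sync(const std::function<void()>& take_snapshot,
                        const std::function<void()>& scan_and_sync,
                        std::chrono::milliseconds delay) {
    assert(delay.count() >= 0);          // a negative delay makes no sense
    take_snapshot();
    std::this_thread::sleep_for(delay);  // blocks for at least `delay`
    scan_and_sync();
}
```

Raising the delay trades a slightly longer pass for a lower chance that a recent change deep in the tree is not yet visible at the top-level directory.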

Usage

Launch the daemon by running systemctl start cephfssyncd, and run systemctl enable cephfssyncd to launch it at startup. To monitor the daemon's output, run journalctl -u cephfssyncd -f.

Arguments and Ad Hoc Commands

cephfssyncd usage:

cephfssyncd Copyright (C) 2019-2021 Josh Boudreau <jboudreau@45drives.com>
This program is released under the GNU General Public License v2.1.
See <https://www.gnu.org/licenses/> for more details.

Usage:
  cephfssyncd [ flags ]
Flags:
  -c --config </path/to/config> - pass alternate config path
                                  default config: /etc/ceph/cephfssyncd.conf
  -d --dry-run                  - print total files that would be synced
                                  when combined with -v, files will be listed
                                  exits after showing number of files
  -h --help                     - print this message
  -n --nproc <# of processes>   - number of sync processes to run in parallel
  -o --oneshot                  - manually sync changes once and exit
  -q --quiet                    - set log level to 0
  -s --seed                     - send all files to seed destination
  -S --set-last-change          - prime last change time to only sync changes
                                  that occur after running with this flag.
  -t --threads <# of threads>   - number of worker threads to search for files
  -v --verbose                  - set log level to 2
  -V --version                  - print version and exit

Alternate configuration files can be specified using the -c --config flag, which is useful for running multiple instances of cephfssyncd on the same system. -n --nproc, -q --quiet, -t --threads and -v --verbose override options from the configuration file. -s --seed sends every file to the destination regardless of how old the file is. -d --dry-run runs the daemon without actually syncing any files, to give the user an idea of how many files would be synced in a real run. -d --dry-run combined with -v --verbose also lists all files that would be synced.
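
The override behaviour described above can be sketched as a function that starts from the values loaded from the configuration file and applies the four overriding flags on top. The `Settings` struct and `apply_overrides` are hypothetical names, and the real daemon's argument parser is not shown in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical subset of settings that the command line can override,
// initialised to the defaults from the generated configuration file.
struct Settings {
    int processes = 4;   // Processes
    int threads = 8;     // Threads
    int log_level = 1;   // Log Level
};

// Apply -n/--nproc, -t/--threads, -q/--quiet and -v/--verbose on top of
// the values from the configuration file. Other flags are ignored here.
Settings apply_overrides(Settings s, const std::vector<std::string>& args) {
    assert(s.log_level >= 0 && s.log_level <= 2);  // documented range
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& a = args[i];
        const bool has_value = i + 1 < args.size();
        if ((a == "-n" || a == "--nproc") && has_value) {
            s.processes = std::stoi(args[++i]);
        } else if ((a == "-t" || a == "--threads") && has_value) {
            s.threads = std::stoi(args[++i]);
        } else if (a == "-q" || a == "--quiet") {
            s.log_level = 0;
        } else if (a == "-v" || a == "--verbose") {
            s.log_level = 2;
        }
    }
    return s;
}
```

Settings that are not mentioned on the command line keep the value from the configuration file, so `-n 2 --verbose` changes only the process count and the log level.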

Usage with cron

To have cron take care of when syncing happens, make sure that the systemd service is disabled (systemctl disable --now cephfssyncd) and create a cron job entry to execute cephfssyncd --oneshot. This can also be done with systemd timers if the systemd unit file is modified to pass the --oneshot flag to cephfssyncd.
Cron example: sync every Sunday at 8 AM.

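# note: '%' must be escaped as '\%' inside a crontab entry; ts comes from the moreutils package
0 8 * * 0 stdbuf -i0 -o0 -e0 cephfssyncd --oneshot 2>&1 | ts '[\%F \%H:\%M:\%S]' >> /var/log/cephgeorep.log 2>&1
#         ^ unbuffer output  ^ call with oneshot   ^ merge stderr, pipe into ts  ^ append to log file       ^ redirect ts stderr too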

Usage with S3 Buckets

To back up to AWS S3 buckets, some special configuration is required. The wrapper script s3wrap.sh, included with the binary release, allows the daemon to work with s3cmd. Ensure s3cmd is installed and configured on your system, and use the following example configuration file as a starting point:

# local backup settings
Source Directory = /mnt/cephfs           # full path to directory to backup
Ignore Hidden = false         # ignore files beginning with "."
Ignore Windows Lock = true    # ignore files beginning with "~$"
Ignore Vim Swap = true        # ignore vim .swp files (.<filename>.swp)

# remote settings
# the following settings *must* be left blank for use with s3wrap.sh
Destination =

# daemon settings
Exec = /opt/45drives/cephgeorep/s3wrap.sh   # full path to s3wrap.sh
Flags = sync_1                              # place only the name of the s3 bucket here

# the rest of settings can remain as default ##########
Metadata Directory = /var/lib/cephfssync/
Sync Period = 10              # time in seconds between checks for changes
Propagation Delay = 100       # time in milliseconds between snapshot and sync
Processes = 1                 # number of parallel sync processes to launch
Threads = 8                   # number of worker threads to search for files
Log Level = 1

With this setup, cephfssyncd will call the s3cmd wrapper script, which in turn calls s3cmd put ... for each new file passed to it by cephfssyncd, maintaining the directory tree hierarchy.

Notes

  • Windows does not update the mtime attribute when a file is copied or dragged and dropped, so files moved into a shared folder will not sync if their Last Modified time is earlier than the most recent sync.
  • When the daemon is killed with SIGINT, SIGTERM, or SIGQUIT, it saves the last sync timestamp to disk in the directory specified in the configuration file to pick up where it left off on the next launch. If the daemon is killed with SIGKILL or if power is lost to the system causing an abrupt shutdown, the daemon will resync all files modified since the previously saved timestamp.
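
The shutdown behaviour in the second note can be sketched with a signal flag and a small timestamp file. All names here (`stop_requested`, `install_handlers`, `save_timestamp`, `load_timestamp`) and the one-number file layout are assumptions for illustration, not the daemon's actual code or on-disk format.

```cpp
#include <cassert>
#include <csignal>
#include <cstdint>
#include <fstream>
#include <string>

// Set from the signal handler; sig_atomic_t is safe to write there.
volatile std::sig_atomic_t stop_requested = 0;

extern "C" void request_stop(int) { stop_requested = 1; }

// Register the handler for the signals that allow a clean shutdown.
// SIGKILL cannot be caught, so an abrupt kill never reaches this path
// and the previously saved timestamp is what the next launch sees.
void install_handlers() {
    std::signal(SIGINT, request_stop);
    std::signal(SIGTERM, request_stop);
    std::signal(SIGQUIT, request_stop);
}

// Write the last sync timestamp to `path`, replacing any previous value.
bool save_timestamp(const std::string& path, std::int64_t ts) {
    assert(ts >= 0);
    std::ofstream out(path, std::ios::trunc);
    out << ts << '\n';
    return static_cast<bool>(out);
}

// Read the saved timestamp, or return 0 if there is no usable saved state.
std::int64_t load_timestamp(const std::string& path) {
    std::ifstream in(path);
    std::int64_t ts = 0;
    if (!(in >> ts)) return 0;
    return ts;
}
```

A main loop built on this would check `stop_requested` between sync passes and call `save_timestamp` before exiting, so the next launch can resume from `load_timestamp`.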
