ddebeau/zfs_uploader

Send stream could be sent compressed if the dataset is compressed

Erisa opened this issue · 4 comments

Erisa commented

If the dataset being backed up has its compression property set to anything other than off, the default behaviour of zfs send is to decompress on the fly and send the full uncompressed dataset.

By adding the -c (--compressed) flag to zfs send, the stream is instead sent compressed and takes up significantly less space on the remote. In my case this reduced a full backup of a PostgreSQL database from 56 GB to 24 GB.

I added this flag to my personal fork in Erisa@c192333 and noticed no regressions or repercussions. However, since users may not always have compression enabled on their dataset, or may not want this behaviour to change across versions, I believe the best way forward would be a zfs_uploader config variable that enables the compressed flag.
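For illustration, here's a rough sketch of how the gated flag could look (the function and parameter names here are made up, not zfs_uploader's actual API):

import subprocess

def open_send_stream(filesystem, snapshot_name, compressed=False):
    cmd = ['zfs', 'send']
    if compressed:
        # -c sends blocks as stored on disk, i.e. still compressed
        cmd.append('-c')
    cmd.append(f'{filesystem}@{snapshot_name}')
    return subprocess.Popen(cmd, stdout=subprocess.PIPE)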

ddebeau commented

Thanks for the interest! We set the raw flag -w with zfs send, which is equivalent to -Lec for unencrypted datasets. The raw flag is required for sending encrypted datasets.

https://openzfs.github.io/openzfs-docs/man/8/zfs-send.8.html#w

cmd = ['zfs', 'send', '-w', f'{filesystem}@{snapshot_name}']

I'm not sure why your dataset would be sent uncompressed. Which version of ZFS are you using?
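If it helps, the equivalence can be sanity-checked with dry runs; a rough sketch (assuming the --parsable output ends with a size line, as our size helpers assume):

import subprocess

def dry_run_size(flags, snapshot):
    # zfs send --dryrun --parsable reports the estimated stream size
    # as the last field of the final output line
    cmd = ['zfs', 'send', '--dryrun', '--parsable', *flags, snapshot]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return int(out.stdout.splitlines()[-1].split()[-1])

# On an unencrypted dataset these two should report the same size:
print(dry_run_size(['-w'], 'rpool/synapse@20220319_055300'))
print(dry_run_size(['-L', '-e', '-c'], 'rpool/synapse@20220319_055300'))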

Erisa commented

Interesting; I had looked at the raw flag but didn't quite realise it was supposed to do anything with compression on unencrypted datasets.

The ZFS version I'm using there is admittedly a little old, since it's from the Ubuntu 20.04 repos:

zfs-0.8.3-1ubuntu12.13
zfs-kmod-0.8.3-1ubuntu12.13

Perhaps a newer version would handle it better? The dataset in question is unencrypted and has compression=lz4.

When I tried without the -c flag, it reported the full uncompressed size:

time=2022-03-19T05:53:28.621 level=INFO filesystem=rpool/synapse snapshot_name=20220319_055300 s3_key=rpool/synapse/20220319_055300.full progress=1% speed="29 MBps" transferred="309/56196 MB" time_elapsed=0m

That size went down to 24837 MB after adding -c to the code, which lines up with the compressed size of the dataset at the time.
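For reference, I compared against the dataset properties; something like this (rpool/synapse is my dataset):

import subprocess

# 'used' reflects the on-disk (compressed) size, 'logicalused' the
# uncompressed size, and 'compressratio' their ratio
cmd = ['zfs', 'get', '-Hp', '-o', 'property,value',
       'used,logicalused,compressratio', 'rpool/synapse']
print(subprocess.run(cmd, capture_output=True, text=True).stdout)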

ddebeau commented

Could you check the file size in the S3 bucket? The file size should not change when adding -c.

I think the problem is that we're not setting -w when we're calculating the snapshot size:

def get_snapshot_send_size(filesystem, snapshot_name):
    cmd = ['zfs', 'send', '--parsable', '--dryrun',
           f'{filesystem}@{snapshot_name}']
    out = subprocess.run(cmd, **SUBPROCESS_KWARGS)
    return out.stdout.splitlines()[1].split()[1]

def get_snapshot_send_size_inc(filesystem, snapshot_name_1, snapshot_name_2):
    cmd = ['zfs', 'send', '--parsable', '--dryrun', '-i',
           f'{filesystem}@{snapshot_name_1}',
           f'{filesystem}@{snapshot_name_2}']
    out = subprocess.run(cmd, **SUBPROCESS_KWARGS)
    return out.stdout.splitlines()[1].split()[1]
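A minimal, untested sketch of the fix would be to pass -w in the dry runs as well, so the estimate matches the raw stream we actually send:

def get_snapshot_send_size(filesystem, snapshot_name):
    # -w makes the dry-run estimate match the raw (compressed) stream
    cmd = ['zfs', 'send', '--parsable', '--dryrun', '-w',
           f'{filesystem}@{snapshot_name}']
    out = subprocess.run(cmd, **SUBPROCESS_KWARGS)
    return out.stdout.splitlines()[1].split()[1]

def get_snapshot_send_size_inc(filesystem, snapshot_name_1, snapshot_name_2):
    # same change for the incremental estimate
    cmd = ['zfs', 'send', '--parsable', '--dryrun', '-w', '-i',
           f'{filesystem}@{snapshot_name_1}',
           f'{filesystem}@{snapshot_name_2}']
    out = subprocess.run(cmd, **SUBPROCESS_KWARGS)
    return out.stdout.splitlines()[1].split()[1]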

Erisa commented

You're right: in the past I was cancelling the job before it could actually finish uploading, which is why I never saw the distinction. My bad there.

Running it with my latest incremental snapshot and leaving it to complete shows that it is indeed only the estimate that's incorrect, and the compressed dataset is what's sent:

time=2022-03-26T15:04:01.669 level=INFO filesystem=rpool/synapse snapshot_name=20220326_150200 s3_key=rpool/synapse/20220326_150200.inc progress=44% speed="30 MBps" transferred="3613/8200 MB" time_elapsed=2m
time=2022-03-26T15:04:06.673 level=INFO filesystem=rpool/synapse snapshot_name=20220326_150200 s3_key=rpool/synapse/20220326_150200.inc progress=46% speed="30 MBps" transferred="3764/8200 MB" time_elapsed=2m
time=2022-03-26T15:04:10.644 level=INFO filesystem=rpool/synapse snapshot_name=20220326_150200 s3_key=rpool/synapse/20220326_150200.inc msg="Finished incremental backup."
time=2022-03-26T15:04:10.644 level=INFO filesystem=rpool/synapse msg="Finished job."

It estimated the size at 8200 MB, but the last progress report before finishing shows only 3764 MB transferred.

And the object in the S3 bucket is ~4 GB.