Copying container from one server in an encrypted pool to another in an encrypted pool fails after performing a second copy/--refresh
Required information
- Distribution: Ubuntu
- Distribution version: 22.04
- The output of "snap list --all lxd core20 core22 core24 snapd":
Name   Version      Rev    Tracking       Publisher   Notes
core20 20240705     2379   latest/stable  canonical✓  base,disabled
core20 20240911     2434   latest/stable  canonical✓  base
core22 20240904     1621   latest/stable  canonical✓  base,disabled
core22 20241001     1663   latest/stable  canonical✓  base
lxd    6.1-efad198  29943  latest/stable  canonical✓  disabled
lxd    6.1-78a3d8f  30130  latest/stable  canonical✓  -
snapd  2.62         21465  latest/stable  canonical✓  snapd,disabled
snapd  2.63         21759  latest/stable  canonical✓  snapd
- The output of "lxc info" or if that fails:
- Kernel version:
- LXC version:
- LXD version:
- Storage backend in use:
Issue description
A brief description of the problem. Should include what you were
attempting to do, what you did, what happened and what you expected to
see happen.
Using ZFS as the backend, copying a container from an encrypted ZFS pool on one server to an encrypted ZFS pool on a new server works, but doing the same operation with --refresh fails with an error and the destination server's container storage is lost. Default non-encrypted pools work OK. I am almost certain that I must be doing something wrong, otherwise others would be hitting this issue.
(Somewhat related: I have another server where I run a daily --refresh of my containers to an encrypted partition, which works fine. But after a while I get the same error (cannot receive new filesystem stream: zfs receive -F cannot be used to destroy an encrypted filesystem or overwrite an unencrypted one with an encrypted one), although in that case the storage is at least not lost on the destination server. I haven't been able to pinpoint exactly when it occurs, but it may have something to do with snapshots. That is, every night a snapshot is taken, then every night a --refresh to the destination runs, with snapshots auto-expiring. If the --refresh is not run for a while, the destination snapshots expire, and when the --refresh runs again it can't do the incremental properly. The only way to fix it then is a full copy from source to destination. But with this new server I have set up, I can't even get that far.)
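A quick way to check whether source and destination still share a common snapshot before running --refresh is to list the container's ZFS snapshots on both servers and compare; a diagnostic sketch, assuming the dataset path shown in the steps below:
# Run on both server A and server B, then compare the two lists:
sudo zfs list -t snapshot -o name -r rpool/lxd/encrypted/containers/c3
# LXD's own view of the container's snapshots and their expiry:
lxc info c3
# If the two lists no longer share any snapshot, an incremental --refresh has no
# common base and a full stream is needed instead.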
Steps to reproduce
- On server A:
Create a new container:
lxc launch ubuntu:24.04 c3 -s encpool
Creating c3
Starting c3
Copy to server B:
lxc stop c3
lxc copy c3 serverB: -s encpool
[success, no errors returned]
- On server B:
sudo zfs list|grep c3
rpool/lxd/encrypted/containers/c3 659M 668G 659M legacy
- On server A:
lxc copy c3 serverB: -s encpool --refresh
Error: Failed instance creation: Error transferring instance data: Failed migration on target: Failed creating instance on target: Failed receiving volume "c3": Problem with zfs receive: ([exit status 1 write |1: broken pipe]) cannot receive new filesystem stream: zfs receive -F cannot be used to destroy an encrypted filesystem or overwrite an unencrypted one with an encrypted one
- On server B:
At this stage, the container c3 has lost its storage:
sudo zfs list|grep c3
[no output returned]
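To see what, if anything, is left under the pool on server B after the failed refresh, and where each dataset gets its encryption from, something like the following helps (a diagnostic sketch using standard ZFS properties):
# Everything (filesystems and snapshots) still present under the LXD containers parent:
sudo zfs list -r -t all rpool/lxd/encrypted/containers
# Encryption source and key state for each dataset in the encrypted tree:
sudo zfs get -r encryption,encryptionroot,keystatus rpool/lxd/encrypted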
lxc info --show-log c3
Name: c3
Status: STOPPED
Type: container
Architecture: x86_64
Created: 2024/11/04 19:25 AEDT
Log:
zfs get encryption rpool/lxd/encrypted
NAME PROPERTY VALUE SOURCE
rpool/lxd/encrypted encryption aes-256-gcm -
(same output on both servers)
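For what it's worth, the restriction quoted in the error can be hit with plain ZFS commands, outside LXD. A minimal sketch, assuming a scratch encrypted parent dataset (names are made up for illustration and this exact sequence has not been verified on these versions):
# Encrypted parent plus a source filesystem (prompts for a passphrase):
sudo zfs create -o encryption=aes-256-gcm -o keyformat=passphrase rpool/enctest
sudo zfs create rpool/enctest/src
sudo zfs snapshot rpool/enctest/src@one
# An initial full (non-raw) send into the encrypted parent works; the new dataset
# inherits encryption from rpool/enctest:
sudo sh -c 'zfs send rpool/enctest/src@one | zfs receive rpool/enctest/dst'
# Receiving a second full stream with -F over the now-encrypted destination
# should fail with the same "zfs receive -F cannot be used to destroy an
# encrypted filesystem" message:
sudo zfs snapshot rpool/enctest/src@two
sudo sh -c 'zfs send rpool/enctest/src@two | zfs receive -F rpool/enctest/dst'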
lxc config show c3 --expanded
architecture: x86_64
config:
  image.architecture: amd64
  image.description: ubuntu 24.04 LTS amd64 (release) (20241004)
  image.label: release
  image.os: ubuntu
  image.release: noble
  image.serial: "20241004"
  image.type: squashfs
  image.version: "24.04"
  volatile.apply_template: copy
  volatile.base_image: 74957a5580288913be8a8727d121f16616805e3183629133029ca907f210f541
  volatile.cloud-init.instance-id: d264bee1-2e03-4030-a875-19b25e4a2a49
  volatile.eth0.hwaddr: 00:16:3e:9e:a7:f7
  volatile.idmap.base: "0"
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.uuid: a89a025e-171b-47c2-bcf7-253e91280e9e
  volatile.uuid.generation: a89a025e-171b-47c2-bcf7-253e91280e9e
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: encpool
    type: disk
ephemeral: false
profiles:
- default
stateful: false
description: ""
Attachments: lxc-monitor.txt, lxc-refresh.txt
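For anyone wanting to capture similar daemon-side and client-side logs while reproducing, a sketch using the standard lxc flags:
# Stream LXD log events from a second shell, started before the failing copy:
lxc monitor --type=logging --pretty
# Client-side debug output for the failing command itself:
lxc copy c3 serverB: -s encpool --refresh --debug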
> I am almost certain that I must be doing something wrong otherwise others would be hitting this issue.
Using an encrypted rpool and using lxc copy --refresh to another encrypted pool seems niche enough to me ;)
The first thing I'd try would be to run a fresher kernel/ZFS version.
Your lxc info says kernel_version: 5.15.0-52-generic, which is very out of date. If you can, please try the latest kernel and also the latest HWE (6.8.0) one.
This way we can rule out any kernel/ZFS bug that's already been fixed.
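For reference, the versions in play can be confirmed, and the 22.04 HWE kernel pulled in, roughly like this (the package name is the standard Ubuntu HWE metapackage):
# Kernel plus ZFS userland/module versions:
uname -r
zfs version
# LXD's view of the kernel and storage driver versions:
lxc info | grep -E 'kernel_version|driver_version'
# Install the 22.04 hardware-enablement kernel (currently the 6.8 series), then reboot:
sudo apt install linux-generic-hwe-22.04
sudo reboot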