LINBIT/linstor-server

Snapshot rollback failed


When a node shuts down unexpectedly and a snapshot is created while that node is offline, the snapshot rollback fails even after the node comes back online.

[root@stor1 ~]# linstor n l
╭────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node  ┊ NodeType  ┊ Addresses               ┊ State                                        ┊
╞════════════════════════════════════════════════════════════════════════════════════════════╡
┊ stor1 ┊ SATELLITE ┊ 10.0.0.225:3366 (PLAIN) ┊ Online                                       ┊
┊ stor2 ┊ SATELLITE ┊ 10.0.0.170:3366 (PLAIN) ┊ Online                                       ┊
┊ stor3 ┊ SATELLITE ┊ 10.0.0.240:3366 (PLAIN) ┊ OFFLINE (Auto-eviction: 2024-09-03 17:38:12) ┊
╰────────────────────────────────────────────────────────────────────────────────────────────╯
To cancel automatic eviction please consider the corresponding DrbdOptions/AutoEvict* properties on controller and / or node level
See 'linstor controller set-property --help' or 'linstor node set-property --help' for more details
[root@stor1 ~]# linstor s c test1 snapshot2
WARNING:
    Snapshot for resource 'test1' will not be created on node 'stor3' because that node is currently offline.
SUCCESS:
Description:
    New snapshot 'snapshot2' of resource 'test1' registered.
Details:
    Snapshot 'snapshot2' of resource 'test1' UUID is: 93d33a54-c354-41b7-9d4d-f2b9611c5388
SUCCESS:
    (stor2) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    Suspended IO of '[test1]' on 'stor2' for snapshot
SUCCESS:
    (stor1) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    Suspended IO of '[test1]' on 'stor1' for snapshot
SUCCESS:
    (stor1) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    (stor1) Snapshot [ZFS-Thin] with name 'snapshot2' of resource 'test1', volume number 0 created.
SUCCESS:
    Took snapshot of '[test1]' on 'stor1'
SUCCESS:
    (stor2) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    (stor2) Snapshot [ZFS-Thin] with name 'snapshot2' of resource 'test1', volume number 0 created.
SUCCESS:
    Took snapshot of '[test1]' on 'stor2'
SUCCESS:
    (stor2) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    Resumed IO of '[test1]' on 'stor2' after snapshot
SUCCESS:
    (stor1) Resource 'test1' [DRBD] adjusted.
SUCCESS:
    Resumed IO of '[test1]' on 'stor1' after snapshot
[root@stor1 ~]# linstor s l
╭───────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ SnapshotName   ┊ NodeNames           ┊ Volumes  ┊ CreatedOn           ┊ State      ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ test1        ┊ snapshot2      ┊ stor1, stor2        ┊ 0: 5 GiB ┊ 2024-09-03 17:28:37 ┊ Successful ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯

At this point the node has come back online and the resource is UpToDate on all three nodes, but the snapshot rollback still fails.

[root@stor1 ~]# linstor r l
╭───────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node  ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞═══════════════════════════════════════════════════════════════════════════════╡
┊ test1        ┊ stor1 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2024-09-03 14:28:07 ┊
┊ test1        ┊ stor2 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2024-09-03 14:28:07 ┊
┊ test1        ┊ stor3 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2024-09-03 14:28:07 ┊
╰───────────────────────────────────────────────────────────────────────────────╯
[root@stor1 ~]# linstor s rb test1 snapshot2
ERROR:
Description:
    Snapshot 'snapshot2' of resource 'test1' on node 'stor3' not found.
Details:
    Resource: test1, Snapshot: snapshot2
Show reports:
    linstor error-reports show 66D6C9AC-00000-000000

Hello,

Yes, there are some known limitations of the rollback implementation. We already have a few ideas about how this could be improved in the future.

For now, what you can do is temporarily delete the resource from the stor3 node, run the rollback command, and then re-create the resource on stor3, which will receive the (rolled-back) data from the other two nodes.
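Roughly, that sequence would look like the following (the storage pool name is a placeholder here; use whatever pool test1 is backed by on stor3):

# remove the stor3 replica of the resource; data on the other nodes is untouched
linstor resource delete stor3 test1

# roll back on the remaining nodes, which both hold 'snapshot2'
linstor snapshot rollback test1 snapshot2

# re-create the resource on stor3; DRBD will sync the rolled-back data
# back over from stor1/stor2
linstor resource create stor3 test1 --storage-pool <pool>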

Alternatively, instead of rolling back, you could restore the given snapshot into a new resource, but this approach might not fit your use case.
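For reference, the restore variant would look roughly like this (test1-restored is just an example name for the new resource):

# create an empty resource definition to restore into
linstor resource-definition create test1-restored

# restore the volume definitions, then the snapshot data, into the new resource
linstor snapshot volume-definition restore --from-resource test1 --from-snapshot snapshot2 --to-resource test1-restored
linstor snapshot resource restore --from-resource test1 --from-snapshot snapshot2 --to-resource test1-restored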

Thank you for your reply.

I'm very interested in LINSTOR. Could you please share your approach and plan for addressing this issue, and roughly when it might be fixed?