openvstorage/volumedriver

Local restart fails if the only snapshot is just partially on the backend and the MDS is running behind


Log messages:

5111/0x00007f79b6665700 - volumedriverfs/VolumeFactory - 000000000001274f - info - replay_tlogs_on_backend_since_last_cork: Snapshot Cork: --, MetaDataStore Cork:  f622ae68-0788-43f2-8db5-e9d047c528bf, implicit start Cork --
5111/0x00007f79b6665700 - volumedriverfs/VolumeFactory - 0000000000012750 - fatal - replay_tlogs_on_backend_since_last_cork: void volumedriver::{anonymous}::replay_tlogs_on_backend_since_last_cork(const volumedriver::VolumeConfig&, volumedriver::MetaDataStoreInterface&, const volumedriver::SnapshotPersistor&): sp_cork != boost::none

The restart hits this check under the following conditions:

  • a snapshot was created that consists of multiple tlogs, only some of which are on the backend
  • voldrv was restarted (after a crash / power cycle) before all tlogs were on the backend
  • the MetaDataStore cork is behind (cause unclear, possibly an MDS failover):
    • MD cork: tlog X
    • last tlog marked as on the backend: tlog (X + 1)
    • last tlog of snapshot: tlog (X + 2)

The restart code determines that there are tlogs on the backend that are newer than the last MD cork. The SnapshotPersistor, however, does not return a cork ID, because the snapshot itself is not on the backend yet:

    const boost::optional<yt::UUID> sp_cork(sp.lastCork());
    // ...
    const OrderedTLogIds tlogs_to_replay_now(sp.getTLogsOnBackendSinceLastCork(md_cork,
                                                                               start_cork));

    if (not tlogs_to_replay_now.empty())
    {
        VERIFY(sp_cork != boost::none);
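
The mismatch can be reproduced in a toy model. The sketch below is not the volumedriver code: `ToySnapshotPersistor`, `TLog`, `Snapshot`, `last_cork`, `tlogs_on_backend_since` and `restart_would_abort` are hypothetical names, TLog IDs are plain integers instead of UUIDs, and a snapshot is assumed to count as being on the backend once its last TLog is there.

```cpp
#include <cassert>   // only needed for the checks in the usage note
#include <optional>
#include <vector>

// Toy stand-ins, NOT the volumedriver types.
struct TLog
{
    int id;          // integer ID; higher means newer (assumption)
    bool on_backend; // true once the TLog was written to the backend
};

struct Snapshot
{
    int last_tlog_id; // the snapshot's cork is modelled as its last TLog
};

struct ToySnapshotPersistor
{
    std::vector<TLog> tlogs;         // ordered oldest to newest
    std::vector<Snapshot> snapshots; // ordered oldest to newest

    bool tlog_on_backend(int id) const
    {
        for (const auto& t : tlogs)
        {
            if (t.id == id)
            {
                return t.on_backend;
            }
        }
        return false;
    }

    // Cork of the newest snapshot whose last TLog is on the backend, if any.
    std::optional<int> last_cork() const
    {
        std::optional<int> cork;
        for (const auto& s : snapshots)
        {
            if (tlog_on_backend(s.last_tlog_id))
            {
                cork = s.last_tlog_id;
            }
        }
        return cork;
    }

    // IDs of TLogs newer than md_cork that are already on the backend.
    std::vector<int> tlogs_on_backend_since(int md_cork) const
    {
        std::vector<int> ids;
        for (const auto& t : tlogs)
        {
            if (t.id > md_cork && t.on_backend)
            {
                ids.push_back(t.id);
            }
        }
        return ids;
    }
};

// State from the report with X = 1: TLogs 1 and 2 are on the backend,
// TLog 3 (the last TLog of the only snapshot) is not.
ToySnapshotPersistor make_partial_first_snapshot()
{
    ToySnapshotPersistor sp;
    sp.tlogs = {TLog{1, true}, TLog{2, true}, TLog{3, false}};
    sp.snapshots = {Snapshot{3}};
    return sp;
}

// True if there is something to replay but no snapshot cork to replay up to,
// which is the combination the VERIFY above rejects.
bool restart_would_abort(const ToySnapshotPersistor& sp, int md_cork)
{
    return !sp.tlogs_on_backend_since(md_cork).empty() &&
           !sp.last_cork().has_value();
}
```

With X = 1, `restart_would_abort(make_partial_first_snapshot(), 1)` evaluates to true in this model: TLog 2 is on the backend and newer than the MD cork, but no snapshot cork is available. Once TLog 3 is also marked as on the backend, `last_cork()` yields 3 and the condition no longer holds.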

Workaround: backend restart (with sync DTL!)

Inspection of the TLog callback shows that the TLog is marked as written to the backend before the metadata store is uncorked, which is a possible explanation for the MD cork running behind.
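
The window this ordering opens can be illustrated with a small sketch. It is a simplification under stated assumptions, not the actual callback: `ToyState`, `Crash`, `tlog_written_callback` and `md_cork_behind` are hypothetical names, and TLog IDs are plain integers.

```cpp
#include <cassert> // only needed for the checks in the usage note

// Hypothetical, simplified state; NOT the volumedriver types.
struct ToyState
{
    int last_tlog_on_backend; // newest TLog recorded as written to the backend
    int md_cork;              // TLog up to which the metadata store is uncorked
};

enum class Crash
{
    None,
    BetweenSteps // simulated crash / power cycle between the two steps
};

// Toy version of the TLog-written callback for tlog_id, using the order
// described in the report: mark the TLog as written first, uncork second.
ToyState tlog_written_callback(ToyState s, int tlog_id, Crash crash)
{
    s.last_tlog_on_backend = tlog_id; // step 1: mark as written to the backend
    if (crash == Crash::BetweenSteps)
    {
        return s; // step 2 never happens
    }
    s.md_cork = tlog_id;              // step 2: uncork the metadata store
    return s;
}

// True if the metadata store lags behind the TLog bookkeeping.
bool md_cork_behind(const ToyState& s)
{
    return s.md_cork < s.last_tlog_on_backend;
}
```

Starting from `ToyState{1, 1}`, a call for TLog 2 with `Crash::BetweenSteps` leaves `last_tlog_on_backend == 2` and `md_cork == 1`, which matches the "MD cork: tlog X, last tlog marked as on the backend: tlog (X + 1)" state from the report. The sketch only shows that such a window exists; it does not establish that this is what happened.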

Test: volumedriver_test, LocalRestartTest.partial_first_snapshot_with_mdstore_running_behind:

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from LocalRestartTests/LocalRestartTest
[ RUN      ] LocalRestartTests/LocalRestartTest.partial_first_snapshot_with_mdstore_running_behind/0
[       OK ] LocalRestartTests/LocalRestartTest.partial_first_snapshot_with_mdstore_running_behind/0 (2998 ms)
[----------] 1 test from LocalRestartTests/LocalRestartTest (2998 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (2998 ms total)
[  PASSED  ] 1 test.