LINBIT/windrbd

Hangs in StartingSyncS on drbdadm invalidate-remote

johannesthoma opened this issue · 4 comments

Windows Primary doing I/O, Linux Secondary.

On the Windows side, execute invalidate-remote (twice in a row; it may take more attempts to trigger). Eventually the Windows side shows:

w0 role:Primary
  volume:17 disk:UpToDate blocked:upper
  johannes-VirtualBox role:Secondary
    volume:17 replication:StartingSyncS peer-disk:Inconsistent

while Linux side shows:

w0 role:Secondary
  volume:17 disk:UpToDate
  linbit-wdrbd role:Secondary
    volume:17 peer-disk:UpToDate

No new DRBD commands can be executed on the Windows side.
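
For anyone who wants to check for this state from a test loop: a minimal sketch that scans status text in the format shown above and reports peer volumes whose replication state is StartingSyncS. The function names are made up for illustration and are not part of WinDRBD or drbd-utils. The parser only assumes the `key:value` layout visible in the two status dumps in this report.

```python
def parse_fields(line):
    """Split a status line into a dict of its key:value tokens."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition(":")
        if sep:
            fields[key] = value
    return fields


def find_stuck_volumes(status_text, stuck_state="StartingSyncS"):
    """Return (owner, volume) pairs whose replication field equals stuck_state.

    'owner' is the name on the most recent line carrying a role: field,
    i.e. the resource itself or one of its peers.
    """
    stuck = []
    owner = None
    for line in status_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        fields = parse_fields(line)
        if "role" in fields:
            owner = tokens[0]
        elif fields.get("replication") == stuck_state:
            stuck.append((owner, fields.get("volume")))
    return stuck


# Status text copied from the Windows side of this report.
WINDOWS_STATUS = """\
w0 role:Primary
  volume:17 disk:UpToDate blocked:upper
  johannes-VirtualBox role:Secondary
    volume:17 replication:StartingSyncS peer-disk:Inconsistent
"""

if __name__ == "__main__":
    print(find_stuck_volumes(WINDOWS_STATUS))
```

A single match does not prove a hang, since StartingSyncS is also passed through during a normal resync. A test loop would have to see the same state persist over several polls before treating it as stuck.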

Log shows many FIXME messages:

Oct 15 13:20:27 192.168.56.103  U11:20:24.244|0334fee0(drbd_w_w0) #177 bm_rw_range <6>drbd w0/17 minor 5, ds(UpToDate), dvflag(0x80082): bitmap WRITE of 1 pages took 15 ms
Oct 15 13:20:28 192.168.56.103  U11:20:24.284|0334fd30(drbd_r_w0) #178 print_state_change <6>drbd w0/17 minor 5 pnode-id:3, pdsk(Outdated), prpl(StartingSyncS), pdvflag(0x1c00): pdsk( Outdated -> Inconsistent )
Oct 15 13:20:28 192.168.56.103  U11:20:24.338|040252f0(not_drbd_thread) #179 patch_boot_sector Patching boot sector from DRBD to NTFS
Oct 15 13:20:28 192.168.56.103  U11:20:24.338|0277d7a0(drbd_a_w0) #180 bm_print_lock_info <3>drbd w0/17 minor 5, ds(UpToDate), dvflag(0xa0282): FIXME drbd_a_w0[145] op clear, bitmap locked for 'set_n_write from StartingSync' by drbd_w_w0[30]
Oct 15 13:20:28 192.168.56.103  U11:20:24.512|0277d7a0(drbd_a_w0) #181 bm_print_lock_info <3>drbd w0/17 minor 5, ds(UpToDate), dvflag(0xa0282): FIXME drbd_a_w0[145] op clear, bitmap locked for 'set_n_write from StartingSync' by drbd_w_w0[30]
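
To count these messages in a longer log, something like the following could be used. It is only a sketch: the regular expression is derived from the two FIXME lines above and may not match other variants of the message, and `parse_bitmap_fixme` is a hypothetical helper name.

```python
import re

# Pattern derived from the FIXME lines quoted in this report.
FIXME_RE = re.compile(
    r"FIXME (?P<waiter>[^\s\[]+)\[(?P<waiter_id>\d+)\] "
    r"op (?P<op>\w+), bitmap locked for '(?P<reason>[^']*)' "
    r"by (?P<holder>[^\s\[]+)\[(?P<holder_id>\d+)\]"
)


def parse_bitmap_fixme(line):
    """Return the fields of a bitmap-lock FIXME message, or None if absent."""
    match = FIXME_RE.search(line)
    return match.groupdict() if match else None


def count_bitmap_fixmes(log_lines):
    """Count the lines that contain a bitmap-lock FIXME message."""
    return sum(1 for line in log_lines if parse_bitmap_fixme(line) is not None)


if __name__ == "__main__":
    sample = (
        "bm_print_lock_info <3>drbd w0/17 minor 5, ds(UpToDate), "
        "dvflag(0xa0282): FIXME drbd_a_w0[145] op clear, bitmap locked "
        "for 'set_n_write from StartingSync' by drbd_w_w0[30]"
    )
    print(parse_bitmap_fixme(sample))
```

In the lines above, the waiter is drbd_a_w0 and the lock holder is drbd_w_w0, with the lock taken for 'set_n_write from StartingSync'.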

There is no hang when there is no I/O on the Windows side. Running suspend-io before and resume-io after invalidate-remote does not help. However, a workaround for now is to set the Windows resource to Secondary, run invalidate-remote, wait for the sync to finish, and then set the resource to Primary again (note that I/O returns failure as long as the resource is Secondary).
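
The workaround sequence could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: it assumes `drbdadm` and `drbdsetup` are on the PATH, and it treats the sync as finished once the status output no longer shows a `replication:` field other than Established. The function names, polling interval and timeout are illustrative.

```python
import subprocess
import time


def run_drbdadm(*args):
    """Run a drbdadm subcommand and raise if it fails."""
    subprocess.run(["drbdadm", *args], check=True)


def read_status(resource):
    """Return the text printed by 'drbdsetup status <resource>'."""
    result = subprocess.run(
        ["drbdsetup", "status", resource],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def sync_in_progress(status_text):
    """Assumption: any replication: value other than Established means a sync is running."""
    for token in status_text.split():
        key, _, value = token.partition(":")
        if key == "replication" and value != "Established":
            return True
    return False


def invalidate_remote_while_secondary(resource, run=run_drbdadm, status=read_status,
                                      poll_seconds=1.0, timeout_seconds=3600.0):
    """Demote, invalidate the peer, wait for the resync, then promote again."""
    run("secondary", resource)          # I/O on the resource fails from here on
    run("invalidate-remote", resource)
    deadline = time.monotonic() + timeout_seconds
    while sync_in_progress(status(resource)):
        if time.monotonic() > deadline:
            raise TimeoutError(f"resync of {resource} did not finish in time")
        time.sleep(poll_seconds)
    run("primary", resource)


if __name__ == "__main__":
    invalidate_remote_while_secondary("w0")
```

If the timeout is hit, the resource is left Secondary, so whatever calls this has to decide whether to promote it again.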

Executing drbdadm invalidate on the (Secondary) node that should be invalidated, instead of executing invalidate-remote on the Primary, leads to the same result (Primary stuck in StartingSyncS).

Fixing the wake_up() Linux kernel emulator function seems to have fixed this bug as well. The fix will be released with 1.0.0-rc8. So far it has run 50+ iterations without a hang.

Ran for 100 iterations