Hangs in StartingSyncS on drbdadm invalidate-remote
johannesthoma opened this issue · 4 comments
Windows Primary doing I/O, Linux Secondary.
On Windows, execute invalidate-remote (twice in a row; it may take more attempts to trigger). Eventually, the Windows side shows:
w0 role:Primary
volume:17 disk:UpToDate blocked:upper
johannes-VirtualBox role:Secondary
volume:17 replication:StartingSyncS peer-disk:Inconsistent
while the Linux side shows:
w0 role:Secondary
volume:17 disk:UpToDate
linbit-wdrbd role:Secondary
volume:17 peer-disk:UpToDate
No new DRBD commands are possible on the Windows side.
The log shows many FIXME messages:
Oct 15 13:20:27 192.168.56.103 U11:20:24.244|0334fee0(drbd_w_w0) #177 bm_rw_range <6>drbd w0/17 minor 5, ds(UpToDate), dvflag(0x80082): bitmap WRITE of 1 pages took 15 ms
Oct 15 13:20:28 192.168.56.103 U11:20:24.284|0334fd30(drbd_r_w0) #178 print_state_change <6>drbd w0/17 minor 5 pnode-id:3, pdsk(Outdated), prpl(StartingSyncS), pdvflag(0x1c00): pdsk( Outdated -> Inconsistent )
Oct 15 13:20:28 192.168.56.103 U11:20:24.338|040252f0(not_drbd_thread) #179 patch_boot_sector Patching boot sector from DRBD to NTFS
Oct 15 13:20:28 192.168.56.103 U11:20:24.338|0277d7a0(drbd_a_w0) #180 bm_print_lock_info <3>drbd w0/17 minor 5, ds(UpToDate), dvflag(0xa0282): FIXME drbd_a_w0[145] op clear, bitmap locked for 'set_n_write from StartingSync' by drbd_w_w0[30]
Oct 15 13:20:28 192.168.56.103 U11:20:24.512|0277d7a0(drbd_a_w0) #181 bm_print_lock_info <3>drbd w0/17 minor 5, ds(UpToDate), dvflag(0xa0282): FIXME drbd_a_w0[145] op clear, bitmap locked for 'set_n_write from StartingSync' by drbd_w_w0[30]
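To get a rough idea of how often the bitmap lock conflict appears in a captured log, the FIXME lines can be counted with grep. This is only a small sketch: the function name is made up for illustration, and the pattern assumes the message format shown in the excerpt above.

```shell
#!/bin/sh
# Count "bitmap locked" FIXME lines in a DRBD log read from stdin.
# count_bitmap_lock_fixmes is a hypothetical helper name, not part of DRBD.
count_bitmap_lock_fixmes() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the function's status at success.
    grep -c 'FIXME.*bitmap locked for' || true
}
```

Usage would look like `count_bitmap_lock_fixmes < drbd.log`, where `drbd.log` stands for wherever the log was saved.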
There is no hang when there is no I/O on the Windows side. Running suspend-io before and resume-io after invalidate-remote does not help. A workaround for now is to set the Windows resource to Secondary, run invalidate-remote, wait for the sync to finish, and then set the resource to Primary again. Note that I/O fails for as long as the resource is Secondary.
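The workaround could be scripted roughly as follows. This is a sketch, not a tested procedure: the function names and the `DRBDADM` and `POLL_SECONDS` variables are invented for illustration, and the "sync finished" check simply assumes that `drbdadm status` stops reporting a `...Sync...` replication state or an Inconsistent peer disk once the resync is done.

```shell
#!/bin/sh
# Sketch of the workaround: demote, invalidate the peer, wait, promote again.
# DRBDADM and POLL_SECONDS are hypothetical knobs; DRBDADM must be a single
# command name or path.
DRBDADM=${DRBDADM:-drbdadm}
POLL_SECONDS=${POLL_SECONDS:-5}

# Succeeds while the status output still looks like a resync is running
# (assumption: matching on the status fields shown in this report).
sync_in_progress() {
    "$DRBDADM" status "$1" |
        grep -Eq 'replication:[A-Za-z]*Sync|peer-disk:Inconsistent'
}

invalidate_remote_while_secondary() {
    res=$1
    # From here on, I/O on the resource fails until it is Primary again.
    "$DRBDADM" secondary "$res" || return 1
    "$DRBDADM" invalidate-remote "$res" || return 1
    # Poll until the resync no longer shows up in the status output.
    while sync_in_progress "$res"; do
        sleep "$POLL_SECONDS"
    done
    "$DRBDADM" primary "$res"
}
```

For the resource in this report the call would be `invalidate_remote_while_secondary w0`. Polling the status output was chosen here only because it relies on nothing beyond the fields visible in this report.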
Executing drbdadm invalidate on the (Secondary) node that should be invalidated, instead of executing invalidate-remote on the Primary, leads to the same result (Primary stuck in StartingSyncS).
Fixing the wake_up() Linux kernel emulator function seems to have fixed this bug as well. The fix will be released with 1.0.0-rc8. So far it has run 50+ iterations without a hang.
Ran for 100 iterations