LINBIT/windrbd

1.0.0-rc1 is crashing Win10 VM under heavy load [DRBD Data disks]

acidrop opened this issue · 28 comments

Hi Johannes,

As promised, I did several fio crash tests on 1.0.0-rc1, this time on top of DRBD data disks.
In general, most of the tests have been successful; however, I managed to reproduce the following crash each time ...

  • 2 x Win10 VMs running on VirtualBox (Windows host).
  • 2 x DRBD data disks (one on each VM).
  • Using standard windrbd config file as a template to setup resources (windrbd-sample.res).
  • Using fio with the two command line parameters "--iodepth=10 --numjobs=4" causes the whole Win10 VM to crash (i.e. a cold reboot).

Perhaps this might be a corner case, but I thought it might be useful to report it. The exact fio command is the following...

fio --filename=test.img --size=2G --direct=1 --rw=rw --bs=4k --iodepth=10 --runtime=1200 --numjobs=4 --time_based --name=iops-test-job

VM1.log
VM2.log

drbd-data.res.txt

Yannis

Hi Yannis, Thank you for your bug report. It might be the same error that causes the boot setup to lose connection, since the fio test is quite similar.

Just because I'm curious: Can you post the tests that succeeded? Did you only do I/O tests, or did you also experiment with drbdadm (for example a drbdadm up/down loop or similar)?

  • Johannes

All fio tests with values --iodepth=1-9 and --numjobs=1-3 succeed.
I also experimented with drbdadm (down, up, disconnect, connect, etc.). In general that has been working OK. There were some cases, though, where "drbdadm disconnect res" was hanging and "drbdadm status res" was showing "blocked: upper". Not sure what that means, but in those cases the only way to recover was to reboot the VM. The kind of cycle I was running is sketched below.
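For reference, the cycle looked roughly like this ("res" is just a placeholder for the resource name defined in drbd-data.res):

rem bring the resource up and check its state
drbdadm up res
drbdadm status res

rem disconnect/connect while the resource is up; a hanging disconnect showed "blocked: upper" here
drbdadm disconnect res
drbdadm status res
drbdadm connect res

rem tear it down again
drbdadm down res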

Another hard crash, this time while using "--iodepth=4" "--numjobs=2" on the Primary (VM2). The crash occurred on the Secondary (VM1) node while I was experimenting with "drbdadm disconnect/connect res". The Secondary node froze and after a few seconds it crashed (hard reset).

VM1.log
VM2.log

Unfortunately, I also noticed data divergence in the fio test.img file between the two nodes. It may have occurred during the failed disconnect/connect, but the strange part is that "drbdadm status res" shows UpToDate on both nodes.

image

Hi Yannis, thank you for your detailed report. I will try to reproduce and fix this; it may take a while .. (could even be weeks).
Just one question: what did you do to make drbdadm disconnect hang?

Hi Johannes,

The data divergence issue occurred only once, and it was during high load (fio). drbdadm disconnect/connect hung for some seconds, then I had to use drbdadm down/up again. From that moment both sides were reporting UpToDate, but when switching to the Primary role on each side, clearly the data was not the same. I can reproduce the drbdadm disconnect hang relatively easily by issuing a heavy-duty fio job and trying to disconnect/connect on the Secondary while I/O is in progress.
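Roughly, the reproduction looks like this ("res" again stands in for the actual resource name):

rem on the Primary: heavy fio job on a file that lives on the DRBD-backed disk
fio --filename=test.img --size=2G --direct=1 --rw=rw --bs=4k --iodepth=10 --runtime=1200 --numjobs=4 --time_based --name=iops-test-job

rem on the Secondary, while the fio job is still running:
drbdadm disconnect res
drbdadm connect res
rem when the disconnect hangs, "drbdadm status res" shows "blocked: upper"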

Yannis

It was possible to switch to Primary at the same time? But only when the nodes are disconnected, right? But you said there was data divergence even before the split-brain situation .. hmm, that shouldn't happen, especially with protocol C. Let me try to reproduce this; maybe it is a DRBD (not WinDRBD) bug.

No, dual Primary is disabled in the config. What I did was switch from Secondary to Primary, one host at a time, while both nodes were in a Connected state. However, when md5sum-ing the same file on each side, it showed a different checksum. That was weird, but as I said it happened only once, so maybe I hit a corner case; it might be an issue in DRBD as you said. I guess running drbdadm verify (if that were supported in WinDRBD) would have found the problem and corrected it?

drbdadm verify should have detected the problem and logged it in the kernel log. It does not correct the problem. There is no way to do this automatically, since one has to decide which data will be lost. drbdadm verify is already supported (it was one of the last features I implemented), so maybe you could give it a try and see if it detects the data divergence?

Oh, you are right, and actually that's what I did to overcome the problem (I issued drbdadm disconnect on the Secondary followed by drbdadm --discard-my-data connect res). Now both sides seem to show the correct checksum, but I will give drbdadm verify a go too... I will report back any findings.
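For reference, the recovery sequence on the Secondary was roughly the following ("res" is the resource name; depending on the drbd-utils version, the --discard-my-data option may need to be placed differently):

rem on the node whose changes are to be thrown away (the Secondary here)
drbdadm disconnect res
drbdadm --discard-my-data connect res

rem online check for remaining divergence; results end up in the kernel log, nothing is corrected automatically
drbdadm verify res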

cheers

Hi Yannis, I might have fixed the fio bug with --iodepth=10 that you reported. The problem was that sometimes I/O requests are sent at raised IRQL (something like a bottom half), and DRBD waits and sleeps for concurrent writes to finish (which is not allowed at raised IRQL). With the -rc2 patch, drbd_make_request() is called from a workqueue (at PASSIVE_LEVEL) and therefore may also sleep when the DRBD request is created. I've tested this patch under Windows 7 and the fio command you sent me does not crash the machine; however, I couldn't test it under Windows 10 (which may behave in a different way). Could you please re-run the tests with the attached 1.0.0-rc2 version? That would be very helpful. Thanks a lot,

Hi Johannes,

Tried to install rc2 on 2 VMs. It installed successfully on one, but on the other VM it caused a hard crash (BSOD).
Tried completely uninstalling rc1 and then installing rc2, but the result is the same. I chose not to install the bus driver, but that did not make any difference either. Finally, I manually installed the storage driver from Device Manager, but when I try to bring up the DRBD resource, I receive the error below (kernel driver missing?)... Is there anything else I can try to install rc2 on this VM?

image
image

Hi Yannis, you mean it crashed when you installed it? Did you drbdadm down all resources first? Strange; I will look at it when I get back to the office on Monday. Thanks for reporting,

Best wishes Johannes

Hi Johannes,
Yes, it was crashing during the installation of v1.0.0-rc2.
I finally managed to get v1.0.0-rc2 on the 2nd VM. Had to …

  • Completely uninstall the currently installed version, v1.0.0-rc1, via its uninstaller.
  • Manually remove its leftovers (c:\windrbd, c:\program files\windrbd and c:\windows\system32\drivers\windrbd.sys); a rough command sketch follows below.
  • I left the content of c:\windrbd\etc as is, in order to preserve the resource configuration files.
  • Reboot and then install v1.0.0-rc2 via its installer. It did not crash this time.

I noticed that Windows keeps a history of all previously installed WinDRBD driver versions. I was wondering if these can be cleaned up somehow, preferably by the WinDRBD uninstaller? Could that have caused this behaviour?

image
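For reference, the manual cleanup mentioned above boiled down to something like this, run from an elevated command prompt (paths as on my VMs):

rem remove the leftover program directory and the old driver binary
rmdir /s /q "c:\program files\windrbd"
del c:\windows\system32\drivers\windrbd.sys
rem c:\windrbd was cleaned out by hand, keeping only the etc subdirectory with the resource configuration files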

I ran several fio tests with iodepth=10 and numjobs=4 and I can confirm that there are no crashes anymore. It successfully passed all tests (congrats!).
On a side note, I noticed that v1.0.0-rc2 broke diskless boot mode. Sometimes it gives an INACCESSIBLE_BOOT_DEVICE BSOD and other times it just hard crashes during boot (after the Windows logo shows up).
I'm able to boot successfully if I revert to v1.0.0-rc1, though. Can you test on your side whether the above is the case for you as well?
Thanks,
Yannis

Hi Yannis, thanks a lot for your report. Regarding the boot issue, my first guess is that maybe there was no bus device installed? INACCESSIBLE_BOOT_DEVICE can occur when there is no WinDRBD bus device on the system (Windows looks there for boot drives). I will try to reproduce the install error and the boot error and then get back to you,

Thanks again for the very useful reports; good to know that the fio tests pass now. I will prepare a patch in the next few days that should also fix GitHub issue #1 (the one where it loses connection during a fio test); I think this can be fixed with a similar patch.

  • Johannes

Was the 2nd VM the one that crashed frequently during the fio tests before? The reason I'm asking is that I just had a temporary boot failure on a test VM that crashed before .. it seems that Windows absolutely dislikes system crashes and may become unbootable if a crash happens at the wrong point in time...

  • Johannes

No, both VMs were crashing with the same frequency when I was doing the fio tests on them (as DRBD data disks).

Regarding the WinDRBD boot drive issue, it's strange, because the WinDRBD Virtual Bus Device seems to be installed (it shows up twice in Device Manager). But perhaps the installer did not finish installing it correctly and something else is missing, maybe in the registry, which prevents the system from booting...

Yannis

Hi Yannis, I will try to reproduce the boot issue next.

Currently I am observing strange bugs with Windows 10 machines that crashed; one, for example, cannot start any applications (Firefox, the WinDRBD installer) any more after crashing. It seems to be a problem unrelated to WinDRBD, so I think we should just make sure that Windows does not crash because of WinDRBD ... I think the cannot-install-rc2 bug you reported is also due to an earlier Windows crash, but I might be wrong.

Just because I am curious: Do you think the performance numbers of WinDRBD are ok?

Best wishes,

  • Johannes

Hi Yannis, just for your info: I was just able to reproduce the crash on installing rc2 ... thanks for reporting it,

  • Johannes

I haven't experienced such issues on my Win10 VMs, even after multiple BSODs. It sounds like filesystem corruption; have you tried running a check disk on the Windows system drive (C:)? Event Viewer may also have some clues.

Regarding the performance numbers, I believe they are OK for a VM environment, but it would be best to test on real hardware. I will try that the next time I'm in the office; I have a two-laptop setup there with a 1 Gbps network switch in the middle.

Glad that you managed to reproduce the crash, it feels better when you are not alone... :-)

Yannis

Hi Yannis, I fixed something that might have caused the BSODs on upgrade. To test it, you first need to uninstall the current (rc1 or rc2) release. This is done with a

del c:\windows\system32\drivers\windrbd.sys

and a reboot. After the reboot, no windrbd driver (the one causing the BSOD) is loaded. Then you can install rc3 (find it attached), and further upgrades should (in theory) not BSOD (you can test this by installing rc3 again).
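Put together, the sequence would look roughly like this (the installer file name below is only a stand-in for whatever the attached file is actually called):

rem remove the old driver binary so that no windrbd driver is loaded on the next boot
del c:\windows\system32\drivers\windrbd.sys

rem reboot
shutdown /r /t 0

rem after the reboot, run the attached rc3 installer (file name is a placeholder)
install-windrbd-1.0.0-rc3.exe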

I am not 100% sure if this patch solves the problem; in theory it should, but please give it a try ...

Thanks a lot,

Hi Johannes,

I installed rc3 on both (2) VMs following your instructions, and it was successful this time. I also tried reinstalling rc3, and that was successful as well.

The only issue I'm having is that I cannot boot via DRBD diskless mode anymore; it gives a BSOD during boot (INACCESSIBLE_BOOT_DEVICE). You mentioned that this can be caused by a missing WinDRBD virtual SCSI bus driver, but that seems to be already installed in Device Manager. Is this issue reproducible on your end?

If I roll back to rc1, I can boot without issues, so it must be something missing in the rc2/rc3 installation?

Yannis

Hi Yannis,

I can confirm the crash with rc3 and I am currently working on a fix (something in the rc2 patch was missing). Thanks again for reporting,

  • Johannes

Hi Yannis, I think I have a fix now, I will build a release and send it to you via github,

  • Johannes

Hi Yannis, this is the fix for the boot problem. As you correctly pointed out, there was a regression in 1.0.0-rc2, which is fixed now. Booting via WinDRBD should now work again; maybe you can test it on your side?

install-windrbd-1.0.0-rc4-signed.exe.zip

Best wishes,

  • Johannes

Hi Johannes,

I confirm that everything seems to be working fine with rc4; I think we can close this now.

Regards,
Yannis

Hi Yannis,

This is good news; I will close issue #2 then. Thanks for the good work. If you have time to do further tests, please do :)

Best,

  • Johannes

Oops, it seems like I don't have permission to close this issue (or I just can't find the button ...). Can you close it for me, please?

Thanks,

  • Johannes