LINBIT/windrbd

Issues with Windows diskless boot process

acidrop opened this issue · 104 comments

Hello,

I was following your guide "Setting up WinDRBD for diskless boot using VirtualBox VMs" but I'm stuck at the last step, getting the Windows VM to PXE boot.

I have configured both the DHCP server and apache2 as per the instructions, although a few things were missing from the guide, for example parts of the DHCP server configuration and enabling the CGI module in apache2. Perhaps those were out of scope for the guide.

Windows VM seems to obtain an IP address from the dhcp server, but then it times out waiting for TFTP server (?) to respond...

image

My dhcpd server config:

`log-facility daemon;
allow booting;
allow bootp;
subnet 10.2.2.0 netmask 255.255.255.0 {
    authoritative;
    range 10.2.2.10 10.2.2.20;
    default-lease-time 3600;
    max-lease-time 3600;
    option subnet-mask 255.255.255.0;
    option broadcast-address 10.2.2.255;
    option routers 10.2.2.1;
}

option space ipxe;
option ipxe-encap-opts code 175 = encapsulate ipxe;
option ipxe.priority code 1 = signed integer 8;
option ipxe.keep-san code 8 = unsigned integer 8;
option ipxe.skip-san-boot code 9 = unsigned integer 8;
option ipxe.syslogs code 85 = string;
option ipxe.cert code 91 = string;
option ipxe.privkey code 92 = string;
option ipxe.crosscert code 93 = string;
option ipxe.no-pxedhcp code 176 = unsigned integer 8;
option ipxe.bus-id code 177 = string;
option ipxe.san-filename code 188 = string;
option ipxe.bios-drive code 189 = unsigned integer 8;
option ipxe.username code 190 = string;
option ipxe.password code 191 = string;
option ipxe.reverse-username code 192 = string;
option ipxe.reverse-password code 193 = string;
option ipxe.version code 235 = string;
option iscsi-initiator-iqn code 203 = string;
option ipxe.pxeext code 16 = unsigned integer 8;
option ipxe.iscsi code 17 = unsigned integer 8;
option ipxe.aoe code 18 = unsigned integer 8;
option ipxe.http code 19 = unsigned integer 8;
option ipxe.https code 20 = unsigned integer 8;
option ipxe.tftp code 21 = unsigned integer 8;
option ipxe.ftp code 22 = unsigned integer 8;
option ipxe.dns code 23 = unsigned integer 8;
option ipxe.bzimage code 24 = unsigned integer 8;
option ipxe.multiboot code 25 = unsigned integer 8;
option ipxe.slam code 26 = unsigned integer 8;
option ipxe.srp code 27 = unsigned integer 8;
option ipxe.nbi code 32 = unsigned integer 8;
option ipxe.pxe code 33 = unsigned integer 8;
option ipxe.elf code 34 = unsigned integer 8;
option ipxe.comboot code 35 = unsigned integer 8;
option ipxe.efi code 36 = unsigned integer 8;
option ipxe.fcoe code 37 = unsigned integer 8;
option ipxe.vlan code 38 = unsigned integer 8;
option ipxe.menu code 39 = unsigned integer 8;
option ipxe.sdi code 40 = unsigned integer 8;
option ipxe.nfs code 41 = unsigned integer 8;
option ipxe.no-pxedhcp 1;
option ipxe.windrbd code 42 = unsigned integer 8;
option ipxe.windrbd-root code 196 = string;

host windows10-boot {
    hardware ethernet 08:00:27:58:7e:22;
    fixed-address 10.2.2.10;
    if exists ipxe.windrbd {
        filename "";
        option root-path "http://10.2.2.1/cgi-bin/drbd.cgi?DRBD_MINOR=42";
        option ipxe.windrbd-root "drbd:windows10-boot;C;2;0.0.0.0:7690;1;1;ubuntusrv;1;10.2.2.1:7690";
    } else {
        filename "http://10.2.2.1/ipxe/ipxe-windrbd.pxe";
    }
}

ddns-update-style none;`

Both dhcpd and apache2 services seem to be active and I have verified URLs are working with curl.

What am I missing?

Thanks in advance,
Yannis

Hi,

You have probably figured this out yourself, but to load the WinDRBD-enabled
iPXE image you will need to set up a TFTP server and fill in the TFTP filename
(instead of http:// ... ipxe-windrbd.pxe in the else clause). Thanks for
using WinDRBD, and I would like to hear if you are making progress,

Best wishes,

  • Johannes
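For readers following along: before changing the `filename` in the else clause, it can help to confirm that the TFTP server actually answers for the file. The following is a minimal sketch, not part of WinDRBD or the guide. It sends a plain RFC 1350 read request and reports whether a DATA packet comes back. The helper names `build_rrq` and `tftp_responds` are made up for illustration, and the server address and file name in the usage note are assumptions taken from the config quoted above.

```python
import socket
import struct

def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode 1, filename, NUL, transfer mode, NUL."""
    return (struct.pack("!H", 1)
            + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")

def tftp_responds(server: str, filename: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers the read request with a DATA packet (opcode 3)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_rrq(filename), (server, 69))
            reply, _ = sock.recvfrom(516)  # 4-byte header + up to 512 data bytes
        except OSError:
            # Covers timeouts as well as unreachable-network errors.
            return False
    return len(reply) >= 4 and struct.unpack("!H", reply[:2])[0] == 3
```

Assuming the setup from this thread, `tftp_responds("10.2.2.1", "ipxe-windrbd.pxe")` would be the call to try from another machine on the host-only network. The sketch never acknowledges the first block, so the server simply gives up on the transfer after its own retry timeout.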

Hi Yannis,

In the docs I described installing in a VirtualBox VM. VirtualBox (at least version 6) has an iPXE with HTTP support built in. What environment are you using? I can then also document the TFTP setup in the tech guide.

You might want to subscribe to drbd-user@lists.linbit.com (go to https://lists.linbit.com/); this is the preferred way to discuss DRBD (and WinDRBD) issues.

Best regards,

  • Johannes

Hi Yannis, it seems like you haven't configured the network as a host-only adapter (your IP address is 10.2.2.1, which looks more like a NATed address). Please read "5.5. Setting up network adapters" in the documentation (you need to change the adapter to Host Only network and enable a second one as NAT if you want internet access from your machine).

Hi Johannes,

I have configured the 1st network adapter as a "host only" adapter and the 2nd in "NAT" mode, as described in the document. I just chose to use the 10.2.2.0/24 subnet for the host only network (you can verify this by checking the attached dhcp server config above).

So, the Windows VM is correctly receiving its static IP mapping via dhcp (10.2.2.10) but it refuses to pick up the windrbd ipxe module via the http method.

However, I made some progress :) I configured a tftp server on the Linux VM as you suggested and then configured the dhcp server to deliver ipxe-windrbd.pxe file via the tftp server.
So, now I have the Windows VM booting ! ...well up to some point...as after 20 or so minutes it timed out and went to a black screen.

I'll probably have to setup a syslog server and configure the Windows VM to send the logs there... but anyways it sounds like a good challenge :)

The drbd resource on the Linux VM is showing the following status:

root@ubuntusrv:/tftpboot# drbdadm status
windows10-boot role:Secondary
  volume:1 disk:UpToDate
  windows10 role:Primary
    volume:1 peer-disk:Diskless

... which looks good, so it must be something I missed on the Windows VM side...

Will test later and report my findings..

Many thanks,
Yannis

Cool ;) Please keep me updated.

  • Johannes

Does the DRBD device connect? You can do

drbdadm status

on the Linux side to check. Also

watch -n 1 drbdsetup status --statistics

is something I do to check if there is progress on the I/O.

Best,

  • Johannes
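To check the connection state from a script rather than by eye, the text printed by `drbdadm status` can be split into its `key:value` tokens. This is only a rough sketch under stated assumptions: `parse_status` is a hypothetical helper, the sample text is the status output quoted earlier in this thread, and the parser keeps just one volume per node (a second volume line would overwrite the first).

```python
SAMPLE = """\
windows10-boot role:Secondary
volume:1 disk:UpToDate
windows10 role:Primary
volume:1 peer-disk:Diskless
"""

def parse_status(text):
    """Return one dict per resource or peer found in `drbdadm status` text."""
    nodes = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        fields = dict(t.split(":", 1) for t in tokens if ":" in t)
        if ":" not in tokens[0]:
            # A line that starts with a bare name opens a new resource or peer.
            nodes.append({"name": tokens[0], **fields})
        elif nodes:
            # Volume lines belong to the most recently named node.
            nodes[-1].update(fields)
    return nodes

for node in parse_status(SAMPLE):
    print(node)
```

On a live system the text could come from `subprocess.run(["drbdadm", "status"], capture_output=True, text=True).stdout`. Because the parser ignores indentation, it works on the flattened output as pasted here as well as on the indented original.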

Yes, it does, and actually I got the Windows VM up and running! There was no need for any changes, it just booted after waiting some time... It's super slow and any action takes ages to complete, but I guess this is expected given the circumstances. By the way, I'm using a Windows 10 VM.

Thanks,
Yannis

image

Hi that is really awesome that you've made it. How much RAM does the
Windows 10 VM have? In my experience when it has less than 2 GB it is
very slow (also with a real disk) ...

Thanks a lot you are the second person on this planet who made this
boot via WinDRBD ... ;)

I've set it to 4G RAM, but still it's too slow to do anything useful...
Today I tried doing the same on a Windows 7 SP1 64-bit VM. However, this proved more challenging than I originally thought. Specifically, I'm unable to get the WinDRBD 0.10.3 driver to load during system startup. This is because Win7 SP1 (like Win10) has driver signature enforcement enabled by default, and so far I haven't found a way to disable it permanently. Normally this should not be needed, as the WinDRBD driver is supposed to be digitally signed, and I made sure to choose the digitally signed driver during the driver installation.
I also tried installing WinDRBD 0.10.2, with the same results. Are you sure that these versions work on Win7 SP1 64-bit? I tried Win7 32-bit as well, but the driver does not seem to install there at all...

image

image

Unrelated to this, there's a typo in the documentation referring to how to enable the syslog ip, the correct path should be the following (note the missing "services"):

HKEY_LOCAL_MACHINE\System\ControlSet001\services\WinDRBD\syslog_ip

Hi Yannis,

We recently noticed that our new key does not work with earlier Windows 7 versions.
The reason is unclear; to fix it, one has to install all updates on top of Windows 7 SP1.
Sorry about that, I know it takes half a day to install all Windows 7 updates, but
you can do that in the background and do something else. I will also try to get the
key working somehow with Windows 7 SP1. You can also try to build WinDRBD
from the sources and use

bcdedit /set TESTSIGNING ON

to disable the driver signing requirement, but building WinDRBD requires a Linux
and a Windows build environment (see the file INSTALL) and takes some time.
Feel free to write me if you have trouble building WinDRBD from scratch (not
so many people have done this before ...).

32-bit versions are not supported by WinDRBD, we might think about it (mainly to
get WinDRBD into reactos) but this is probably not going to happen soon ...

Did you try the Windows 10 machine with a directly attached disk (not via WinDRBD)?
Is it faster then? On my setup booting takes about 5 minutes, but after that Windows 10
and also Windows 7 perform well enough to be usable.

Thanks for the typo I will fix it soon,

Best wishes,

  • Johannes

Hi Johannes,

Thanks for confirming the issue with Win7. I will apply all updates and retry installing WinDRBD.

To be honest I don't think I will go through the route of compiling WinDRBD from source, at least not for now :)

Meanwhile, I managed to get Win10 working properly. I suspect that the reason it was so slow is that I had (mistakenly) configured the DHCP server to assign a default gateway for the host-only network. So, when the VM was starting up, it had 2 default gateways configured, one on the host-only network and one on the NAT network. Obviously, I only need one on the NAT network. Could this have confused WinDRBD in such a way as to make the VM super slow?
Now Win10 performs very well for a diskless machine! Thanks for this, I believe that WinDRBD is the first software achieving something like this for Windows and it's quite interesting.
I believe that a diskful/diskless Win10 combination will accelerate WinDRBD, glad you have that in your future plans.
I have configured dhcp server so it can handle both Win10 and Win7 boot processes.

Many thanks for your hard work.

B.R.
Yannis

Hi Yannis,

Thank you for your contribution by testing the setup. I am not sure about the
gateway thing, but it might have confused Windows so that it sometimes does
not reach the WinDRBD server ... I am glad that you were able to fix this.

If you wish, I can provide you with an unsigned build so you can test with
Windows 7 also .. I know that it has trouble with signed WinDRBD drivers,
it seems that the key we are using to sign the driver is too new for Windows
7 SP1 to accept it. I know that it works with

bcdedit /set TESTSIGNING ON

but you first have to uninstall the signed driver with the WinDRBD uninstall
utility (and reboot).

Unfortunately github does not allow EXEs (for a good reason .. ;) ), so we will
have to find another way for sending the installer (gmail also won't work),

Or .. maybe better if I find a way to fix the key. Please give me some time for that.

Best regards and thanks a lot for your efforts to get this working.

  • Johannes

By the way, a solution for diskless Windows boot using iSCSI has been around
for quite a while. There is also an older (abandoned) software that allows
diskless boot via ATA over Ethernet (AoE). But neither of them allows
a diskful / remote setup, nor redundant servers (at least
not at the same time). So once we implement diskful operation, it
will really be a new thing ...

Best wishes,

  • Johannes

Installed all Windows7 SP1 updates and WinDRBD worked fine. I now have both systems running...

Many thanks!

image

Hi Yannis,

This is great .. sorry I was on vacation ;) Do you actually use the installations? Please
inform me when something does not work correctly (BSODs, ...).

Many thanks to you as well!

  • Johannes

Hi Yannis, that is good news. It also would help if you can document somehow what you are testing, so we also know what works for you.

Best wishes,

  • Johannes

Hi Johannes,

Today I was doing some tests on the Win10 VM, basically I was running some disk benchmarking tests by using "CrystalDiskMark" tool and all of a sudden the Win10 VM froze during the tests.
I just thought that it might be good to report it. I'm attaching the logs below...

https://gist.github.com/acidrop/50dcecbc6d3db3e6dc22a9e04706e929
https://gist.github.com/acidrop/7f06fe04cd3591a7c6c349498386ab84

Hi Yannis,

Thank you for the detailed bug report. From what I see it all started when PingAck
was not received in time by the Linux box. This could be fixed by setting larger
timeouts but timeouts are currently hardcoded. I will fix this with the next release.

Once the connection breaks it often fails to reestablish the connection, this is
also a known bug, which I will try to fix soon.

Did the benchmark start at all? How were the performance numbers? How long
did the benchmark run?

Could you maybe redo the test and see if the problem persists? That would
help a lot.

Thanks a lot,

  • Johannes
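One way to see how regular the PingAck timeouts are is to measure, on the Linux side, how long each connection survives before the timeout is logged. The sketch below is an illustration only: `connection_lifetimes` is a hypothetical helper, the sample lines are copied from the kernel log quoted later in this thread, and the year 2020 is an assumption added because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

SAMPLE = """\
Jan 28 14:45:22 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time.
Jan 28 14:49:14 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time.
"""

STAMP = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) ")
ASSUMED_YEAR = "2020"  # assumption: syslog lines omit the year

def connection_lifetimes(log_text):
    """Return the seconds between each 'Connected' event and the next PingAck timeout."""
    lifetimes = []
    connected_at = None
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if not match:
            continue
        when = datetime.strptime(ASSUMED_YEAR + " " + match.group(1), "%Y %b %d %H:%M:%S")
        if "-> Connected )" in line:
            connected_at = when
        elif "PingAck did not arrive in time" in line and connected_at is not None:
            lifetimes.append(int((when - connected_at).total_seconds()))
            connected_at = None
    return lifetimes

print(connection_lifetimes(SAMPLE))  # prints [114, 113]
```

The same function can be fed the output of `journalctl -k` or the contents of a syslog file, as long as each log entry sits on its own line.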

Hi Yannis, it just came to my mind: if you are experiencing PingAck timeouts, you can also experiment with the timeout settings on the Linux side. The relevant settings are ping-timeout and ping-int:

    net {
            timeout 60;
            ping-timeout 30;
            ping-int 10;
            connect-int 20;
    }
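A note for readers on the units, since the four values are easy to misread: as far as the drbd.conf man page describes them, `timeout` and `ping-timeout` are given in tenths of a second, while `ping-int` and `connect-int` are given in seconds. The man page also asks for `timeout` to stay below both `connect-int` and `ping-int`. The small sketch below converts the values suggested above into seconds and checks that constraint. The helper names are made up for illustration.

```python
# Divisors that turn each setting into seconds (units as described in drbd.conf(5)):
# timeout and ping-timeout are in tenths of a second, the other two in seconds.
DIVISORS = {"timeout": 10, "ping-timeout": 10, "ping-int": 1, "connect-int": 1}

SUGGESTED = {"timeout": 60, "ping-timeout": 30, "ping-int": 10, "connect-int": 20}

def to_seconds(settings):
    """Convert net-section timeout settings into seconds."""
    return {name: value / DIVISORS[name] for name, value in settings.items()}

def timeout_is_consistent(settings):
    """Check that timeout is lower than both connect-int and ping-int."""
    seconds = to_seconds(settings)
    return (seconds["timeout"] < seconds["connect-int"]
            and seconds["timeout"] < seconds["ping-int"])

print(to_seconds(SUGGESTED))
print(timeout_is_consistent(SUGGESTED))
```

Read this way, the suggestion amounts to a 6 second request timeout, a 3 second PingAck timeout, a keep-alive ping every 10 seconds and a reconnect attempt every 20 seconds.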

Hope that helps,

  • Johannes

Hi Johannes,

Will test the timeout settings, thanks. In the meantime, it appears that this behavior is consistent each time I run the CrystalDiskMark tests with default timeout settings. I'm assuming that the VirtualBox network gets saturated by the load these tests generate and DRBD somehow times out.

However, on my other WinDRBD setup, on a physical laptop (did I mention this?), I'm not able to reproduce the same behavior. The screenshot below is from a diskless laptop running WinDRBD, which is served by another laptop with Ubuntu installed. Its disk is attached through an external USB adapter. The setup is running on a 1 Gbps switch.

image

Hi Yannis, this is good. It is probably the VirtualBox network, however I am working on getting the connection reestablished if it gets lost (which can always happen). As far as I know iSCSI would blue screen after 2 minutes or so when there is no network connection, WinDRBD shouldn't (but it should continue running when the network connection is up again, which it doesn't at the moment).

Are the performance numbers ok? How is the CPU load / network load (I can't see it on the screenshot)?

Thanks a lot,

  • Johannes

Hi Johannes,

Performance on the laptop is good given the conditions, however I spoke too soon. The laptop also froze after several disk benchmark tests. I'm attaching both the windrbd and journal log files. Check for the following date/time: 12 Dec, around 16:16. Hope you find these useful...

https://gist.github.com/acidrop/bbedd09be6919ec3a0de907134ca3b25

https://gist.github.com/acidrop/6bb89fc13e6c93144eddd3dc4311e6c6

Hi Yannis,

I see several reboots during the night (around 0:30 and 3:30). Was this a BSOD? (I assume not, because there are IRP_MJ_POWER requests.) Did you reboot the Windows machine? Regarding the freeze, it looks just the same as in the VirtualBox logs. I will try to implement/fix the reconnect,

Best regards,

  • Johannes

No, I haven't rebooted. I usually leave these 2 laptops on in the office. It has probably crashed for the same reason, or perhaps it's because the hdd which is serving the Windows laptop, is connected to an external usb adaptor, so that may go into standby mode after some time of inactivity ?

Also, I can rule out Windows Update as the cause, as the laptop does not have connectivity to the internet (both the Ubuntu and the Windows laptop are connected to a single isolated network switch).

Ok, thanks, I will get back to you once I have fixed the reconnect bug,

  • Johannes

Hi Yannis,

I was working on stability of WinDRBD the past weeks. There are still some issues,
however reconnecting a WinDRBD booted Windows should work now. I've tested
this with iptables DROP on the Linux side.

I have tagged the windrbd-0.10.4-rc1 release and attached it as a zip file to this message.

If you could help by testing this release candidate, that would be of great help,

Thanks a lot,

Hi Johannes,

Thanks for that. I've just installed the new version on the Windows test laptop and ran a few tests. So far it looks good, but I will leave it running during the weekend and see how it goes.

Just to confirm, there's no need to do any changes on the linux side right ?

Regards,
Yannis

Hi Johannes,

I left the two laptops on during the weekend and today I found that the Win10 laptop had rebooted but it was stuck in the Windows boot logo stage... I'm attaching the logs in case you find something useful on them. Let me know if anything else is needed.

windrbd.log
Yannis

Hi Yannis, Thank you for your message.

What the log file tells me is that after booting successfully the Windows system
shut down (probably because of a blue screen) and then after rebooting the
WinDRBD device was connected again. It kept rebooting every 10 minutes
or so the whole weekend. What you observed is the normal (known issue)
boot delay that is there when booting from WinDRBD (it takes about 5 minutes
to boot successfully). Do you have an idea what causes the reboots? Did
you see a BSOD message on the screen?

What about the other notebook? Is it also Windows 10 or 7? Did it work on
that other notebook?

Again, thanks a lot for helping to test WinDRBD,

Best wishes,

  • Johannes

Hi Yannis, I am not sure but is it an old installation? Sometimes when Windows is interrupted while booting it gets corrupted and must be re-installed. Also what I've observed is when the firewall is disabled (instead of allowing the WinDRBD ports) it fails booting. Maybe this is the cause for your Notebook which leads to the frequent reboots.

Hope that helps,

  • Johannes

Hi Johannes,

I found some time to do some further tests. First of all, I'm currently testing WinDRBD on two different setups. One, Virtualbox (1 Ubuntu VM as a server and 1 Win10 VM as a client). Two, 2 laptops (1 Ubuntu laptop as a server and 1 Win10 laptop as a client). Both setups are using identical configuration.

On both setups, I experienced a BSOD during boot (Stop error code: INACCESSIBLE_BOOT_DEVICE) after upgrading to 0.10.4-rc1. Only option to recover was reverting back to 0.10.3 (by manually attaching the disk on the client machine).

I checked both setups, and there were no FS-related issues (ran chkdsk on both). I also enabled Windows Firewall on both and created an explicit rule allowing the WinDRBD TCP ports.

Will leave both running for a while on 0.10.3 and see if there will be any abnormal reboots.
Let me know if there is anything I can do to help find out why 0.10.4-rc1 is giving a BSOD.

Regards,
Yannis

Hi Yannis, I will research this BSOD but my first guess is that the Virtual Bus Device must be reinstalled after an upgrade. Could you please repeat steps "Install the WinDRBD bus device" from the boot howto? You might be offered two WinDRBD drivers, there you should select the newer one (from January 2020 this should be the 0.10.4-rc1 driver)

Thank you and sorry for the bug (it didn't happen at my site, however I never updated directly from 0.10.3 to 0.10.4-rc1, I will try that now).

Do you have any logs from the not booting device (my guess is that there are none .. because the BSOD happens before WinDRBD is loaded, but I might be wrong)?

  • Johannes

Hi Yannis,

Unfortunately I failed to reproduce the INACCESSIBLE_BOOT_DEVICE BSOD on my setup. Upgrading from 0.10.3 to 0.10.4-rc1 works for me. Could you check the logs and the Bus device on your machines?

Thanks a lot,

  • Johannes

PS: It might be a good idea to re-open the bug, could you do that?

Hi Johannes,

I'm reopening the bug as you requested.
The bus device seems to be on the correct version (0.10.4-rc1), see the screenshot below. I tried completely removing WinDRBD and then installing 0.10.4-rc1 from scratch. I also used the "add legacy device" method as described in the guide. Still, for some reason I get the same BSOD.
Unfortunately there are no clues in the syslog, probably because, as you said, it dies before reaching that stage. I'm adding some more info below that might be useful to you...

image

image

Jan 28 14:45:22 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114
Jan 28 14:45:22 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan 28 14:45:22 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [1656])
Jan 28 14:45:22 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 647202636 (1->2 499/146)
Jan 28 14:45:22 ubuntusrv kernel: drbd windows10-boot: State change 647202636: primary_nodes=0, weak_nodes=0
Jan 28 14:45:22 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 647202636 (0ms)
Jan 28 14:45:22 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan 28 14:45:22 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time.
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected )
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread
Jan 28 14:47:16 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting )
Jan 28 14:49:14 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114
Jan 28 14:49:14 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan 28 14:49:14 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [1656])
Jan 28 14:49:14 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 713177583 (1->2 499/146)
Jan 28 14:49:14 ubuntusrv kernel: drbd windows10-boot: State change 713177583: primary_nodes=0, weak_nodes=0
Jan 28 14:49:14 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 713177583 (0ms)
Jan 28 14:49:14 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan 28 14:49:14 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time.
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected )
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread
Jan 28 14:51:07 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting )
Jan 28 14:52:56 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114
Jan 28 14:52:56 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan 28 14:52:56 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [1656])
Jan 28 14:52:56 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 255178385 (1->2 499/146)
Jan 28 14:52:56 ubuntusrv kernel: drbd windows10-boot: State change 255178385: primary_nodes=0, weak_nodes=0
Jan 28 14:52:56 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 255178385 (0ms)
Jan 28 14:52:56 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan 28 14:52:56 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
Jan 28 14:54:28 ubuntusrv sudo[28525]: yannis : TTY=tty1 ; PWD=/home/yannis ; USER=root ; COMMAND=/usr/bin/vi /etc/drbd.d/windows10-boot.res
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time.
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected )
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread
Jan 28 14:54:49 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting )
Jan 28 14:56:56 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114
Jan 28 14:56:56 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan 28 14:56:56 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [1656])
Jan 28 14:56:56 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 4098434146 (1->2 499/146)
Jan 28 14:56:56 ubuntusrv kernel: drbd windows10-boot: State change 4098434146: primary_nodes=0, weak_nodes=0
Jan 28 14:56:56 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 4098434146 (0ms)
Jan 28 14:56:56 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan 28 14:56:56 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time.
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected )
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread
Jan 28 14:58:50 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting )
Jan 28 14:59:35 ubuntusrv sudo[17259]: yannis : TTY=tty1 ; PWD=/home/yannis ; USER=root ; COMMAND=/usr/bin/vi /etc/drbd.d/windows10-boot.res
Jan 28 15:00:10 ubuntusrv sudo[23473]: yannis : TTY=tty1 ; PWD=/home/yannis ; USER=root ; COMMAND=/usr/sbin/drbdadm adjust windows10-boot
Jan 28 15:00:39 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114
Jan 28 15:00:39 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan 28 15:00:39 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [1656])
Jan 28 15:00:39 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 2102533793 (1->2 499/146)
Jan 28 15:00:39 ubuntusrv kernel: drbd windows10-boot: State change 2102533793: primary_nodes=0, weak_nodes=0
Jan 28 15:00:39 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 2102533793 (0ms)
Jan 28 15:00:39 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan 28 15:00:39 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time.
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected )
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread
Jan 28 15:02:33 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting )
Jan 28 15:04:31 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114
Jan 28 15:04:31 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan 28 15:04:31 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [1656])
Jan 28 15:04:31 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 2067581267 (1->2 499/146)
Jan 28 15:04:31 ubuntusrv kernel: drbd windows10-boot: State change 2067581267: primary_nodes=0, weak_nodes=0
Jan 28 15:04:31 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 2067581267 (0ms)
Jan 28 15:04:31 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan 28 15:04:31 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time.
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected )
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread
Jan 28 15:06:24 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting )
Jan 28 15:08:13 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114
Jan 28 15:08:13 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan 28 15:08:13 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [1656])
Jan 28 15:08:13 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 314090931 (1->2 499/146)
Jan 28 15:08:13 ubuntusrv kernel: drbd windows10-boot: State change 314090931: primary_nodes=0, weak_nodes=0
Jan 28 15:08:13 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 314090931 (0ms)
Jan 28 15:08:13 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan 28 15:08:13 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( DUnknown -> Diskless ) repl( Off -> Established )
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time.
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected )
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread
Jan 28 15:09:36 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting )
Jan 28 15:09:42 ubuntusrv sudo[20073]: yannis : TTY=tty1 ; PWD=/home/yannis ; USER=root ; COMMAND=/usr/sbin/drbdadm down all
Jan 28 15:09:42 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Disconnecting )
Jan 28 15:09:42 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible
Jan 28 15:09:42 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread
Jan 28 15:09:42 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed
Jan 28 15:09:42 ubuntusrv kernel: drbd windows10-boot windows10: conn( Disconnecting -> StandAlone )
Jan 28 15:09:42 ubuntusrv kernel: drbd windows10-boot windows10: Terminating receiver thread
Jan 28 15:09:42 ubuntusrv kernel: drbd windows10-boot windows10: Terminating sender thread
Jan 28 15:09:42 ubuntusrv kernel: drbd windows10-boot/1 drbd42: disk( UpToDate -> Detaching )
Jan 28 15:09:42 ubuntusrv kernel: drbd windows10-boot/1 drbd42: disk( Detaching -> Diskless )
Jan 28 15:09:42 ubuntusrv kernel: drbd windows10-boot/1 drbd42: drbd_bm_resize called with capacity == 0
Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot win7: conn( Connecting -> Disconnecting )
Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot win7: Aborting remote state change 0 commit not possible
Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot win7: Restarting sender thread
Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot win7: Connection closed
Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot win7: conn( Disconnecting -> StandAlone )
Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot win7: Terminating receiver thread
Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot win7: Terminating sender thread
Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot/1 drbd43: disk( UpToDate -> Detaching )
Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot/1 drbd43: disk( Detaching -> Diskless )
Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot/1 drbd43: drbd_bm_resize called with capacity == 0
Jan 28 15:09:42 
ubuntusrv kernel: drbd windows10-boot: Terminating worker thread Jan 28 15:09:42 ubuntusrv kernel: drbd windows7-boot: Terminating worker thread Jan 28 15:09:53 ubuntusrv sudo[20099]: yannis : TTY=tty1 ; PWD=/home/yannis ; USER=root ; COMMAND=/bin/bash start-drbd.sh Jan 28 15:09:53 ubuntusrv kernel: drbd windows10-boot: Starting worker thread (from drbdsetup [20103]) Jan 28 15:09:53 ubuntusrv kernel: drbd windows7-boot: Starting worker thread (from drbdsetup [20105]) Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot windows10: Starting sender thread (from drbdsetup [20114]) Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot win7: Starting sender thread (from drbdsetup [20116]) Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot/1 drbd42: meta-data IO uses: blk-bio Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot/1 drbd42: disk( Diskless -> Attaching ) Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot/1 drbd42: Maximum number of peer devices = 1 Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot: Method to ensure write ordering: flush Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot/1 drbd42: drbd_bm_resize called with capacity == 104857600 Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot/1 drbd42: resync bitmap: bits=13107200 words=204800 pages=400 Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot/1 drbd42: size = 50 GB (52428800 KB) Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot/1 drbd42: size = 50 GB (52428800 KB) Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot/1 drbd42: recounting of set bits took additional 0ms Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot/1 drbd42: disk( Attaching -> UpToDate ) Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot/1 drbd42: attached to current UUID: B8CDB3B71D0D9EA6 Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot/1 drbd43: meta-data IO uses: blk-bio Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot/1 drbd43: disk( Diskless -> Attaching ) Jan 28 15:09:54 ubuntusrv 
kernel: drbd windows7-boot/1 drbd43: Maximum number of peer devices = 1 Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot: Method to ensure write ordering: flush Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot/1 drbd43: drbd_bm_resize called with capacity == 104857600 Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot/1 drbd43: resync bitmap: bits=13107200 words=204800 pages=400 Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot/1 drbd43: size = 50 GB (52428800 KB) Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot/1 drbd43: size = 50 GB (52428800 KB) Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot/1 drbd43: recounting of set bits took additional 0ms Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot/1 drbd43: disk( Attaching -> UpToDate ) Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot/1 drbd43: attached to current UUID: 1A762AF6D414DC50 Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot windows10: conn( StandAlone -> Unconnected ) Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot windows10: Starting receiver thread (from drbd_w_windows1 [20104]) Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot win7: conn( StandAlone -> Unconnected ) Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot win7: Starting receiver thread (from drbd_w_windows7 [20106]) Jan 28 15:09:54 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting ) Jan 28 15:09:54 ubuntusrv kernel: drbd windows7-boot win7: conn( Unconnected -> Connecting ) Jan 28 15:09:55 ubuntusrv sudo[20231]: yannis : TTY=tty1 ; PWD=/home/yannis ; USER=root ; COMMAND=/bin/bash start-drbd.sh Jan 28 15:09:55 ubuntusrv drbdsetup[20237]: new-minor windows10-boot 42 1: sysfs node '/sys/devices/virtual/block/drbd42' (already? still?) exists Jan 28 15:09:55 ubuntusrv drbdsetup[20238]: new-minor windows7-boot 43 1: sysfs node '/sys/devices/virtual/block/drbd43' (already? still?) 
exists Jan 28 15:12:12 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114 Jan 28 15:12:12 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES. Jan 28 15:12:12 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [20136]) Jan 28 15:12:12 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 1091434190 (1->2 499/146) Jan 28 15:12:12 ubuntusrv kernel: drbd windows10-boot: State change 1091434190: primary_nodes=0, weak_nodes=0 Jan 28 15:12:12 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 1091434190 (0ms) Jan 28 15:12:12 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) Jan 28 15:12:12 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( DUnknown -> Diskless ) repl( Off -> Established ) Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time. 
Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown ) Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off ) Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected ) Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread Jan 28 15:14:05 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting ) Jan 28 15:16:14 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114 Jan 28 15:16:14 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES. 
Jan 28 15:16:14 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [20136]) Jan 28 15:16:14 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 300894553 (1->2 499/146) Jan 28 15:16:14 ubuntusrv kernel: drbd windows10-boot: State change 300894553: primary_nodes=0, weak_nodes=0 Jan 28 15:16:14 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 300894553 (0ms) Jan 28 15:16:14 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Secondary ) Jan 28 15:16:14 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( DUnknown -> Diskless ) repl( Off -> Established ) Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time. Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown ) Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off ) Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected ) Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread Jan 28 15:18:08 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting )

Here's my resource config file...

windows10-boot.res.txt

Hi Yannis, I just found something that might have been the reason for Windows not booting. Let me do a few tests and then I will send you 0.10.4-rc2.

Regards,

  • Johannes

Hi Yannis,

I think I have fixed the issue you are facing with WinDRBD 0.10.4-rc1. Please try the attached -rc2 file, it should now boot without issues. And, to answer something you have asked recently, on the Linux side there are no changes necessary.

Best wishes,

Hi Johannes,

Indeed that seems to have done the trick. I was able to boot from both setups, virtualbox and the laptops. I will leave them running during the night and see the outcome...

Regards,
Yannis

Hi Yannis,

Thank you for your quick response. I was experimenting with creating the bus device from within the driver and Windows got confused (one would have to reinstall the bus driver on the first reboot, if you don't then it won't boot any more). I now removed the bus driver creation from the driver code. One day the installer will learn how to do it properly ...

Best wishes,

  • Johannes

Hi Yannis, just wanted to check if your WinDRBD boxes are still running?

Thanks and best wishes,

  • Johannes

Hi Johannes,

The virtualbox setup was running fine this morning, however the win10 laptop was stuck.
Later on I realised that the reason was that the laptop had gone into standby, so it's normal for windrbd to stop working, since the network interface was down during that time. I don't expect win10 to be able to recover from such a state on a diskless setup, correct?
So I modified the power management settings on the laptop and left it running during the day without any issues.
Tomorrow I will try to run some I/O-intensive disk benchmarks (crystaldiskmark), mainly to check whether drbd is able to cope with network congestion this time (timeout).

Regards
Yannis

So far the laptop has been rock solid, not a single reboot even after running multiple disk benchmark/stress tests. Both laptops are connected to a single 1Gbps network switch, with no other computers connected to it (isolated network).

However the virtualbox setup does not seem to cope well with the disk stress tests. As soon as I run the first test, the whole Win10 VM stops responding after a few minutes. There's nothing logged from the windrbd side, but from the server side I get the below...

Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time. Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown ) Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot/1 drbd42: disk( UpToDate -> Consistent ) Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off ) Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 3998390047 (1->-1 0/0) Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 3998390047 (0ms) Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot/1 drbd42: disk( Consistent -> UpToDate ) Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected ) Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting ) Jan 30 10:05:25 ubuntusrv kernel: drbd windows10-boot tcp:windows10: initial packet M crossed Jan 30 10:05:53 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114 Jan 30 10:05:53 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES. 
Jan 30 10:05:53 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [1624]) Jan 30 10:05:58 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 3435164726 (1->2 499/146) Jan 30 10:05:58 ubuntusrv kernel: drbd windows10-boot windows10: sock_recvmsg returned -104 Jan 30 10:05:58 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> NetworkFailure ) Jan 30 10:05:58 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated Jan 30 10:05:58 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread Jan 30 10:05:58 ubuntusrv kernel: drbd windows10-boot windows10: sock was reset by peer Jan 30 10:05:58 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 3435164726 commit not possible Jan 30 10:06:25 ubuntusrv kernel: drbd windows10-boot: Aborting cluster-wide state change 3435164726 (31668ms) rv = -23 Jan 30 10:06:25 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread Jan 30 10:06:25 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed Jan 30 10:06:25 ubuntusrv kernel: drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected ) Jan 30 10:06:25 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread Jan 30 10:06:25 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting ) Jan 30 10:06:28 ubuntusrv kernel: drbd windows10-boot windows10: Handshake to peer 2 successful: Agreed network protocol version 114 Jan 30 10:06:28 ubuntusrv kernel: drbd windows10-boot windows10: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES. 
Jan 30 10:06:28 ubuntusrv kernel: drbd windows10-boot windows10: Starting ack_recv thread (from drbd_r_windows1 [1624]) Jan 30 10:06:28 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 3263297934 (1->2 499/146) Jan 30 10:06:28 ubuntusrv kernel: drbd windows10-boot: State change 3263297934: primary_nodes=4, weak_nodes=FFFFFFFFFFFFFFF9 Jan 30 10:06:28 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 3263297934 (0ms) Jan 30 10:06:28 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connecting -> Connected ) peer( Unknown -> Primary ) Jan 30 10:06:28 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( DUnknown -> Diskless ) repl( Off -> Established ) Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot windows10: PingAck did not arrive in time. Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot windows10: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown ) Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot/1 drbd42: disk( UpToDate -> Consistent ) Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot/1 drbd42 windows10: pdsk( Diskless -> DUnknown ) repl( Established -> Off ) Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot windows10: ack_receiver terminated Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot windows10: Terminating ack_recv thread Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot: Preparing cluster-wide state change 1728991597 (1->-1 0/0) Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot: Committing cluster-wide state change 1728991597 (0ms) Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot/1 drbd42: disk( Consistent -> UpToDate ) Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot windows10: Aborting remote state change 0 commit not possible Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot windows10: Restarting sender thread Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot windows10: Connection closed Jan 30 10:06:51 ubuntusrv kernel: 
drbd windows10-boot windows10: conn( NetworkFailure -> Unconnected ) Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot windows10: Restarting receiver thread Jan 30 10:06:51 ubuntusrv kernel: drbd windows10-boot windows10: conn( Unconnected -> Connecting )

Not a big issue, as I'm mostly interested in real world tests on actual hardware, rather than virtualbox VMs...
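
For anyone skimming a server-side log like the one above, here is a minimal Python sketch that pulls out only the connection state changes and the PingAck timeouts. It assumes the syslog format shown in this thread (several entries may sit on one line), and the function name `conn_events` is made up for illustration; it is not part of DRBD or WinDRBD.

```python
import re

# Syslog timestamp as it appears in the log above, e.g. "Jan 30 10:05:18".
STAMP = r"[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d"
# DRBD logs connection state changes as "conn( Old -> New )".
CONN = re.compile(r"conn\( (\w+) -> (\w+) \)")


def conn_events(log_text):
    """Return (timestamp, event) tuples for ping timeouts and conn() changes."""
    events = []
    # Split in front of every timestamp so that entries run together on one
    # line are still handled one by one.
    for entry in re.split(rf"(?={STAMP} )", log_text):
        stamp = re.match(STAMP, entry)
        if stamp is None:
            continue  # empty chunk or text before the first timestamp
        if "PingAck did not arrive in time" in entry:
            events.append((stamp.group(0), "ping-timeout"))
        change = CONN.search(entry)
        if change:
            events.append((stamp.group(0), f"{change.group(1)} -> {change.group(2)}"))
    return events


if __name__ == "__main__":
    sample = (
        "Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: "
        "PingAck did not arrive in time. "
        "Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: "
        "conn( Connected -> NetworkFailure ) peer( Primary -> Unknown ) "
        "Jan 30 10:05:18 ubuntusrv kernel: drbd windows10-boot windows10: "
        "conn( NetworkFailure -> Unconnected )"
    )
    for stamp, event in conn_events(sample):
        print(stamp, event)
```

Run against the full log, this makes the repeating cycle (ping timeout, NetworkFailure, Unconnected, Connecting, Connected) easier to spot.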

Yannis

Spoke too soon...the laptop also crashed after running several disk stress tests.

drbd1.log

Hi Yannis, Thank you for your bug report. Again I will try to reproduce it and fix it with the next (0.10.5) release. Regardless I just released 0.10.4 because there are already many bugs fixed in that release. I will get back to you once I fixed what you reported.

Best wishes,

  • Johannes

Hi Yannis, it might also be that the timeouts are configured incorrectly (esp. the ping timeout). They are currently hard coded. I will create a patch so you can also configure the timeouts for the WinDRBD machine, then we can see if using other timeouts helps. So far I have failed to reproduce the crash on my setup. How long did it take until the machine froze?

Best wishes,

  • Johannes

Hi Johannes,

On the laptop, it only happened once yesterday, and I haven't been able to reproduce it since then.
But on the Virtualbox setup, it happens every time I run the disk test. If I don't run the disk test, it does not freeze.
I tried modifying the ping timeout value on the Ubuntu server side (/etc/drbd.d/windows10-boot.res), but that did not seem to help, probably because, as you said, the value is hardcoded in WinDRBD.
However, I'm not getting a BSOD when this happens, it's just that the VM freezes, which I guess is normal since there's no more communication with the server.

Yannis

Hi Johannes,

In the documentation in "5.10 Configure WinDRBD logging" the reg key is incorrect, it should be:
HKEY_LOCAL_MACHINE\System\ControlSet\Services\WinDRBD\syslog_ip
Can you please amend it ?

Also, I noticed that after upgrading WinDRBD, the syslog_ip address was reset to the default value "127.0.0.1". Can the installer be modified so that it preserves the value?
I'm not 100% sure about this, but it may have happened when I completely removed WinDRBD and reinstalled it, so don't take this for granted.

Yannis

Hi Yannis,

Thanks for pointing that out, I will correct it. The reset of the syslog_ip happens because on uninstall all registry entries are deleted. There is no way to preserve that old setting, when installing again it is reset to the default value. What you really want to do is an upgrade (running a new installer without running uninstall first), then the syslog_ip will be preserved.

I am just preparing a new release where you can experiment with timeouts, I will get back to you later this week.

Best wishes,

  • Johannes

Hi Johannes

Thanks for the heads up. Indeed I had uninstalled windrbd at some point during the time that I had the issues with its rc version, hence the syslog_ip was removed from the registry. I have reconfigured its ip address and the syslog server is now receiving output from the client normally.
I'll wait for the new release to do some further testing.

Out of curiosity, although I admit that having a diskless (or diskful + windrbd) machine configuration is an interesting topic from the hacking perspective, I struggle to think of any real world benefits from it. Can you please share your thoughts on the "real world" challenges that such a setup would solve?

One setup that comes first to my mind is having a centralized storage (i.e. LVM/ZFS) where client machines (laptops, thin clients etc.) boot from and replicate their local disks (if diskful) to/from?
One can leverage fast snapshots on the storage backend side to quickly roll back a client machine to its previous state (i.e. recovering from a virus, o/s file corruption, disk failure).
Thinking from the backup perspective, this sounds like a good idea, however wouldn't that be a waste of storage space, as most of the client O/S would store duplicate data on the storage? Deduplication on the storage backend side would possibly solve this problem, but still I'm struggling to understand the benefits of it :)

Thanks in advance,
Yannis

Hi Yannis,

> I'll wait for the new release to do some further testing.

Yes great, I probably have something ready beginning next week.

Regarding the storage: if you use either LVM with thin provisioning or zfs, only the differences to the last snapshot are stored (using a copy on write mechanism). So in most cases a snapshot would not take that much (real) storage. Was that your concern or did I understand you wrong?

Best wishes,

  • Johannes

And yes, I think there are real world applications for a WinDRBD root device setup, even when there are no local disks on the clients. As you pointed out, disaster recovery (virus attack / file system corruption / ...) is one of them. Also having the root devices on a central storage eases maintenance tasks. Today there are setups with (vmware) virtual machines where the clients just consist of a small Linux with a VNC viewer. Other setups use iSCSI for centralizing the storage. Such setups could also be implemented with centralized storage via WinDRBD.

Just some ideas ...

Best wishes,

  • Johannes

Many thanks for your last comment, it clears things up in my mind.

Regards,
Yannis

Hi Yannis, I am preparing Release 0.10.5 with a new syntax that allows you to control the various timeouts. Please find the documentation in the attached updated windrbd-boot tech guide. The iPXE binary also changed so you probably want to update it on your Linux server.

I experimented a lot with firewall on / off, network discovery on / off, static IP address vs. dynamic. Unfortunately it seems not to be that easy to build (or configure) Windows such that it does not need the root device for establishing a network connection (which in turn causes Windows to hang once there is a network interruption). I know that iSCSI reports I/O errors after a defined timeout which eventually causes Windows to blue screen. I hope that we can do better ... ;)

This is the RC1 for 0.10.5. Please note that you have to change the windrbd-root setting to the new (key/value based) syntax, else Windows will not be able to boot:

install-windrbd-0.10.5-rc1-signed.exe.zip

Find documentation about the new syntax in this document:

windrbd-boot-0.10.5.pdf

Happy hacking and I am curious if changing the timeouts solves your disconnect problem (be sure to use the same timeouts on both nodes).

  • Johannes

Hi Johannes,
Thanks for this.

I installed the new version and made the necessary adjustments on linux side...

  • copied the new version of ipxe-windrbd.pxe
  • copied the new version of drbd.cgi
  • modified the dhcp server config file to the new key/value system. I used both the "boot-windows.ipxe" file method and the classic approach (with the new key/value system of course).

When the client is booting, I can see the key/values being parsed by iPXE. After that, the Windows10 boot logo shows up and there's the usual delay, which is expected. But unfortunately, after a while, the system gives a BSOD (Inaccessible Boot Device).
On the linux side, I can see from the drbdsetup output that the Windows10 client is failing to connect.

The only clue I can find in the logs is from the apache2 error log. I'm attaching it here.
Any ideas what could be wrong? I verified that the windrbd driver is updated by checking its properties in Windows device manager. I've also double checked that the key/values match my setup (see below).

#!ipxe
set windrbd-root drbd:resource=windows10-boot;protocol=C;this-nodeid=2;node2.address=0.0.0.0:7690;node2.volume1.minor=42;node2.hostname=windows10;node1.hostname=ubuntusrv;node1.address=10.2.2.1:7690;timeout=60;ping-timeout=30;pingint=10;connect-int=20;
sanboot http://10.2.2.1/cgi-bin/drbd.cgi?DRBD_MINOR=42

I also verified that the timeout values and the connection interval settings are the same on both sides (/etc/drbd.d/windows10-boot.res) .

Regards,
Yannis
error.log

Hi Yannis, sorry, the dashes make it difficult to configure. You forgot two of them: this-nodeid should be this-node-id and pingint should be ping-int. The correct URI would be:

drbd:resource=windows10-boot;protocol=C;this-node-id=2;node2.address=0.0.0.0:7690;node2.volume1.minor=42;node2.hostname=windows10;node1.hostname=ubuntusrv;node1.address=10.2.2.1:7690;timeout=60;ping-timeout=30;ping-int=10;connect-int=20;

You can use the parser from my github account to check the syntax (as described in the tech-guide).
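
As a rough illustration of what such a syntax check catches, here is a small Python sketch that flags unknown keys in a windrbd-root URI. It is not the parser from the tech guide: the list of accepted keys is an assumption built only from the keys that appear in this thread, and the real parser may accept more.

```python
import re

# Key patterns seen in this thread only (assumption); the real WinDRBD
# parser may know additional keys.
KNOWN_KEY_PATTERNS = [
    r"resource", r"protocol", r"this-node-id",
    r"node\d+\.address", r"node\d+\.hostname", r"node\d+\.volume\d+\.minor",
    r"timeout", r"ping-timeout", r"ping-int", r"connect-int",
]
KNOWN = re.compile(r"^(?:" + "|".join(KNOWN_KEY_PATTERNS) + r")$")


def unknown_keys(uri):
    """Return the keys of a 'drbd:key=value;...' URI that are not recognised."""
    body = uri[len("drbd:"):] if uri.startswith("drbd:") else uri
    bad = []
    for pair in body.split(";"):
        if not pair:
            continue  # a trailing ';' leaves an empty element
        key, _, _value = pair.partition("=")
        if not KNOWN.match(key):
            bad.append(key)
    return bad


if __name__ == "__main__":
    broken = ("drbd:resource=windows10-boot;protocol=C;this-nodeid=2;"
              "node2.address=0.0.0.0:7690;node2.volume1.minor=42;"
              "node2.hostname=windows10;node1.hostname=ubuntusrv;"
              "node1.address=10.2.2.1:7690;timeout=60;ping-timeout=30;"
              "pingint=10;connect-int=20;")
    print(unknown_keys(broken))
```

With the URI from the earlier comment this reports `this-nodeid` and `pingint`, the two keys with missing dashes.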

Hope that helps,

  • Johannes

Duh! I'm in now... Sorry for the n00b mistake, next time I'll make sure I use the parser.
Will report back my findings...

Thanks,
Yannis

No problem, I'm here to help ... ;)

  • Johannes

Hi Yannis, just wanted to check real quick if installation of 0.10.5-rc1 worked now as expected. I would then tag 0.10.5 beginning next week if you didn't find any issues.

Best wishes,

  • Johannes

Hi Johannes, yes, so far it appears to be working well. I've changed the timeout values from 30 to 80 and I've been running iometer tests since yesterday; so far it has not crashed. I'm testing it on 2 laptops.

Yannis

Thank you that is good news :) I will then release 0.10.5 on Monday...
Did you also test with the VMs? As I remember the timeouts were more
critical on the VM setup.

  • Johannes

To be honest, I also tried on the VM setup, but I'm having some issues getting the Windows guest to boot from the network. It's different from the previous time, where it gave a BSOD. It just stays at the Windows logo forever. On the linux side, the resource shows up in StandAlone mode. I will try to test again tomorrow if I find some time. I've double checked the settings on the Linux side and they seem to be correct...

Yannis

I see, there are a bunch of reasons why a node goes into StandAlone mode (one of them being split-brain). Unfortunately you have to scan the logs to find out why this happened. (I have a feature request for drbdsetup to show the reasons).

Good luck, and happy weekend :)

  • Johannes

Managed to overcome the boot issue, it was my mistake, I had messed up the config again...
Ran some disk stress tests on the VM, but unfortunately it hangs after a few minutes, same as it did with previous windrbd versions.
Tried to experiment with higher timeout values, configured them both in the drbd resource file in /etc/drbd.d and in boot-windows.ipxe file, but the result was the same. I used 'drbdadm adjust res' each time I set new values on the linux side, so it can apply new settings.
I'm attaching the logs.....

Yannis
drbd9_29_Feb_log.txt
windrbd_29_Feb_log.txt

Hi Yannis, I currently also don't know the reason for:

Feb 29 11:02:30 10.2.2.10 U11:02:30.705|13bede50(drbd_r_windows10-boot) drbd_recv <6>drbd windows10-boot pnode-id:1, cs(Connected), prole(Secondary), cflag(0x2020e), scf(0x0): sock was shut down by peer

(peer being the Linux machine). Are you sure the Windows firewall is configured correctly? Can you also try to boot with the firewall disabled?

Which tests are you running? Maybe I can reproduce the bug on my system ...

Thanks and best wishes,

  • Johannes

Hi Johannes,

I tried with firewall turned off but got the same behaviour. When firewall is turned on, I have it configured as per below....

image
image

I was able to reproduce same problem on the laptops as well, I'm attaching the logs below ...

drbd_log_Mar_02_2020.txt.log
windrbd_log_Mar_03_2020.txt.log

I'm using the following "fio" command on the Windows client, for my tests...

fio --filename=test.img --size=5G --direct=1 --rw=randrw --bs=4k --iodepth=10 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1

Do you have any recommended timeout values that I could try ?

Thanks,
Yannis

Hi Yannis, Thank you for your detailed problem report. I will try to reproduce your problem and then get back to you. I am currently upgrading DRBD for WinDRBD (to 9.0.21), so please give me some time (probably until the beginning of next week).

Best wishes,

  • Johannes

Hi Yannis, I tried your fio run now and couldn't reproduce the hang after 3 iterations. I will keep on trying. How quickly does the system hang occur?

Best wishes,

  • Johannes

For the timeout values I don't know, I have to ask one of the Linbit guys tomorrow, I will post then,

  • Johannes

I'm able to reproduce both on VM and Laptop setup by running the fio command I posted earlier with a runtime value of "1200", so around 19 minutes. Increasing iodepth value may also "force" it to crash earlier sometimes.

Many thanks, increasing iodepth helped to trigger the bug. I will see how we can fix this.

Best wishes,

  • Johannes

Hi Yannis, I've tested on a data device (non-boot WinDRBD device) and the machine also froze. It is something in the test that causes the WinDRBD driver to freeze (or at least to become very slow). Increasing ping-timeout and timeout might help, so you can try that. Both values are given in tenths of a second: for ping-timeout the maximum is 300 (meaning 30 seconds) and for timeout the maximum is 600 (meaning 60 seconds). I am just now running a test on a (virtualized) non-WinDRBD drive; Windows is very unresponsive but still alive.
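To make the units above concrete, here is a minimal Python sketch that checks values against the maximums quoted in this post and renders a `net` section for a DRBD resource file. The helper names (`to_seconds`, `render_net_section`) are hypothetical and exist only for this illustration; the limits are the ones stated above, not values read from DRBD itself.

```python
# Maximums as quoted in this thread, expressed in tenths of a second.
MAX_DECISECONDS = {"ping-timeout": 300, "timeout": 600}


def to_seconds(deciseconds: int) -> float:
    """Convert a DRBD timeout value (tenths of a second) to seconds."""
    return deciseconds / 10


def render_net_section(values: dict) -> str:
    """Render a 'net { ... }' block, rejecting values above the quoted maximums."""
    lines = ["net {"]
    for option, value in values.items():
        limit = MAX_DECISECONDS[option]
        if not 0 < value <= limit:
            raise ValueError(f"{option}={value} is outside 1..{limit}")
        # The trailing comment shows the same value in seconds.
        lines.append(f"    {option} {value}; # {to_seconds(value):g} s")
    lines.append("}")
    return "\n".join(lines)


print(render_net_section({"ping-timeout": 300, "timeout": 600}))
```

The rendered block would go inside the resource definition under /etc/drbd.d, followed by `drbdadm adjust` on the Linux side.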

I will keep you updated,

Best wishes,

  • Johannes

Hi Johannes,

Tried increasing timeout values to their maximums, but unfortunately that did not help.

In my tests, both the laptop and the VM setup seem to be stable when the iodepth value is set to a maximum of 7. Anything above that causes windrbd to lose network connectivity.

I'll do the same test by using drbd9 linux driver this time (both ends), see if I get same results there as well and report back.

Regards,
Yannis

Hi, yes please do. My guess is that the machine is becoming unresponsive but can finish the test run. What you also can try is to test against a data device (non-boot device) on Windows. I've also tried running the test against a non WinDRBD block device (C: on a VM with backing storage, not booted via WinDRBD), result is that the machine becomes very unresponsive but then when the test finishes recovers again.

Thanks a lot,

Best Regards,

  • Johannes

Hi Johannes,

I did some further tests, please see the results below ...

Linux VM <-> Linux VM (drbd version: 9.0.21-1)

Used diskful (non-boot) resources both sides for the tests.
Used iodepth=64 and left test running for a day, no crash.

Windows VM <-> Linux VM (windrbd version:0.10.5-rc1 | drbd version: 9.0.21-1)

Used diskful (non-boot) resources both sides for the tests.
Used iodepth=5 and the system hard crashed (cold reboot) after approximately 2min.

Conclusion:

  • When using Linux VMs on both sides and running fio tests on a data device, the system becomes unresponsive but the drbd connection does not break.

  • When using a Windows VM on one side and a Linux VM on the other and running fio tests on a data device, the Windows system becomes unresponsive and eventually hard crashes (cold reboot). Clearly, there must be something on the Windows side which is causing this.

Attaching some logs below...hope you find something useful in them.. :-)

Mar_9_windrbd.log
Mar_9_drbd.log

Hi Yannis, Thank you for your tests, it shows what I already guessed. I can try to fix this in the next release but can't make any promises when this is going to happen (I am on vacation for one week now), I will report when I have something, ok?

Thanks a lot for the tests,

  • Johannes

Sure, no problem there's no rush from my side. Enjoy your time off! :-)

Regards,
Yannis

Hi Yannis, I hope you and your family are in good health. I continued working on the fio issue and raising priority of the ack_receiver thread made things better. Are you working at the moment? I should have something to test for you in the next few days,

Best regards,

  • Johannes

Hi Johannes, we are doing well thanks for asking, hope the same for you and your family.
I'm working from home for the last 3 weeks, but still got remote access to the VirtualBox setup, so I can do some further tests if you wish so. I recently installed the latest WinDRBD version as well (0.10.6). I noticed that the only addition was that you can specify the syslog server ip address in the boot URL, however since I don't seem to have access to the user guide (pdf) anymore, I didn't try that yet.

Regards,
Yannis

Actually the big thing about 0.10.6 is that it is based on DRBD 9.0.22. There were lots of changes between DRBD 9.0.17 and DRBD 9.0.22 that's why it took so long.

Right now I am trying different values for iodepth. With my new version it works with depth=300 but not with depth=600, but it seems to be the network layer that quits its service (I also don't get any log messages any more). It should definitely be better than what we had before. Regarding your tech guide, I will ask the guys at Linbit. I am currently waiting for the new DigiCert code signing certificate; I think I can send you a new version by the end of this week.

Sorry, you are right. I forgot to mention the upgrade of the sources to DRBD 9.0.22.
Looks like that's big progress if it can withstand that io depth. I was also thinking that this might be a limitation of the Windows networking subsystem instead of something wrong in WinDRBD, but of course I don't have the in-depth knowledge you have in such stuff, so I can only speculate :)

Many thanks,
Yannis

Hi Yannis, I asked the guys at Linbit about the tech guides. Unfortunately they have a new website and most of the old Links do not work any more. Could you register again for the new tech guide? I think that is the easiest solution.

Best regards,

  • Johannes

sure no problem..

Sorry for the inconvenience .. I will get back later this week once I get the code signing certificate.

Hi Yannis, unfortunately what I tried (raising the priority of the asender) does not really fix the problem. I ran fio with a depth of 10 and it disconnected the DRBD resource after a few minutes, which is probably the same as what you observed. I will try something else and get back later.

Happy weekend,

  • Johannes

Hi Johannes, Thanks for the heads up, that's not a problem I understand that this is a complex task.
Just out of curiosity, do you think that performing some tuning at the Windows network subsystem would help in this ? See this link as a reference...

https://docs.microsoft.com/en-us/windows-server/networking/technologies/network-subsystem/net-sub-performance-tuning-nics

Regards,
Yannis

Hi Yannis, I was working on the locking implementation (spinlocks and the like) and it seems to me that this might also fix the fio bug. I am not sure, but with depth 10 it ran 2 hours without issues. I have made an installer for you; this is also the release candidate for the next 0.10.7 release. Would you please check whether the WinDRBD booted devices run more stably now? Thanks a lot and best regards,

  • Johannes

install-windrbd-0.10.7-rc1-signed.exe.zip

Hi Johannes,

Ran some tests and this version appears to be much more stable than the previous. However I'm able to reproduce the bug fairly easily with iodepth=10 numjobs=4. It's stable with iodepth=10 numjobs=2 though. The exact fio command line parameters are the following...

fio --filename=test.img --size=5G --direct=1 --rw=rw --bs=4k --iodepth=10 --runtime=1200 --numjobs=4 --time_based --name=iops-test-job

Yannis

Hi Yannis,

Thank you for the tests. What exactly did you run that makes you say that it is more stable than the older version? How many CPU cores do you have configured on your test machine? I ran the --numjobs=4 --iodepth=10 test and it didn't lose the connection, that's why I ask.

I will then make a release (0.10.7) later this week.

Best regards,

  • Johannes

Hi Johannes,

In the previous versions, the VM used to randomly crash (BSOD) or freeze without doing anything on it. For example, I would leave the VM running (without running tests or anything) and the next day it would be completely frozen or crashed with a BSOD. In addition, if I remember correctly, it would freeze even when running fio with numjobs=2 iodepth=10, whereas now it runs fine for hours. It crashes only when increasing numjobs to 3+. The VM is configured with 2 CPU cores and 2GB of RAM. Do you want me to try with 4 cores and 4GB RAM instead? Unfortunately I do not have access to the test laptops, so I cannot test on physical machines as well...

Yannis

Hi Yannis, This is good news. Yes, please test it with 4 cores and 4 GB of RAM. It might correlate with the number of virtual CPU cores, so it would be very helpful to know if the --numjobs=4 test succeeds with 4 virtual cores.

Thanks a lot and Best regards,

  • Johannes

Hi Yannis I found that when running with --rw=randrw (instead of --rw=rw) the test succeeds (no disconnect) while with --rw=rw it loses connection after a few seconds (less than a minute). Can you confirm this? Do you know what the difference exactly is? Maybe we should do further experiments with the --rw parameter.

Thanks a lot and best regards,

  • Johannes

Yes, I confirm that's the case for me as well. I was actually doing all the latest tests with --rw=rw (instead of --rw=randrw); you can confirm that by checking my post from yesterday. In my case it disconnects even when using a lower iodepth (--iodepth=8 for example). Increasing vRAM and vCores did not make any difference. I confirm that running the test with --rw=randrw --iodepth=10 succeeds. According to the fio docs the differences are as follows...

Type of I/O pattern. Accepted values are:

	**read**
			Sequential reads.
	**write**
			Sequential writes.
	**trim**
			Sequential trims (Linux block devices and SCSI
			character devices only).
	**randread**
			Random reads.
	**randwrite**
			Random writes.
	**randtrim**
			Random trims (Linux block devices and SCSI
			character devices only).
	**rw,readwrite**
			Sequential mixed reads and writes.
	**randrw**
			Random mixed reads and writes.
	**trimwrite**
			Sequential trim+write sequences. Blocks will be trimmed first,
			then the same blocks will be written to.

Tested with --rw=write as well and it has same impact as with --rw=rw (drbd resource gets disconnected). It appears to be an issue with sequential writes ?

Tested also with --rw=read (sequential reads) and the system was stable, no issues.
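To repeat this comparison systematically, one option is to generate one fio command line per I/O pattern and run them in turn. The Python sketch below only builds and prints the commands; it reuses the flags from the fio command posted earlier in this thread, and the helper name `fio_command` is made up for this example.

```python
import shlex

# Flags taken from the fio command posted earlier in this thread.
BASE_ARGS = [
    "fio",
    "--filename=test.img",
    "--size=5G",
    "--direct=1",
    "--bs=4k",
    "--runtime=1200",
    "--time_based",
    "--name=iops-test-job",
]


def fio_command(rw: str, iodepth: int = 10, numjobs: int = 4) -> list:
    """Return the fio argument list for one I/O pattern."""
    return BASE_ARGS + [f"--rw={rw}", f"--iodepth={iodepth}", f"--numjobs={numjobs}"]


# Sequential patterns first, then the random mixed pattern for comparison.
for pattern in ("read", "write", "rw", "randrw"):
    print(shlex.join(fio_command(pattern)))
```

Each printed line can be pasted into a shell on the Windows client. Varying `iodepth` and `numjobs` in the same loop would also cover the other combinations discussed here.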

Yannis

Hi Yannis, yes, that looks like we have a problem with sequential writes. I will do some tests tomorrow. Maybe it is the block size: there is a DRBD limit of 1 megabyte for one request and we split larger requests, so maybe the splitting code is broken or something like this.
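For readers unfamiliar with request splitting, the idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not the WinDRBD splitting code; the function name `split_request` is hypothetical and the limit is assumed to be 1 MiB (1024 * 1024 bytes).

```python
# Assumed per-request limit: 1 MiB, as mentioned in this thread.
MAX_REQUEST_BYTES = 1024 * 1024


def split_request(offset: int, length: int, limit: int = MAX_REQUEST_BYTES) -> list:
    """Split one I/O request into contiguous (offset, size) pieces of at most `limit` bytes."""
    if length < 0 or limit <= 0:
        raise ValueError("length must be >= 0 and limit must be > 0")
    pieces = []
    while length > 0:
        size = min(length, limit)
        pieces.append((offset, size))
        offset += size   # the next piece starts where this one ends
        length -= size
    return pieces


# A 2.5 MiB write starting at byte 4096 becomes three pieces.
for piece in split_request(4096, 2 * 1024 * 1024 + 512 * 1024):
    print(piece)
```

A bug in such code typically shows up as pieces that overlap, leave a gap, or do not add up to the original length, which is why the pieces should always be checked against the original request.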

Thanks a lot for testing, best regards,

  • Johannes

Hi Yannis, it might be useful to exchange email addresses; mine is johannes@johannesthoma.com. If that is ok with you, then please write me an email. Thanks a lot, Johannes

Hi Yannis, I've made some progress: the issue is that IoCompletion on completing the master bio hangs forever. The problem does not exist on Windows 7. However, it also happens with a regular disk (non-boot) device under Windows 10, but only if it does not have local backing storage (that is, it is Diskless Primary). Unfortunately there is no easy fix yet. Probably we need to keep track of completed IRPs and check if they are already completed. I will keep on trying...
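The "keep track of completed IRPs" idea amounts to making completion idempotent. The Python sketch below shows that general pattern only; it is not driver code (a real driver would do this in C with kernel locking), and the class and method names are invented for this illustration.

```python
import threading


class CompletionTracker:
    """Remember which request IDs were completed so each one is completed only once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._completed = set()

    def complete_once(self, request_id, complete) -> bool:
        """Call complete(request_id) unless this ID was already completed.

        Returns True if the callback ran, False if it was a duplicate.
        """
        with self._lock:
            if request_id in self._completed:
                return False
            self._completed.add(request_id)
        # Run the callback outside the lock so a slow completion cannot block others.
        complete(request_id)
        return True


tracker = CompletionTracker()
done = []
tracker.complete_once(1, done.append)
tracker.complete_once(1, done.append)  # duplicate, ignored
print(done)
```

In this simplified form the set of completed IDs grows without bound; a real implementation would need to drop entries once a request can no longer be referenced.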