LINBIT/windrbd

NTFS volume larger than 1TB goes diskless

skdia15 opened this issue · 38 comments

We are trying to synchronize a 2TB disk, but when we enter the command
drbdadm primary data1 --force
the resource goes into the Diskless state.
Are there any options or workarounds for this?

================== windrbd-kernel.log ===================================

<1> U08:26:05.767|82800090(wq_windrbd_io) #1225 bio_endio Warning: thread(wq_windrbd_io) bio_endio error with err=-5.
<3> U08:26:05.767|82800090(wq_windrbd_io) #1226 windrbd_bio_finished <3>I/O failed with -5
<6> U08:26:05.860|82b97290(drbd_w_data1) #1227 drbd_khelper <6>drbd data1/0 minor 1 pnode-id:2, pdsk(DUnknown), prpl(Off), pdvflag(0x100): helper command: /cygdrive/c/windrbd/usr/sbin/drbdadm pri-on-incon-degr
<1> U08:26:06.001|82b97290(drbd_w_data1) #1228 call_usermodehelper User mode helper returned 0 (exit status is 0)
<6> U08:26:06.001|82b97290(drbd_w_data1) #1229 drbd_khelper <6>drbd data1/0 minor 1 pnode-id:2, pdsk(DUnknown), prpl(Off), pdvflag(0x100): helper command: /cygdrive/c/windrbd/usr/sbin/drbdadm pri-on-incon-degr exit code 0
<6> U08:26:06.001|82b97290(drbd_w_data1) #1230 print_state_change <6>drbd data1/0 minor 1, ds(Failed), dvflag(0x2028): disk( Failed -> Diskless )
[last message was in IRQ context or recursive]
<1> U08:26:15.779|82800090(wq_windrbd_io) #1231 bio_endio Warning: thread(wq_windrbd_io) bio_endio error with err=-5.
<3> U08:26:15.779|82800090(wq_windrbd_io) #1232 windrbd_bio_finished <3>I/O failed with -5
<1> U08:26:15.779|82800090(wq_windrbd_io) #1233 bio_endio Warning: thread(wq_windrbd_io) bio_endio error with err=-5.
<3> U08:26:15.779|82800090(wq_windrbd_io) #1234 windrbd_bio_finished <3>I/O failed with -5
<1> U08:26:15.779|82800090(wq_windrbd_io) #1235 bio_endio Warning: thread(wq_windrbd_io) bio_endio error with err=-5.
<3> U08:26:15.779|82800090(wq_windrbd_io) #1236 windrbd_bio_finished <3>I/O failed with -5
<1> U08:26:15.779|82800090(wq_windrbd_io) #1237 bio_endio Warning: thread(wq_windrbd_io) bio_endio error with err=-5.
<3> U08:26:15.779|82800090(wq_windrbd_io) #1238 windrbd_bio_finished <3>I/O failed with -5
<1> U08:26:16.780|82800090(wq_windrbd_io) #1239 bio_endio Warning: thread(wq_windrbd_io) bio_endio error with err=-5.
<3> U08:26:16.780|82800090(wq_windrbd_io) #1240 windrbd_bio_finished <3>I/O failed with -5
<1> U08:26:16.780|82800090(wq_windrbd_io) #1241 bio_endio Warning: thread(wq_windrbd_io) bio_endio error with err=-5.
<3> U08:26:16.780|82800090(wq_windrbd_io) #1242 windrbd_bio_finished <3>I/O failed with -5

Hi, did you try to reattach the disk?

bash# drbdadm attach data1

-5 means I/O error. Are you sure your disk is working correctly?
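
The -5 in the log looks like a negated Linux errno value, where 5 is EIO. A quick way to look up such codes is sketched below, using Python's standard `errno` module. The helper name `describe_errno` is made up for illustration, and the assumption that WinDRBD keeps the Linux numbering is inferred from the reply above, not from WinDRBD documentation.

```python
import errno
import os

def describe_errno(code):
    """Map a kernel-style negative error code to its symbolic name and text."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)

# The code reported by windrbd_bio_finished in the log above.
name, text = describe_errno(-5)
print(name, "-", text)
```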

Of course it might be a WinDRBD bug; however, I have not observed it to date.

Could you post a little more of the logs (you can attach a file), starting from where it says DRBD_ADM_PRIMARY? Or better, the whole logfile; it's only about 1500 lines.
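
If attaching the whole file is not practical, the part of the log from the DRBD_ADM_PRIMARY line onwards can be cut out with a few lines of script. This is only a sketch: `log_from_marker` and `count_io_failures` are hypothetical helpers, and the regular expression is a guess based on the lines quoted in this issue, not a documented WinDRBD log format.

```python
import re

# Pattern inferred from the "I/O failed with -5" lines quoted in this issue.
IO_FAIL = re.compile(r"I/O failed with (-?\d+)")

def log_from_marker(text, marker="DRBD_ADM_PRIMARY"):
    """Return the lines from the first one containing `marker` to the end."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if marker in line:
            return lines[index:]
    return []  # marker not found

def count_io_failures(lines):
    """Count lines reporting a failed I/O, grouped by error code."""
    counts = {}
    for line in lines:
        match = IO_FAIL.search(line)
        if match:
            code = int(match.group(1))
            counts[code] = counts.get(code, 0) + 1
    return counts

# Made-up three-line log; only the wording of the last line comes from this issue.
sample = "\n".join([
    "#1 some earlier message",
    "#2 DRBD_ADM_PRIMARY",
    "#3 windrbd_bio_finished <3>I/O failed with -5",
])
tail = log_from_marker(sample)
print(len(tail), count_io_failures(tail))
```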

Thanks for the report,

  • Johannes

Hi, Johannes
Thanks for the reply.

===========test environment========================

OS : windows server 2012 and 2019
DRBD : windrbd-1.0.0-rc7-signed
Disk:
Physical disk: 4TB
Partition 1: 500MB (meta-disk, raw)
Partition 2: 2TB (disk, NTFS)

  • I also tried an internal meta-disk, and a raw disk instead of NTFS, but the same symptoms appeared.
    ** However, if the disk size is set to 1TB or less, it works normally.

===========configuration ========================

resource "data1" {
    protocol A;
    on RND-102 {
        node-id 1;
        address 10.10.100.102:9007;
        disk "E:";
        meta-disk "D:";
        device "X:" minor 1;
    }
    on RND-103 {
        node-id 2;
        address 10.10.100.103:9007;
        disk "E:";
        meta-disk "D:";
        device "X:" minor 1;
    }
}

=========== step ========================

  1. drbdadm create-md data1
  2. drbdadm up data1
  3. drbdadm primary data1 --force
  4. drbdadm status ==> diskless
  5. drbdadm attach data1 ==> Outdated, not mounted, disk not readable

Hi skdia15,

Thanks for your report. The relevant log line appears to be:

<4> U00:42:01.714|bf55d160(wq_windrbd_io) #281 DrbdIoCompletion <4>DrbdIoCompletion: I/O failed with error c000000d

(backing device returned INVALID_ARGUMENT on READ request).
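
For reference, c000000d is the NTSTATUS value STATUS_INVALID_PARAMETER, and the top two bits of an NTSTATUS encode its severity. A minimal lookup sketch follows; the table is deliberately tiny and `decode_ntstatus` is a hypothetical helper, not part of WinDRBD.

```python
# A few NTSTATUS values from the Windows headers; deliberately incomplete.
KNOWN_NTSTATUS = {
    0x00000000: "STATUS_SUCCESS",
    0xC0000001: "STATUS_UNSUCCESSFUL",
    0xC000000D: "STATUS_INVALID_PARAMETER",
    0xC0000010: "STATUS_INVALID_DEVICE_REQUEST",
    0xC0000022: "STATUS_ACCESS_DENIED",
}

# Bits 31..30 of an NTSTATUS give the severity.
SEVERITY = ("success", "informational", "warning", "error")

def decode_ntstatus(hex_text):
    """Return (symbolic name or 'unknown', severity) for a hex NTSTATUS string."""
    value = int(hex_text, 16) & 0xFFFFFFFF
    severity = SEVERITY[value >> 30]
    name = KNOWN_NTSTATUS.get(value, "unknown")
    return name, severity

print(decode_ntstatus("c000000d"))
```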

I've tried to reproduce the error on my (Windows 7) VirtualBox environment but unfortunately could not reproduce the behaviour. I will try with other Windows variants.

Hi, Johannes
I tried different OS in a virtualized environment

-- VMware 10.0.3
-- Metadisk : 300MB
-- Disk : 1.2TB

Windows 7 ==> ok
Windows 2008 ==> ok
Windows 2012 and later ==> diskless, but works well with less than 1TB

Can you try to test with Windows 2012 or later on your side?

Hi skdia15,

Thanks for the further testing. I have now been able to reproduce the issue and most likely also to fix it. Attached is an installer package that should work with devices >2TB. Please test with this hotfix release and let me know if that solves your problems.

Kind regards,

  • Johannes

install-windrbd-1.0.0-rc7-39-gba19a9a-signed.exe.zip

Hi, Johannes
Sharing my test details:

-- Windows Server 2012
-- Metadisk : 1GB

  • Disk Size
    1TB ===> ok
    2TB ===> ok
    3TB ===> ok
    3.6TB ===> error
  • Status
    data1 role:Primary
      disk:UpToDate
      RND-103 role:Secondary
        replication:WFBitMapS peer-disk:Inconsistent

-- replication state did not change (I waited about 10 minutes)
-- drbdadm secondary data1 ==> bluescreen
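
To check for this state from a script rather than by eye, the key:value tokens in the status output can be picked apart. This is a rough sketch that assumes the output looks like the snippet above; `status_fields` and `in_bitmap_exchange` are hypothetical names. A single sample only shows the state at one moment, so detecting a stall would mean polling and comparing over time.

```python
import re

# Matches tokens such as "role:Primary" or "peer-disk:Inconsistent".
PAIR = re.compile(r"([\w-]+):(\S+)")
BITMAP_EXCHANGE = {"WFBitMapS", "WFBitMapT"}

def status_fields(text):
    """Collect every key:value token from `drbdadm status` style output."""
    return PAIR.findall(text)

def in_bitmap_exchange(text):
    """True if any connection reports replication:WFBitMapS or WFBitMapT."""
    return any(key == "replication" and value in BITMAP_EXCHANGE
               for key, value in status_fields(text))

sample = """\
data1 role:Primary
  disk:UpToDate
  RND-103 role:Secondary
    replication:WFBitMapS peer-disk:Inconsistent
"""
print(status_fields(sample))
print(in_bitmap_exchange(sample))
```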

Hi skdia,

Thanks for your report. It seems that the patch I wrote works but you discovered another unrelated bug. Are you sure that you installed the version I have sent you on both machines (just a question to make sure that the wake_up() bug is fixed on that side ...)?

Thanks, I will try to reproduce and fix.

Best wishes,

  • Johannes

May I ask which timezone you are in?

Hi Johannes,

Here's Republic of Korea (GMT+9)

The patch worked very well. Of course, I tested after updating both servers.

But, as mentioned above, with more than 3.6TB the synchronization status did not change from WFBitMapS/WFBitMapT,
and a blue screen occurred while trying to switch to secondary again.

Hi skdia15,

The blue screen on becoming secondary was also reported by another user, could you share the type of the BSOD (all info you find on the blue screen page) ... thanks that would help a lot,

Greetings to Korea (are you working for ManTech? Then my best wishes to sekim :) ...)

  • Johannes

Does the blue screen on secondary also occur on virtual machines? We suspect that it only happens with real hardware, so knowing that would be very valuable...

Thanks for Info,

  • Johannes

Hi Johannes,

Thanks for knowing Korea
But I don't work at Mantech

I am sharing the dump file from when the blue screen occurred:
https://drive.google.com/file/d/16Q-v3T1HxISEIwn7ozCUmh9Wc757xAyk/view?usp=drivesdk

And as you said, when I used the same configuration in a virtualized environment, the blue screen did not occur.

Hi skdia15,

I've fixed something in the PnP part of the driver, WinDRBD is now doing an eject before removing the WinDRBD device. I can't tell if this fixes the BSOD on becoming secondary you've observed because I couldn't reproduce the BSOD but chances are that this version fixes the BSOD. Could you check if it does?

Thanks a lot,

Hi,

I created a version with some debug messages turned on that should help to fix the hang on drbdadm secondary. Please re-run the test with this version and send me the logs .. unfortunately I still cannot reproduce this issue.

Thanks a lot,

As previously discussed, I'm attaching the relevant logs both from Ubuntu and Win10 side...
Please note that the affected drbd resource is the "drbd-data" one.

ubuntu-drbd-log.tar.gz
windrbd-log.tar.gz

Hi Yannis, thanks for the logs. I can't find the position where you set the device to SECONDARY. Or did you do something else and then it hung? Please explain,

Thanks

  • Johannes

Hi Johannes,

It was the last command in the sequence, but perhaps windrbd stalled and did not send anything to the syslog server ?
I haven't received any new entries in the syslog since I issued 'drbdadm secondary drbd-data', but the Windows 10 machine continues to operate nevertheless (in Diskless boot mode from another drbd resource).

These were the last two commands I issued before it hung:

C:\Windows\system32>drbdadm secondary drbd-data
Command 'drbdsetup secondary drbd-data' did not terminate within 600 seconds

C:\Windows\system32>drbdadm status drbd-data
Command 'drbdsetup status drbd-data' did not terminate within 5 seconds
DeviceIoControl() failed, error is 1
all: error sending config command

No currently configured DRBD found.
drbd-data: No such resource

Hi Yannis,

Then probably something is missing from the logfile. Are you logging remotely, or via the windrbd log-server utility? I don't see the SECONDARY command in the logs; the last log entry was from Nov 13 13:12:17 ...

Hi Yannis, Hi skdia15,

I've probably fixed the hang on drbdadm secondary; I'm not sure, however, whether the blue screen still appears. Please run a test with this version,

Thanks a lot,

  • Johannes

install-windrbd-1.0.0-rc8-12-g309eda3-signed.exe.zip

Hi Johannes,

I've installed rc8-12 but I get the same behaviour (drbdadm hanging/no BSOD). Just to be sure, my windrbd.sys time stamp is the following:

13/11/2020 13:40 929,240 windrbd.sys

Is this the correct version ?

Thanks,
Yannis

Looks like I'm still on rc8-11 for some reason...maybe the installer did not replace the previous version, even though I did not get any errors during rc8-12 installation. Tried re-installing rc8-12 as well, but still no luck. Is there a way to manually do this ?

C:\Windows\system32>drbdadm --version
DRBDADM_BUILDTAG=GIT-hash:\ 957ad6a83804691fb59f60e4482ac380f1dcd267scripts/unsnapshot-resync-target-lvm.sh\ user/v9/drbdadm_linux.c\ user/v9/drbdadm_windrbd.c\ build\ by\ johannes@linbit-wdrbd,\ 2020-05-07\ 09:23:52
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090019
DRBD_KERNEL_VERSION=9.0.25
DRBDADM_VERSION_CODE=0x090c00
DRBDADM_VERSION=9.12.0
WINDRBD_VERSION=windrbd-1.0.0-rc8-11-g9f9ac55-signed
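
Checking which build is actually loaded can be scripted by parsing the WINDRBD_VERSION line. The sketch below assumes the version string always has the shape seen in this thread (windrbd-X.Y.Z-rcN, optionally followed by a commit count and git hash); `windrbd_build` is a hypothetical helper and would return None for final releases without an rc suffix.

```python
import re

# Shape inferred from strings in this thread, e.g.
# windrbd-1.0.0-rc8-11-g9f9ac55-signed or windrbd-1.0.0-rc7-signed.
VERSION = re.compile(
    r"WINDRBD_VERSION=windrbd-(\d+)\.(\d+)\.(\d+)-rc(\d+)(?:-(\d+)-g[0-9a-f]+)?"
)

def windrbd_build(version_output):
    """Return (major, minor, patch, rc, commits_since_tag) or None."""
    match = VERSION.search(version_output)
    if match is None:
        return None
    # A missing commit count (plain rc release) is treated as 0.
    return tuple(int(g) if g is not None else 0 for g in match.groups())

output = (
    "DRBDADM_VERSION=9.12.0\n"
    "WINDRBD_VERSION=windrbd-1.0.0-rc8-11-g9f9ac55-signed\n"
)
installed = windrbd_build(output)
wanted = (1, 0, 0, 8, 12)  # rc8-12, the build that was supposed to be installed
print(installed, "up to date" if installed >= wanted else "still the old build")
```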

Hi Yannis,

I also observed irregularities with the installer. I just did an upgrade on the Windows 10 machine where I sign the installer. I will try to fix this on Monday.

Meanwhile I found that copying the windrbd.sys file directly into the

C:\Windows\System32\drivers

directory and rebooting helps. The windrbd.sys file should be in the application folder
(C:\Program Files\WinDRBD).
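
The manual copy can also be scripted. The following is a minimal sketch, assuming the two folders named above and an elevated prompt; `install_driver` is a hypothetical helper, not something shipped with WinDRBD, and a reboot is still needed afterwards.

```python
import shutil
from pathlib import Path

def install_driver(app_dir, drivers_dir, name="windrbd.sys"):
    """Copy the driver file from the application folder into the drivers folder."""
    source = Path(app_dir) / name
    target = Path(drivers_dir) / name
    if not source.is_file():
        raise FileNotFoundError(source)
    shutil.copy2(source, target)  # copy2 also preserves the file's timestamps
    return target

# Example call with the paths from this thread (run as Administrator, then reboot):
# install_driver(r"C:\Program Files\WinDRBD", r"C:\Windows\System32\drivers")
```

Comparing the timestamp of the copied file with the one in the application folder is a quick way to confirm that the new driver is in place.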

Best wishes and a relaxing Weekend :)

  • Johannes

Hi,

I created a new installer. I'm not sure if it works for you; could you test it please? drbdadm --version should then say rc8-13 (after a reboot).

Thanks,

  • Johannes

install-windrbd-1.0.0-rc8-13-g57eddfe-signed.exe.zip

  • I confirm that rc8-13 installer works as expected.
  • Tested the original issue (switching the resource from Primary to Secondary and vice versa multiple times) and that seems to work well too. It takes some seconds for the drbd resource to switch to Secondary, but it does not hang and of course there's no BSOD.

Hi skdia15,

Could you maybe also check if the version 1.0.0-rc8-13 works for you? I then would make a release rc9 with this (and some other) fixes this or next week.

Thanks a lot for testing this,

  • Johannes

Hi Johannes,

I tried running the patched version, but unfortunately, a BSOD occurs.

Below I attach the settings and logs

The disk I used is Disk1 in the image shown below.

version

disk

windrbd-kernel.log
windrbd-umhelper.log

Even though the issue with switching from Primary/Secondary (and vice versa) appeared to be solved, I'm now observing the following issue (on the laptop setup only):

On Windows 10:

  • Switch DRBD resource to Primary and write some data onto it (fio/dd etc).
  • Switch DRBD resource to Secondary (SUCCESS).

On Ubuntu:

  • Switch DRBD resource to Primary (State change was refused by peer node).

On Windows10:

  • drbdadm down drbd-data

drbd-data: State change failed: (-2) Need access to UpToDate data
additional info from kernel:
failed to detach
Command 'drbdsetup down drbd-data' terminated with exit code 17

windrbd-kernel.zip

Hi Johannes,

Sorry, I ran the test for more than 3.6TB incorrectly:
the two disk sizes were different.
After matching the volume sizes I ran it again, and it worked well.

I tried running windrbd-1.0.0-rc8-signed

Thank you

Hi Skdia15,

I don't understand ... do you encounter blue screens on secondary with the rc8-13 version I've posted earlier? Yannis and I don't have any blue screens any more, please let me know if the blue screen on secondary appears to be fixed,

Thanks a lot,

  • Johannes

Hi Yannis,

I found the reason for the peer not being allowed to become Primary:

Something on the Windows side is holding the DRBD device open:

<3> U11:53:59.474|f3fd91b0(drbd_r_drbd-data) #317991 change_connection_state <3>drbd drbd-data, r(Secondary), f(0x0), scf(0x0): State change failed: Peer may not become primary while device is opened read-only

The problem appears to be that something on the Windows side keeps the device open (read-only) which is legal in Linux DRBD when all peers are secondary but not in WinDRBD (where we delete the Windows device when becoming secondary). It seems that I need to force the open counts to 0 when becoming secondary (Windows not becoming Primary when there are read-only opens on the Linux side still should be supported).

Need to re-think the whole thing and I will send you a patched installer tomorrow or on Monday ...

Good work on discovering this bug, this was a hard to find one :)

  • Johannes

Hi Yannis,

I wrote a patch that should solve the problem with the peer not becoming Primary. Since I couldn't reproduce the issue I can't tell if the patch solves the problem. Could you please test with this attached rc8-19 version and let me know if it solves the problem? Could you also please attach the logs (even if the problem is solved),

Thanks a lot,

Hi Johannes,

My tests show that the issue has now been solved (congrats!).
However, I would like to point out another observation which might be worth checking.

On Windows 10 machine:

  • Switch drbd resource to Primary
  • Write some data onto it.
  • Switch drbd resource to Secondary (SUCCESS).

On Ubuntu machine:

  • Switch drbd to Primary (SUCCESS).
  • Mount drbd disk (resource is mounted in read only mode due to an unclean state)...

""The disk contains an unclean file system (0, 1).
Metadata kept in Windows cache, refused to mount.
Falling back to read-only mount because the NTFS partition is in an
unsafe state. Please resume and shutdown Windows fully (no hibernation
or fast restarting.)""

It appears that, when the resource is switched to Secondary on the Windows 10 machine, it's done forcefully, hence the ntfs filesystem is in an unclean state.

To clean filesystem state I follow these steps:

On Ubuntu:

  • Un-mount drbd disk.
  • Switch drbd resource to Secondary (SUCCESS).

On Windows10:

  • Switch drbd resource to Primary (SUCCESS).
  • Run "chkdsk /f e:" where "e:" is the drive letter for the drbd disk.
  • Open "Disk Management" and "Offline" the drbd disk. This can also be done by using "diskpart" utility.
  • Switch drbd resource to Secondary (SUCCESS).

On Ubuntu:

  • Switch drbd resource to Primary.
  • Mount drbd disk (SUCCESS). Now the resource can be mounted in a clean (read/write) state.

I believe that setting the drbd disk as "offline" before switching the resource into Secondary, on Windows 10 system, is the key for having the drbd disk dismount properly (unless I'm missing something?). Can this step be handled automatically by windrbd when switching the resource to Secondary ?

Attaching the logs both with and without the "offline" step, just in case there's something useful...

Regards,
Yannis

windrbd-kernel.zip
windrbd-kernel-offline_disk.zip

Hi Yannis,

I believe that setting disks online / offline is something that shouldn't be handled by WinDRBD. I will document the issue and write a how-to for users. I will think about it a little ...

Thanks for pointing that out,

  • Johannes

Also it seems that just setting the device offline doesn't solve the problem. One probably has to run the checkdisk command.

Best wishes,

  • Johannes

Seems that running chkdsk also does not solve the unclean mount...hmm...have to find another solution,

Kind Regards,

  • Johannes

I have created a simple batch script to test this specific issue and I don't have any issues (no chkdsk is needed) mounting the NTFS filesystem on Linux:

@echo off
rem Promote the resource, then bring the DRBD disk online
drbdadm primary drbd-data
timeout /t 3
diskpart /s c:\temp\disk-online.scr
timeout /t 3
rem Write 500 MB of test data to the DRBD volume
cd /d e:
del test8.io
dd if=/dev/zero of=test8.io bs=1M count=500 --progress
timeout /t 3
cd /d c:
rem Take the DRBD disk offline before demoting the resource
diskpart /s c:\temp\disk-offline.scr
timeout /t 3
drbdadm secondary drbd-data
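
The contents of disk-online.scr and disk-offline.scr are not shown in this thread. A guess at what they might look like, generated by a small helper, is sketched below. `select disk`, `online disk` and `offline disk` are standard diskpart commands, but the disk number (1 here) and the helper names `diskpart_script` and `write_scripts` are assumptions for illustration only.

```python
from pathlib import Path

def diskpart_script(disk_number, action):
    """Build a two-line diskpart script that onlines or offlines one disk."""
    if action not in ("online", "offline"):
        raise ValueError(action)
    return f"select disk {disk_number}\n{action} disk\n"

def write_scripts(directory, disk_number):
    """Write disk-online.scr and disk-offline.scr into `directory`."""
    directory = Path(directory)
    paths = {}
    for action in ("online", "offline"):
        path = directory / f"disk-{action}.scr"
        path.write_text(diskpart_script(disk_number, action))
        paths[action] = path
    return paths

# Disk number 1 is a placeholder; check Disk Management for the real one.
print(diskpart_script(1, "offline"), end="")
```

Running `write_scripts(r"c:\temp", 1)` would produce the two files the batch script expects, provided disk 1 really is the DRBD disk.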

Hi skdia15,

If this issue is solved then please close the ticket, I do not have permission to do so,

Thanks a lot,

  • Johannes