Extremely Slow Writes in a KVM/QEMU VM
nillin42 opened this issue · 10 comments
Hi Developer,
I wanted to test the performance of VDO in a virtual machine. My test environment was Debian 11 with KVM/QEMU. I created a VM (Q35, Virtio SCSI, 20 GB) running Rocky 8.5 and 9.1. I created a VDO volume with compression and an XFS filesystem on top. Copying files from a second vdisk (on another hard disk) to VDO/XFS gave writes of about 3 MB/s. Without VDO I got nearly 100 MB/s. There are no CPU or RAM bottlenecks. Is there a known problem with VDO in virtual machines? There is no hint in the Red Hat documentation.
regards
Does your VM sit on top of RAID storage, or networked storage, or anything else that might make flushing data to stable storage incur a significant latency? VDO is pretty aggressive (probably more than necessary) about flushing any write-back caches and waiting for stuff to make it to stable storage. XFS, on the other hand, tries very hard to minimize flushes, sometimes doing them only every 30 seconds or so.
If you’re dealing with storage that doesn’t support flushes, the issue is probably somewhere else.
Some things to look at first within the VM would be:
/sys/block/<device>/queue/write_cache
This virtual file may say “write back” if caching is enabled, or “write through” if we don’t need to worry about flushing writes.
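For example, inside the VM (assuming sdc is the VDO backing device, as mentioned later in this thread):
cat /sys/block/sdc/queue/write_cache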
sar -d 1
This command (or the iostat equivalent) should show us, second by second, whether there’s I/O activity happening from VDO to storage, and how fast, or whether the issue is somewhere else.
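The iostat equivalent would be something along the lines of the following (extended per-device stats, one-second intervals; it comes from the same sysstat package as sar):
iostat -xd 1
watching the write-throughput and %util columns for the VDO device and its backing device.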
Another thing to check is if the VM is consuming a lot of resources in the host environment, triggered by VDO usage patterns -- for example, lots of cycles processing interrupts. Are you able to see if qemu is using lots of CPU cycles, or if the host environment reports excessive numbers of interrupts?
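On the host, a rough way to check the former (assuming the process name matches "qemu") would be:
top -p "$(pgrep -d, -f qemu)"
and toggling the per-thread view (the H key in top) shows whether one particular qemu thread is the hot spot.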
Our performance testing has generally been on real, fast hardware (targeting server farm environments) rather than VMs, so it’s possible there are issues we haven’t seen.
Does your VM sit on top of RAID storage, or networked storage, or anything else
No. The stack is: hard disk --> ext4 --> qcow2 --> vdo --> xfs (sdc is the device)
If you’re dealing with storage that doesn’t support flushes, the issue is probably somewhere else.
So, I understand that there should be no flushes under VDO.
Until now I used "none" for the buffer mode and "ignore" for discard. Now I set the buffer mode to "writethrough", discard to "ignore", and zeroes to "off".
output of "sar -d 1":
Much slower than 3 MB/s.
Another thing to check is if the VM is consuming a lot of resources in the host environment
No, the host is basically idle.
or if the host environment reports excessive numbers of interrupts?
How can I see this?
Hello,
I would like to chime in on this issue because, although I have a significantly different setup, I think I'm hitting the same kind of slowness. My storage stack is as follows:
Hardware RAID 6 (sda) - Dell PERC H710P controller with 1GB cache + battery
└─Partition (sda3)
└─2 LVM+VDO volumes
└─EXT4
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 54,6T 0 disk
├─sda1 8:1 0 512M 0 part /boot/efi
├─sda2 8:2 0 488M 0 part /boot
└─sda3 8:3 0 54,6T 0 part
├─Ultron--vg-root 254:0 0 5,9G 0 lvm /
├─Ultron--vg-swap_1 254:1 0 976M 0 lvm [SWAP]
├─Ultron--vg-ZEROED--VDO--POOL--1_vdata 254:2 0 14,7T 0 lvm
│ └─Ultron--vg-ZEROED--VDO--POOL--1-vpool 254:3 0 15,6T 0 lvm
│ └─Ultron--vg-ZEROED--VDO--LV--1 254:4 0 15,6T 0 lvm /mnt/ZEROED-VDO-LV-1
├─Ultron--vg-COMPRESSED--DEDUPLICATED--VDO--POOL--1_vdata 254:5 0 1T 0 lvm
│ └─Ultron--vg-COMPRESSED--DEDUPLICATED--VDO--POOL--1-vpool 254:6 0 1,2T 0 lvm
│ └─Ultron--vg-COMPRESSED--DEDUPLICATED--VDO--LV--1 254:7 0 1,2T 0 lvm /mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1
└─Ultron--vg-temp 254:8 0 100G 0 lvm /mnt/temp
The /var, /tmp and /home directories are on COMPRESSED-DEDUPLICATED-VDO-LV-1 (so basically all my Docker volumes).
# cat /sys/block/sda/queue/write_cache
write back
Should I stop using the RAID controller's cache?
XFS, on the other hand, tries very hard to minimize flushes, sometimes doing them only every 30 seconds or so
So, is it a bad idea, from a performance perspective, to use an EXT4 filesystem instead of XFS on VDO volumes?
Until now I used "none" for the buffer mode and "ignore" for discard. Now I set the buffer mode to "writethrough", discard to "ignore", and zeroes to "off"
I don't get it, what is this "buffer mode"? And how do you change those settings? In your VDO volume settings file?
Here is the sar -d 1 output while running dd if=/dev/zero of=/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1/temp.raw bs=1M count=1000 oflag=dsync, and the dd output: 1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.95667 s, 176 MB/s
And here is the output when running dd if=/dev/zero of=/mnt/temp/temp.raw bs=1M count=1000 oflag=dsync on a non-VDO volume, and the dd output: 1048576000 bytes (1.0 GB, 1000 MiB) copied, 2.36494 s, 443 MB/s
I also permanently have what I think is high iowait: 2% to 5% during normal operations, and it can go up to 11% when running the previous dd command, so I get the corresponding warning in the glances monitoring interface, or up to 38% when running the fstrim command.
nillin42:
Does your VM sit on top of RAID storage, or networked storage, or anything else
No. The stack is: hard disk --> ext4 --> qcow2 --> vdo --> xfs (sdc is the device)
If you’re dealing with storage that doesn’t support flushes, the issue is probably somewhere else.
So, I understand that there should be no flushes under VDO. Until now I used "none" for the buffer mode and "ignore" for discard. Now I set the buffer mode to "writethrough", discard to "ignore", and zeroes to "off". Output of "sar -d 1":
Much slower than 3 MB/s.
So VDO’s backing storage isn’t a raw disk device directly, it’s a file in ext4, by way of qemu disk emulation, correct? File system overhead might be an issue, especially if “writethrough” translates to “synchronous I/O”. (I don’t know if it does or not.)
Two things I would suggest trying:
- Move the disk image to LVM without a file system, if you’ve got the space
- Try write-back caching instead of write-through.
Either of these, or the combination, might improve things, depending where the slowdown is coming from.
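For the first suggestion, something along these lines would do it (the volume group, LV name, size, and image path here are placeholders, not anything from your setup):
lvcreate -L 25G -n vm_disk vg0
qemu-img convert -O raw /var/lib/libvirt/images/guest.qcow2 /dev/vg0/vm_disk
and then point the VM's disk definition at /dev/vg0/vm_disk instead of the qcow2 file.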
Another thing to check is if the VM is consuming a lot of resources in the host environment
No, the host is basically idle.
or if the host environment reports excessive numbers of interrupts?
How can I see this?
/proc/interrupts has counters per interrupt type and per cpu; rapidly increasing numbers may indicate a lot of interrupts.
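For example, on the host, something like this will highlight which interrupt counters are jumping while the copy runs:
watch -d -n 1 cat /proc/interrupts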
So I changed from the qcow2 file to direct access to a partition, also without an LVM layer on the host: disk partition --> virtio/scsi in the VM --> lvm/vdo --> xfs,
and set the buffer mode to none or directsync. I also checked the interrupts; I don't see a high number. The results do not get better, still between 1 and 3 MB/s.
Maybe I'll also install Rocky on a separate partition to bypass the virtual environment.
So I installed Rocky 8.5 on bare metal. VDO version 6.2.7.17. It really sucks. Bugs over bugs in Red Hat. Incredible.
I used a disk partition as LVM VDO with XFS.
Dedup is disabled. Copying 20 GB of data in a single file takes over 10 minutes. What the fuck.
A test with dd if=/dev/urandom of=/mnt/dedupdev/testimage.img2 bs=4K count=262144
shows 11 sec.
My normal copy command is cp. The source is a ZFS mirror. Why is there such a difference?
fio test for random write with: fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/mnt/dedupdev/random_read_write.fio --bs=4k --iodepth=64 --size=250M --readwrite=randwrite
The compression ratio is also strange: I don't see one. After copying 20 GB of test files (first try), vdostats says:
Did it compress anything?
Also strange is that I don't see any CPU usage during the copy process. I'm done with it. I think VDO only works in special environments that I cannot use at the moment. If you want compression and deduplication, going with btrfs is much easier.
tigerblue77 wrote:
Should I stop using the RAID controller's cache?
It’s worth experimenting. From what I’ve seen with basic write-back caching, the results could go either way. With a RAID 6 configuration, my guess would be the cache may be useful for collecting full or nearly-full stripes together for writing (simplifying checksum calculation), whereas VDO’s write pattern can be a bit more random -- we shuffle things a bit to try to write consecutive blocks together, but close-but-not-consecutive writes aren’t reordered, and VDO doesn’t know anything about the RAID stripe configuration.
XFS, on the other hand, tries very hard to minimize flushes, sometimes doing them only every 30 seconds or so
So, is it a bad idea, from a performance perspective, to use an EXT4 filesystem instead of XFS on VDO volumes?
No, I wouldn’t say that. I just meant that it’s one area where VDO’s performance (or EXT4/XFS on VDO) is likely to look poor compared to file systems on raw storage, because we do so much flushing of buffers. XFS gets away with fairly infrequent flushes (e.g., a couple every 30 seconds), yet XFS-on-VDO will send many flushes to the underlying storage, perhaps several per second.
Without having dug into it deeply yet, I would blithely assert :-) that in the XFS-on-VDO case, we probably shouldn’t need to issue many flushes when we haven’t received a flush from XFS (or whatever’s above us, in the general case). Though it may take some significant work on the code to get it to send fewer flushes, safely.
It may be the case (I haven’t checked) that EXT4 might send more flushes to VDO than XFS, but if it’s not a high rate I doubt it makes much difference.
All that said, I haven’t actually done performance tests of different filesystems atop VDO to see how they fare. If you do, I’d be interested in the results. The tests I've done generally involve writing directly to the VDO device.
Here is the sar -d 1 output while running dd if=/dev/zero of=/mnt/COMPRESSED-DEDUPLICATED-VDO-LV-1/temp.raw bs=1M count=1000 oflag=dsync, and the dd output: 1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.95667 s, 176 MB/s
This (neighborhood of 200MB/s, with a fair bit of variability moment to moment) looks more like what I might expect from VDO, depending on the hardware. Though sending zero blocks is a bit of a cheat, as we don’t store zero blocks except as a special mapping from the logical address. OTOH, if it was a newly created device, the first writes have to populate the block-address-mapping tree, even if you’re writing zero blocks, and we’ve noticed a bit of a performance hit in the tree initialization until it’s fully allocated, at least for the logical-address region your test cares about.
I’ve gotten over 1GB/s write throughput (of nonzero data) in higher-end server configurations, but it took some tweaking of thread counts, CPU affinities, and other stuff.
And here is the output when running dd if=/dev/zero of=/mnt/temp/temp.raw bs=1M count=1000 oflag=dsync on a non-VDO volume, and the dd output: 1048576000 bytes (1.0 GB, 1000 MiB) copied, 2.36494 s, 443 MB/s
So we got about 40% of the raw disk throughput with VDO? Not too bad to start with.
If you want to tune it, I’d look at whether any of VDO’s threads are using a lot of CPU time. For the thread types with thread-count parameters, aside from the “bio” threads, if the CPU usage is over 50%, bump up the thread count to distribute the load better; if it’s below… oh, maybe 20%… try dropping it by one, to reduce the thread switching needed.
- The UDS index handling thread count currently can’t be adjusted.
- The journal and packer threads are always unique, but in some of the high-end configurations I’ve been able to push VDO fast enough for the journal thread to become a CPU bottleneck (near 100% utilization), with the packer thread not far behind. And even without approaching 100% utilization, if utilization gets high, the queue time for tasks becomes significant.
- The “bio” threads are for sending I/O to the storage device, and are expected to block. Adjusting that count up or down might help too, but CPU utilization isn’t a useful guide.
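A quick way to see how busy the individual VDO kernel threads are is something like:
top -b -n 1 -H | grep -i kvdo
(the worker threads are named after the device, along the lines of "kvdo0:cpuQ0", "kvdo0:journalQ", "kvdo0:bioQ0" -- I'm writing those from memory, so check what your system actually shows). If you're using the old standalone "vdo" manager, the thread counts are adjusted with options like --vdoCpuThreads and --vdoLogicalThreads on "vdo modify", if I remember the option names right; with LVM-managed VDO the equivalent knobs live in the LVM VDO settings.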
If your system has a NUMA configuration, there’s a lot of tweaking of CPU affinities to look at too. Actually, it might help in a non-NUMA configuration with many cores, but I haven’t explored that. Cache contention between CPU modules is generally a bigger issue than between cores in the same CPU. (And in tools like “top” it looks like busy threads, because they stay on CPU while waiting to load data.)
Some XFS tweaks may help, too -- since XFS will frequently rewrite its logs and they’re unlikely to usefully deduplicate, you could add a non-VDO volume in your RAID volume group, alongside VDO, to use as an external XFS log.
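A rough sketch of that, only applicable when creating a new XFS file system of course (the VG name is taken from your lsblk output; the LV names are placeholders):
lvcreate -L 2G -n xfs_logdev Ultron-vg
mkfs.xfs -l logdev=/dev/Ultron-vg/xfs_logdev /dev/Ultron-vg/<your-vdo-lv>
mount -o logdev=/dev/Ultron-vg/xfs_logdev /dev/Ultron-vg/<your-vdo-lv> /mnt/<mountpoint>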
nillin42 wrote:
So I installed Rocky 8.5 on bare metal. VDO version 6.2.7.17. It really sucks. Bugs over bugs in Red Hat. Incredible. I used a disk partition as LVM VDO with XFS. Dedup is disabled. Copying 20 GB of data in a single file takes over 10 minutes. What the fuck.
I’m sorry to hear you’re having such problems setting it up.
Writing 20 GB in 10 minutes is about 34 MB/s. That’s not great, but it’s a 10-fold increase over your initial report.
I assume when you say “dedup is disabled” you mean XFS deduplication? Or did you somehow get a configuration with VDO’s deduplication turned off?
I ran a test in a VM running Rocky 8.7 (host environment: ThinkPad T580 laptop running Fedora 35, disk images stored in ext4), created using “vagrant” and this configuration:
Vagrant.configure("2") do |config|
  config.vm.box = "generic/rocky8"
  config.vm.provider "libvirt" do |vb|
    vb.memory = "2048"
    vb.cpus = "2"
    vb.storage :file, :size => "30G", :type => "raw"
  end
  config.vm.provision "shell", inline: <<-SHELL
    dnf update -y
    dnf install -y lvm2 vdo kmod-kvdo vdo-support
  SHELL
end
I don’t know what sorts of bugs you encountered setting it up, but this part was pretty straightforward for me. A reboot of the VM was needed because the update pulled in a new kernel.
After creating an LVM VDO volume in the second disk, creating an XFS file system on it (no extra command-line options), and creating a 17 GiB test file (tar image of /etc and /usr, some of it compressible and some not, replicated 10 times with 3 extra bytes in between so as to shift the contents relative to block boundaries and not create identical runs of blocks), I tried copying it into the XFS file system:
[root@rocky8 /]# time cp /z /mnt/ ; time sync
real 5m54.916s
user 0m0.588s
sys 0m34.146s
real 0m5.620s
user 0m0.003s
sys 0m0.008s
So, about 6 minutes to copy and sync. That’s about 46 MiB/s.
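(For reference, the LVM VDO volume and file system were set up along these lines -- the VG/LV names, device, and sizes here are illustrative rather than my exact commands:)
vgcreate rocky_vg /dev/vdb
lvcreate --type vdo -n vdo_lv -L 28G -V 40G rocky_vg/vdopool0
mkfs.xfs /dev/rocky_vg/vdo_lv
mount /dev/rocky_vg/vdo_lv /mnt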
Now, VDO has an unfortunate extra speed penalty when it’s first created, when the block-address-mapping tree hasn’t been allocated yet, as I mentioned in my reply to tigerblue77. The first time an address in a range is written to (including a zero block but not including discards), one or more blocks in the tree may need to be allocated and written. Currently, VDO serializes every step of this, and it slows those first writes down a bit.
So I tried another test: Without tearing down the VDO device, which now has a bunch of the tree blocks allocated, I unmounted the file system, ran “blkdiscard” on the VDO volume, created a new file system and mounted it. I also created a new test file which shifted the contents of the previous test file by one byte, to prevent trivial deduplication against blocks that had been stored earlier in my test. And I copied the file again:
[root@rocky8 /]# time cp /z2 /mnt/ ; time sync
real 4m45.290s
user 0m0.487s
sys 0m29.586s
real 0m4.900s
user 0m0.002s
sys 0m0.011s
More than a minute faster, at almost 57 MiB/s. Not stellar, but for good performance, as I indicated earlier, I’d want to remove the virtualization layer and file system layer underlying it.
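(In terms of commands, the reset between the two runs was roughly the following, again with illustrative names:)
umount /mnt
blkdiscard /dev/rocky_vg/vdo_lv
mkfs.xfs /dev/rocky_vg/vdo_lv
mount /dev/rocky_vg/vdo_lv /mnt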
A test with dd if=/dev/urandom of=/mnt/dedupdev/testimage.img2 bs=4K count=262144 shows 11 sec. My normal copy command is cp. The source is a ZFS mirror. Why is there such a difference?
I don’t know about dd vs cp, off the top of my head, but copying from /dev/urandom could trip another performance issue. When you read from /dev/urandom, that process cranks through the PRNG code in the kernel to generate your data.
I tried a dd writing directly to the VDO volume (no XFS, but not a freshly created VDO volume, so block map tree already allocated) in my Rocky VM. Doing a dd from /dev/urandom to the VDO volume took just over 30s. Doing a dd from /dev/urandom to a file in the root file system took nearly 30s, and the dd process showed nearly 100% CPU utilization. (For any kernel hackers in the audience, “perf” says nearly all of the CPU cycles are going to _raw_spin_unlock_irqrestore.) Doing a dd from that file to the VDO volume took just under 10s. A copy and sync of a random file to XFS-in-VDO took around the same. For comparison, doing a dd from /dev/urandom to a file in the host environment instead of the VM, but on the same hardware, takes under 3.5s.
Also, when you write to VDO, we make a copy of the data block in order to do compression and deduplication work after the initial acknowledgement of the write operation, and currently that copy is performed in the calling thread. So if the dd thread is both reading from urandom and writing directly to VDO, it incurs a lot of extra CPU overhead that, with a little care, can be distributed instead.
For these reasons, if you want to test with randomly generated data, I suggest writing that data to a file first, especially if you’re using a virtualization environment. With fio’s user-mode PRNG or in a non-virtualized environment, it’s less of an issue, but worth keeping in mind.
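For example, something along these lines keeps the random-data generation out of the timed write (the paths are just placeholders, reusing your mount point):
dd if=/dev/urandom of=/root/random.dat bs=1M count=1000
dd if=/root/random.dat of=/mnt/dedupdev/testimage.img bs=1M oflag=direct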
fio test for random write with:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/mnt/dedupdev/random_read_write.fio --bs=4k --iodepth=64 --size=250M --readwrite=randwrite
Once again I’d wonder about the block map tree allocation issue. XFS allocation patterns might also play into it but I haven’t investigated.
I tried this test (still in my VM) and it kept completing in just a few seconds. So I increased the size 10-fold, to 2500M. The first run took almost 70s, but the next one took under 20s. I deleted the file and tried again: 59s, then 11s. With repeated tests, or writing to a second file, the numbers vary, but the general pattern seems to hold.
The compression ratio is also strange: I don't see one. After copying 20 GB of test files (first try), vdostats says:
Did it compress anything?
If this test also used /dev/urandom, or if fio is generating random data, then no; it should’ve tried, but it would’ve failed because random bytes aren’t compressible. (And VDO doesn’t store the compressed version unless it compresses by a little better than 2:1.) Unless you’re making multiple copies of the same random file, deduplication would’ve also consumed CPU cycles and wound up saving no space. So every block seen would be new and would require a full block to store. It’s not really a good demo of VDO’s capabilities, though of course we should be able to handle it with okay throughput.
For our internal testing, we use a modified version of fio that can be told what sort of duplication pattern to generate in the synthesized data, and we generally specify how compressible the test data should be.
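To see what VDO is actually saving, the quickest check is probably:
vdostats --human-readable
which, if I remember the output correctly, includes a space-savings percentage; with purely random data I'd expect that figure to sit at or near zero.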
Also strange is that I don't see any CPU usage during the copy process.
That sounds amiss. VDO is actually a moderate consumer of CPU cycles, for hashing, index scanning, and compressing, whether deduplication and compression are successful or not. Much of the CPU usage is distributed across a bunch of kernel threads, but overall usage adds up. In my test above, the 2 CPUs given to the test VM ran at a little over 50% utilization.
If you’re not seeing much CPU utilization at all, most likely that means it’s not getting data very fast (e.g., dd spending too much time reading from urandom), or it’s not able to write and flush data to the backing storage very fast and so the whole pipeline backs up (in which case, compressible or duplicated data may get you better performance).
I'm done with it. I think VDO only works in special environments that I cannot use at the moment. If you want compression and deduplication, going with btrfs is much easier.
Our primary target is server type environments with lots of storage and lots of processing power, running directly on real hardware -- file servers, virtualization servers, stuff like that. But it should work on smaller systems too, unless they’re really underpowered. On a really small system, there might not be enough duplication of data for VDO to be worth the overhead. (We really don’t focus much effort on compression-only setups, but we’ve discussed it.)
Most of our team’s lab test machines are either VMs or server systems with lots of RAM and CPU, so a test on “normal” hardware (whatever that means) instead of a server is a little tricky for me to set up.
Oh, I mis-spoke above... in the tests I did a couple years ago, it was btrfs that only flushes its data to disk every 30 seconds, not XFS.
Thanks for setting up a test environment and posting the results. My goal was to test VDO with compression and without deduplication enabled, and to do deduplication offline via XFS instead, because I think VDO's deduplication is similarly resource-intensive to that of ZFS. My target use case is a virtualization host in an HCI setup.
I'm not sure, but you seem to be using VDO with deduplication active. In that case the results are not so bad, if your underlying disk performs at around 100 MB/s.
The dd performance issue you mentioned is not something I see in my test, which is why I asked about the different performance. For the random dd I copy 1 GB in 11 sec = 93 MB/s (nearly the maximum native disk speed).
So something in my test environment is buggy, but I don't know what.