lvmteam/lvm2

Allow bypassing lv_cache_wait_for_clean() when splitting/uncaching with dirty blocks


We set up a 1.5TB SSD write-through cache pool for a ~65TB HDD LV on our production system several years ago, and the most disturbing issue with lvmcache is that after a reboot all chunks (~25 million) in the cache pool become dirty (although in theory this should not happen with a clean shutdown). Since dm-cache is reluctant to migrate while there is any I/O load (only recently did the cleaner policy start ignoring the I/O idle status), if we forget to uncache before a maintenance window, we end up with a very large "fake dirty" cache that can take a seemingly infinite time to flush its dirty blocks. The workaround we have found is to run a script that keeps modifying migration_threshold to force dm-cache to do the job:

sudo lvchange --cachepolicy cleaner example/repo
for i in `seq 1 1500`; do sudo lvchange --cachesettings migration_threshold=2113536 example/repo && \
 sudo lvchange --cachesettings migration_threshold=16384 example/repo && echo $i && sleep 15; done;
# if there are still dirty blocks, keep running the for loop
sudo lvchange --cachepolicy smq example/repo

which is very stupid and takes ~10 hours to finish. And considering that all other I/O operations are blocked while dm-cache is migrating, it amounts to a huge service disruption. Another workaround is to modify the LVM metadata manually and load it with vgcfgrestore -f, which is an error-prone approach.
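For what it's worth, a variant of that loop which polls lvs for the remaining dirty-block count instead of guessing an iteration count might look like the sketch below (untested; it assumes the cache_dirty_blocks reporting field is available in this version of lvs):

sudo lvchange --cachepolicy cleaner example/repo
while true; do
  dirty=$(sudo lvs --noheadings -o cache_dirty_blocks example/repo | tr -d '[:space:]')
  echo "dirty blocks remaining: $dirty"
  [ "$dirty" = "0" ] && break
  # toggling migration_threshold nudges dm-cache into migrating again
  sudo lvchange --cachesettings migration_threshold=2113536 example/repo
  sudo lvchange --cachesettings migration_threshold=16384 example/repo
  sleep 15
done
sudo lvchange --cachepolicy smq example/repo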

Searching for "Flushing blocks for cache" and "writethrough reboot", there are other reports of people hitting similar issues, with a write-through cache that cannot be uncached with the LVM tools after a (dirty) shutdown and that waits on flushing forever:

The LVM2 tools may not be able to resolve this directly -- it is a kernel dm-cache implementation issue (there is a research paper in USENIX ATC '21 that tries to address it, though). However, lvm2 could help if users are confident that the cache was originally write-through and contains no actually dirty blocks, by accepting a new option (or simply --force) to skip the flush when they want to perform an uncache or splitcache operation on the cache.
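For context, the operations in question are the existing lvconvert commands, which currently wait for all dirty blocks to be flushed before detaching the cache (using the same example/repo names as above):

sudo lvconvert --splitcache example/repo   # detach the cache pool but keep it
sudo lvconvert --uncache example/repo      # detach and remove the cache pool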

On a 'clean' shutdown the cache is supposed to remain clean - so if you end up with a dirty cache pool in your case, there is certainly something wrong.

What is the kernel version in use?

BTW - if I'm not mistaken, you should always be able to 'forcibly' uncache.
If that doesn't work - that is the bug to work on.

But let's start with the versions in use (clearly we can't provide fixes for old kernels and old lvm2).

Thank you for your reply. Our production server is currently running Debian 11 (kernel 5.10.0-21-amd64 and lvm2 2.03.11-2.1), and its workload is essentially that of a static file server (nginx + git + rsync).

I don't think the dirty-block issue happens when the volumes are deactivated correctly. From what I remember, the last time the server rebooted the git server processes could not be killed (I/O was too slow and they got stuck in the kernel, maybe?), and systemd finally seemed to trigger a forced reboot. This does not look like an LVM bug, and obviously I cannot verify my hypothesis, as I can't just reboot the server on a whim.

And after we ended up with a very large dirty cache pool, we could not simply perform operations such as uncache as expected with the existing workload and unmodified LVM tools.


I believe those dirty blocks resulted from the unclean shutdown. If you'd rather not wait for cache flushing in this situation, you can deactivate the cache volume (either by using lvchange -an or dmsetup remove). After deactivation, you should be able to uncache the offline volume with lvconvert. Deactivating an unclean-shutdown writethrough cache writes out those "fake dirty" bits to cache metadata. However, as far as I know, it shouldn't block lvm uncache operations if the volume is offline.
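Spelled out, that offline sequence might look something like the sketch below (assuming the cached LV is example/repo and it is not mounted or otherwise in use; lvconvert --splitcache could be used instead if the cache pool should be kept):

sudo lvchange -an example/repo          # deactivate the cached LV
sudo lvconvert --uncache example/repo   # detach and remove the cache pool while the LV is inactive
sudo lvchange -ay example/repo          # reactivate the now uncached LV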

If there is any case that produces dirty blocks after a normal clean shutdown, please let us know.

And thank you for sharing the paper link.


Thank you for sharing this method; I will give it a try the next time we get into this situation (though I still think it would be better if this could be done online).