abbbi/virtnbdbackup

LZ4F_decompress failed with code ERROR_blockChecksum_invalid

sbrunsch2024 opened this issue · 15 comments

Version used
Provide output of virtnbdbackup -V

1.9.38

Describe the bug
The restore fails with an LZ4 checksum error.

Expected behavior
Successful restore

Hypervisor information:

  • OS: Debian
  • HV type: plain libvirt (KVM)

virtnbdrestore -i /vmbackup/routervm/202309 -o /kvm/storage -N routervm

ERROR root virtnbdrestore - restoreData [MainThread]: LZ4F_decompress failed with code: ERROR_blockChecksum_invalid
Traceback (most recent call last):
  File "/usr/bin/virtnbdrestore", line 142, in restoreData
    written = chunk.read(
              ^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/libvirtnbdbackup/chunk.py", line 83, in read
    data = lz4.decompressFrame(reader.read(blocklen))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/libvirtnbdbackup/lz4.py", line 27, in decompressFrame
    return lz4.frame.decompress(data)

No workaround

abbbi commented

Hi, this looks like a corrupt block in the data file. If compression was enabled during backup, the lz4 layer also stores a checksum alongside each compressed block, and it appears this checksum doesn't match anymore. Any notable issues on the filesystem the data is stored on? Is it reproducible with other backups?
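For illustration, a minimal sketch (not virtnbdbackup's actual code) of what that lz4 layer does, assuming a recent python-lz4: flipping a single byte in a frame that was compressed with block checksums makes decompression fail with exactly this kind of error.

import lz4.frame

data = b"some repetitive payload " * 4096  # ~96 KiB of compressible data

# Compress with both checksum types enabled, as a compressed backup would.
frame = lz4.frame.compress(data, content_checksum=True, block_checksum=True)

# Simulate bit rot: flip one byte in the middle of the frame.
corrupt = bytearray(frame)
corrupt[len(corrupt) // 2] ^= 0xFF

assert lz4.frame.decompress(frame) == data  # the intact frame restores fine
try:
    lz4.frame.decompress(bytes(corrupt))
except RuntimeError as err:
    # e.g. "LZ4F_decompress failed with code: ERROR_blockChecksum_invalid"
    # (depending on which byte is hit, lz4 may report a different error)
    print(err)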

sbrunsch2024 commented

I had two VMs with this error. I made a fresh backup of both and still got the error on one remaining VM (the biggest one).
I am now trying a new backup without compression, but 300 GB of data takes some time to back up and restore.

The target is an NFS server running TrueNAS Scale. The message files on both the server and the NAS look clean.

Lesson learnt: I think I will do regular restore tests.

abbbi commented

Is it reproducible if you backup/restore to local storage on the hypervisor host? Granted, I don't use compression all that often .. at least not with VMs this big. The checksum is calculated by the lz4 module automatically.

One reason could be that the complete frame read during restore is wrong, but that would end in a different error.

A restore log with verbose logging might show more details, but it's more likely the data is being corrupted somehow.

Maybe there is a way to disable the check during decompression, to see if it is able to finish decompressing even if the checksum is wrong. I need to check the lz4 module's docs on that.

abbbi commented

I just attempted to reproduce this with a backup I did. After the backup I replaced one byte within the backup file and attempted a restore:

ERROR root virtnbdrestore - restoreData [MainThread]: LZ4F_decompress failed with code: ERROR_blockChecksum_invalid

but there seems to be no way to skip the checksum validation (which is probably for the better).

Also, in my case the frame information for the lz4 block is shown just before decompression:

[2023-09-04 21:12:46] DEBUG root lz4 - decompressFrame [MainThread]: Compressed Frame: {'block_size': 65536, 'block_size_id': 4, 'block_linked': True, 'content_checksum': True, 'block_checksum': True, 'skippable': False, 'content_size': 131072}
[2023-09-04 21:12:46] ERROR root virtnbdrestore - restoreData [MainThread]: LZ4F_decompress failed with code: ERROR_blockChecksum_invalid

which means the data read from the backup file is a correct frame, but its data contents don't match the checksum anymore. From the lz4 docs:

content_checksum (bool) – Specifies whether to enable checksumming of the uncompressed content. If True, a checksum is stored at the end of the frame, and checked during decompression. Default is False.

block_checksum (bool) – Specifies whether to enable checksumming of the uncompressed content of each block. If True, a checksum of the uncompressed data in each block in the frame is stored at the end of each block. If present, these checksums will be used to validate the data during decompression.

Both options are enabled by default during backup.
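As a side note, the frame header fields from the debug output above can be read back directly with python-lz4's get_frame_info; a minimal sketch:

import lz4.frame

frame = lz4.frame.compress(b"x" * 131072, content_checksum=True, block_checksum=True)
print(lz4.frame.get_frame_info(frame))
# prints a dict with the same fields as the debug line above, e.g.
# {'block_size': 65536, ..., 'content_checksum': True, 'block_checksum': True, ...}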

No messages in dmesg or the like where transfer issues via NFS might be the reason?

sbrunsch2024 commented

I made a backup to a different NFS server and I could restore it.
I am just doing the restore now, but I have to back up my Nextcloud data to be safe that I have a working backup, so backup and restore jobs are now running simultaneously.

Two different servers had the same problem...

Network errors? Filesystem errors?
I will schedule a zfs scrub.

abbbi commented

I've committed a new feature to the "checksum" branch: if a backup to a regular directory is attempted, it now computes an adler32 checksum for the complete data file contents and reports it via the logfile:

[2023-09-05 09:54:28] INFO output target - checksum [vda]: Checksum for file: [/tmp/chksum/vda.full.data.partial]:[3486281336]
[2023-09-05 09:54:35] INFO output target - checksum [vda]: Checksum for file: [/tmp/chksum/vda.inc.virtnbdbackup.1.data.partial]:[1914456873]

Then there is a verify option that allows verifying that checksum:

./virtnbdrestore -i /tmp/chksum/ -o verify
[2023-09-05 09:54:48] INFO lib common - printVersion [MainThread]: Version: 1.9.39 Arguments: ./virtnbdrestore -i /tmp/chksum/ -o verify
[2023-09-05 09:54:48] INFO root virtnbdrestore - verify [MainThread]: Computing checksum for: /tmp/chksum/vda.full.data
[2023-09-05 09:54:50] INFO root virtnbdrestore - verify [MainThread]: Checksum: 3486281336
[2023-09-05 09:54:50] INFO root virtnbdrestore - verify [MainThread]: Computing checksum for: /tmp/chksum/vda.inc.virtnbdbackup.1.data
[2023-09-05 09:54:50] INFO root virtnbdrestore - verify [MainThread]: Checksum: 1914456873

So altered files can be identified more easily, no matter whether compression is used or not.
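The idea is easy to sketch; assuming Python's zlib (the helper name file_checksum is made up here, this is not the tool's actual code), an adler32 sum is computed incrementally while reading the file in chunks, and the chunk size is what the -B option mentioned further down tunes:

import io
import zlib

def file_checksum(path: str, buffer_size: int = io.DEFAULT_BUFFER_SIZE) -> int:
    """Compute a streaming adler32 checksum over a file's contents."""
    csum = 1  # adler32 starting value
    with open(path, "rb") as fh:
        while chunk := fh.read(buffer_size):
            csum = zlib.adler32(chunk, csum)
    return csum

# print(file_checksum("/tmp/chksum/vda.full.data"))  # e.g. 3486281336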

sbrunsch2024 commented

In this case it seems to be a problem with the storage; TrueNAS found some failures in the zpool.
But the checksum feature would be very nice, so restore tests don't have to be done so often...

abbbi commented

Thanks for the update! Closing this one.

abbbi commented

@sbrunsch2024 latest master now contains this feature, see:

https://github.com/abbbi/virtnbdbackup#verifying-created-backups

Of course, this only works with new backups. Could you test? Thanks.

abbbi commented

> Last questions before waiting for a reaction… How do I modify a bit of a backup file? dd?

hexeditor or dd
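For reference, the same single-byte modification can also be done from Python; path and offset below are arbitrary placeholders:

path, offset = "/tmp/BACKUP/vda.copy.data", 4096

# Flip one byte of the file in place (equivalent to a hexeditor/dd edit).
with open(path, "r+b") as fh:
    fh.seek(offset)
    byte = fh.read(1)
    fh.seek(offset)
    fh.write(bytes([byte[0] ^ 0xFF]))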

> Is there a way to speed up checksum creation? It takes about 3 minutes for 30 GB on rotating disks.

You mean verify? It uses 4k blocks to compute the checksum during verify, which might be a bottleneck.
I just added option -B to be able to specify a bigger buffer size.
The default is now io.DEFAULT_BUFFER_SIZE (8k) but it can be changed to suit the environment:

virtnbdrestore -i /tmp/BACKUP -o verify -B 8192 -v
DEBUG root virtnbdrestore - verify [MainThread]: Using buffer size: 8192
[..]
virtnbdrestore -i /tmp/BACKUP -o verify -B 16384 -v
DEBUG root virtnbdrestore - verify [MainThread]: Using buffer size: 16384

abbbi commented

Or simply append something to the backup file:

# echo 1 >> /tmp/BACKUP/vda.copy.data
# ./virtnbdrestore -i /tmp/BACKUP -o verify -B 16384 -v
[2023-09-06 08:43:16] INFO lib common - printVersion [MainThread]: Version: 1.9.41 Arguments: ./virtnbdrestore -i /tmp/BACKUP -o verify -B 16384 -v
[2023-09-06 08:43:16] DEBUG lib common - getLatest [MainThread]: Sorted data files: 
['/tmp/BACKUP/vda.copy.data']
[2023-09-06 08:43:16] DEBUG root virtnbdrestore - verify [MainThread]: Using buffer size: 16384
[2023-09-06 08:43:16] INFO root virtnbdrestore - verify [MainThread]: Computing checksum for: /tmp/BACKUP/vda.copy.data
[2023-09-06 08:43:17] INFO root virtnbdrestore - verify [MainThread]: Checksum result: 4208209834
[2023-09-06 08:43:17] INFO root virtnbdrestore - verify [MainThread]: Comparing checksum with stored information
[2023-09-06 08:43:17] ERROR root virtnbdrestore - verify [MainThread]: Stored sums do not match: [2609524591]!=[4208209834]