If the registry_cache files are corrupted overlaybd fails to load.
simha-db opened this issue · 8 comments
What happened in your environment?
It looks like overlaybd fails to create the image if the registry_cache contains corrupt files.
Logs below:
2023/07/13 07:02:37|ERROR|th=00007FE4F37F9F80|/src/src/overlaybd/lsmt/file.cpp:1159|verify_ht:header magic/type don't match
2023/07/13 07:02:37|ERROR|th=00007FE4F37F9F80|/src/src/overlaybd/lsmt/file.cpp:1553|do_parallel_load_index:failed to load index from 32-th file
2023/07/13 07:02:37|ERROR|th=00007FE5D1B53B00|/src/src/overlaybd/lsmt/file.cpp:1581|load_merge_index:load index failed.
2023/07/13 07:02:37|ERROR|th=00007FE5D1B53B00|/src/src/image_file.cpp:346|open_lowers:LSMT::open_files_ro(files, 76, 1) return NULL
2023/07/13 07:02:37|ERROR|th=00007FE5D1B53B00|/src/src/image_file.cpp:470|init_image_file:open lower layer failed.
2023/07/13 07:02:37|ERROR|th=00007FE5D1B53B00|/src/src/main.cpp:302|dev_open:create image file failed
2023/07/13 07:02:37|ERROR|th=00007FE5D1B53B00|/src/build/_deps/tcmu-src/libtcmu.cpp:605|device_add:handler open failed for uio3
Looks like it would be good to have logic to delete the file and retry?
What did you expect to happen?
Recover by deleting the file and redownloading.
How can we reproduce it?
- Start a container on overlaybd.
- Corrupt the registry_cache files by writing junk to them.
- Observe that the container doesn't start anymore.
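For the second step, here is a minimal Python sketch of one way to write junk into a cached file. The helper name `corrupt_file` and the path in the usage comment are placeholders of mine, not anything shipped with overlaybd; the actual location of the registry_cache files depends on your overlaybd configuration.

```python
import os


def corrupt_file(path: str, offset: int = 0, length: int = 4096) -> int:
    """Overwrite up to `length` bytes at `offset` with random junk, in place.

    Returns the number of bytes overwritten. The file size is left unchanged.
    """
    size = os.path.getsize(path)
    if offset >= size:
        return 0  # nothing to overwrite past the end of the file
    length = min(length, size - offset)
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(os.urandom(length))
    return length


# Hypothetical usage; point this at a file under your registry_cache directory:
# corrupt_file("/path/to/registry_cache/<cached-file>", offset=0, length=4096)
```

Overwriting the start of the file (offset 0) damages the header, which is presumably what triggers the "header magic/type don't match" error in the logs above.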
What is the version of your Overlaybd?
0.5.3-1.
What is your OS environment?
Ubuntu 20.04
Are you willing to submit PRs to fix it?
- Yes, I am willing to fix it.
@simha-db So far, only the zfile level implements a mechanism to evict corrupted data, and it covers two cases. First, when opening a zfile, if an error occurs while loading the zfile jump table, the whole file is evicted. Second, when CRC verification fails, the failed block is evicted. Of course, that doesn't cover every scenario; any enhancements and fixes are welcome.
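To illustrate the block-level case (evict only the block whose CRC check fails, then fetch it again), here is a toy Python model. It is not overlaybd's implementation or file format: `BlockCache` and the `fetch` callback are hypothetical names, and the checksum is kept in memory next to the payload purely for illustration.

```python
import zlib


class BlockCache:
    """Toy cache mapping block_id -> (payload, crc32 recorded when stored)."""

    def __init__(self):
        self._blocks = {}

    def put(self, block_id, payload):
        self._blocks[block_id] = (payload, zlib.crc32(payload))

    def read(self, block_id, fetch):
        """Return a verified block, evicting and refetching on CRC mismatch.

        `fetch(block_id)` stands in for downloading the block from the registry.
        """
        entry = self._blocks.get(block_id)
        if entry is not None:
            payload, crc = entry
            if zlib.crc32(payload) == crc:
                return payload
            del self._blocks[block_id]  # evict only the failed block
        payload = fetch(block_id)
        self.put(block_id, payload)
        return payload
```

The point of the sketch is the granularity: a single bad block costs one refetch, while the rest of the cached data stays in place.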
Ah ok, so the validation done is the same as what I would get if I run overlaybd-zfile --verify -t -x layerfile?
@liulanzheng @BigVan if any block of the Overlaybd blob is corrupted, will overlaybd-zfile --verify -t -x layerfile be able to identify the corruption?
Also @BigVan do we know how fast the validation of overlaybd-zfile --verify is? Is it faster than CRC32?
> Also @BigVan do we know how fast the validation of overlaybd-zfile --verify is? Is it faster than CRC32?
No, it just reads the whole file and checks the crc32 of each compressed block.
> @liulanzheng @BigVan if any block of the Overlaybd blob is corrupted, will overlaybd-zfile --verify -t -x layerfile be able to identify the corruption?
'overlaybd-zfile' will print the block_id of any block that does not match its checksum, like:
2023/07/25 15:27:09|ERROR|th=0000562F26E92DE0|/root/work/dadi/overlaybd/src/overlaybd/zfile/zfile.cpp:934|zfile_validation_check:crc check error in block 132240
and exit with a non-zero code.
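A caller could act on both signals along these lines. This is only a sketch: the helper names are hypothetical, the command line is the one quoted earlier in this thread, and I am assuming the log line may appear on either stdout or stderr, so both are scanned.

```python
import re
import subprocess

# Matches the message format shown in the log line above.
_CRC_ERROR = re.compile(r"crc check error in block (\d+)")


def corrupted_blocks(log_text):
    """Extract the block ids reported as failing the CRC check."""
    return [int(block_id) for block_id in _CRC_ERROR.findall(log_text)]


def verify_layer(layerfile, binary="overlaybd-zfile"):
    """Run the verification command; return (ok, corrupted_block_ids).

    Assumption: the tool may log to stdout or stderr, so both are captured.
    """
    proc = subprocess.run(
        [binary, "--verify", "-t", "-x", layerfile],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, corrupted_blocks(proc.stdout + proc.stderr)
```

The exit code is the more robust signal of the two; parsing the message is only useful if you want to know which block failed, and it would break if the log wording changes.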
Thanks @BigVan for the quick reply. Do we know how long the verification would take for a 10GB layer (or some other layer size)? Also, is it possible to verify these blocks with multiple threads?
@BigVan could you clarify the following scenarios?
Some background context: we download Overlaybd image blob layers by chunks concurrently and put them into registry_cache.
1. If some part of a downloaded layer blob is corrupted, but the corrupted data is not needed when the process starts, will the process still be able to start up correctly?
2. Assuming 1. is true, if the process reads the corrupted data during application execution, will this cause unexpected application errors?
3. If we call zfile verify after a blob is downloaded, will it be able to detect the corruption and return an error?
Our goal is to ensure that we can call zfile verify to identify any corruption in a blob and fail at container create time, instead of hitting unexpected application errors later.
1. The corrupted data will not affect container startup.
2. overlaybd will try to evict the corrupted chunk and download that part of the data from the registry again.
3. Yes. As I mentioned before, overlaybd-zfile --verify will print the id of the corrupted block and exit with a non-zero code.
#239 (comment)
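For the stated goal of failing at container create time, a gate over all downloaded layers could look roughly like the sketch below. The function names are hypothetical and the command line is the one used in this thread. Note that this only runs one verification per layer file concurrently; whether the tool can verify blocks of a single file with multiple threads is not answered here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def zfile_verify(layerfile):
    """True if `overlaybd-zfile --verify -t -x <layerfile>` exits with code 0."""
    cmd = ["overlaybd-zfile", "--verify", "-t", "-x", layerfile]
    return subprocess.run(cmd).returncode == 0


def failed_layers(layerfiles, verify=zfile_verify, max_workers=4):
    """Verify layers concurrently (one file per worker); return those that failed."""
    layerfiles = list(layerfiles)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(verify, layerfiles))
    return [path for path, ok in zip(layerfiles, results) if not ok]


# Hypothetical usage before creating the container:
# bad = failed_layers(["/path/to/layer0", "/path/to/layer1"])
# if bad:
#     raise RuntimeError(f"corrupted layer blobs: {bad}")
```

Threads are enough here because each worker just waits on a child process; the `verify` parameter is injectable so the gate can be exercised without the binary installed.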