digital-preservation/csv-validator

"No Checksum" different from DROID

EveWright opened this issue · 4 comments

When running the checksum expression (MD5) against DROID outputs, I have found that DROID generates the checksum ‘d41d8cd98f00b204e9800998ecf8427e’ whereas CSV Validator generates “No Checksum” and therefore fails. I understand this checksum is that of an empty string, so both DROID and CSV Validator are telling me the same thing, but because this displays as a fail, I have to remove these ‘No Checksum’ files from my CSV file before using CSV Validator for integrity checking.

In a sense, this is a good thing, as it highlights ‘blank’ files in our collections that require review, but I would find it more straightforward if DROID and CSV Validator displayed the same output for this type of file.

Normally you'd only get no checksum if CSV Validator is looking at a folder rather than a file. For a zero-byte file, I would expect to see the checksum for an empty string returned (the same as DROID). Do you have any example files you could share?

Hi David, apologies for the significant delay in getting back to you. You can find some example files attached.
Blank Files.zip

Curious, I am seeing the same behaviour, though I'm sure in the past CSV Validator returned the defined checksum for an empty string in these circumstances (I've built such checks into CSV Schema in the past). I wondered if it was somehow down to using MD5 rather than SHA-256 but the same behaviour occurs for a SHA-256 checksum test as well.
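For reference, the expected empty-input digests can be reproduced outside either tool. This is a minimal Python sketch using only the standard library; it is not how DROID or CSV Validator compute checksums, and the helper name `digest_of` is made up for illustration.

```python
import hashlib


def digest_of(data: bytes, algorithm: str) -> str:
    """Return the hex digest of `data` for the named algorithm (hypothetical helper)."""
    return hashlib.new(algorithm, data).hexdigest()


# A zero-byte file has no content, so its digest is the digest of b"".
EMPTY_MD5 = digest_of(b"", "md5")
EMPTY_SHA256 = digest_of(b"", "sha256")

print(EMPTY_MD5)     # d41d8cd98f00b204e9800998ecf8427e
print(EMPTY_SHA256)  # e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```

Any zero-byte file should hash to these values whatever its name or location, which is why the MD5 value above appears in the DROID output for every blank file.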

This does seem to be a bug, but I suspect it's happening in the underlying library we are using, so it may be harder to change the behaviour. In the short term, note that you will always get the checksum d41d8cd98f00b204e9800998ecf8427e when MD5 is passed an empty string (see https://en.wikipedia.org/wiki/MD5#MD5_hashes), so rather than deleting the rows from your CSV file you could find and replace d41d8cd98f00b204e9800998ecf8427e with NO CHECKSUM. Then at least any other checks you are doing on the metadata for the lines relating to those files would still take place.
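The find-and-replace workaround could be scripted along these lines. This is only a sketch under assumptions: the function name `replace_empty_md5` is hypothetical, and it replaces any cell equal to the empty-input MD5 rather than relying on a particular DROID column name. The placeholder is a parameter because it should match exactly the text CSV Validator reports for these files.

```python
import csv
import io

# MD5 of zero bytes of input, as quoted in this thread.
EMPTY_MD5 = "d41d8cd98f00b204e9800998ecf8427e"


def replace_empty_md5(csv_text: str, placeholder: str = "NO CHECKSUM") -> str:
    """Return csv_text with every cell equal to the empty-input MD5 replaced.

    Hypothetical helper: rows are kept, so other metadata checks on them
    can still run.
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [placeholder if cell.strip().lower() == EMPTY_MD5 else cell
             for cell in row]
        )
    return out.getvalue()


# Illustrative input only; the column names here are assumptions.
sample = (
    "NAME,SIZE,MD5_HASH\n"
    "blank.txt,0,d41d8cd98f00b204e9800998ecf8427e\n"
)
print(replace_empty_md5(sample))
```

Note that `csv.writer` with default settings only quotes fields that need it, so if the original export quotes every field and the schema depends on that, passing `quoting=csv.QUOTE_ALL` to the writer would be worth considering.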

That's really helpful thanks David