Enhancement Possibility: MD5 check before upload for dedupe
Hi folks,
iRODS does not offer a dedupe facility; however, it does allow a file to be looked up by its md5sum.
I'm not sure how much use it would be, but before uploading a file you could check that its md5 isn't already present in the zone you are uploading to.
~/code/LSF/ansible$ ils -L jc180.5G
jc18 0 wtsiusers 536870912 2020-07-17.11:08 & jc180.5G
aa559b4e3523a6c931f08f4df52d58f2 generic /data/home/jc18/jc180.5G
$ iquest "%s/%s" "SELECT COLL_NAME, DATA_NAME WHERE DATA_CHECKSUM = 'aa559b4e3523a6c931f08f4df52d58f2'"
/Sanger1/home/jc18/jc180.5G
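The check above could be scripted roughly as follows. This is only a sketch, not something ibackup does: the function names are hypothetical, it assumes `md5sum` is available locally and that the zone stores plain MD5 hex checksums as in the `ils -L` output above, and it assumes `iquest` reports an empty result with a `CAT_NO_ROWS_FOUND` message.

```shell
#!/bin/sh
# Sketch: decide whether a local file needs uploading by looking for its MD5
# among the checksums already registered in the iRODS zone.
# All function names here are hypothetical helpers, not part of iRODS or ibackup.

# Print the MD5 hex digest of a local file (assumes GNU md5sum).
local_md5() {
    md5sum "$1" | cut -d ' ' -f 1
}

# Build the GenQuery string used in the example above for a given checksum.
checksum_query() {
    printf "SELECT COLL_NAME, DATA_NAME WHERE DATA_CHECKSUM = '%s'" "$1"
}

# Print the iRODS paths whose checksum matches, one per line.
# Prints nothing when there is no match (assumption: iquest signals this
# with a CAT_NO_ROWS_FOUND message, which is filtered out here).
find_by_md5() {
    iquest "%s/%s" "$(checksum_query "$1")" 2>/dev/null \
        | grep -v 'CAT_NO_ROWS_FOUND' || true
}

# Succeed (exit status 0) only when no object in the zone has the file's MD5.
needs_upload() {
    [ -z "$(find_by_md5 "$(local_md5 "$1")")" ]
}
```

With these defined, something like `needs_upload jc180.5G && iput -K jc180.5G` would upload only when no match is found. The match only reflects the catalogue at the moment of the query, so it says nothing about whether the matching object will still exist later.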
Thanks. We wouldn't be able to rely on that: we're not in control of other people's files, and we can't stop them deleting a file that our backup system relied on when it decided not to upload something.
We could in theory dedup everything that a particular ibackup server backs up if we had it store the actual file data in a special md5s collection, and then "linked" the user's desired iRODS location to the md5s location via a metadata pointer.
But there is strong pushback against this kind of indirection: users want their desired path to hold their actual data, which they can simply iget.
So dedup would have to be implemented in iRODS itself, not as a layer added by us.
That said, we do handle hardlinks (and only hardlinks) using an indirection, just to cover that edge case.