question about vdo write behavior
zhanglx2018 opened this issue · 5 comments
hi, Gurus
I have a question about the dedup write behavior.
This article says:
" - Once the acknowledgment is complete, an attempt is made to deduplicate the block by computing a MurmurHash-3 signature of the block data, which is sent to the VDO index.
- If the VDO index contains an entry for a block with the same signature, kvdo reads the indicated block and does a byte-by-byte comparison of the two blocks to verify that they are identical.
- If they are indeed identical, then kvdo updates its block map so that the logical block points to the corresponding physical block and releases the allocated physical block."
Why does kvdo need a byte-by-byte comparison? Does that mean kvdo issues a read to the underlying disk to fetch the indicated block, so that every dedup operation costs one extra read request?
Thank you.
The hash is not collision-free, i.e. two blocks can have the same hash without containing the same data. That is why you need to compare the actual data after you find a possible match in the hash metadata.
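To make that concrete, here is a minimal Python sketch of the "look up the signature, then verify the bytes" step. It is not kvdo code: `ToyDedupStore`, `weak_hash`, and the counters are made-up names, and the hash is a deliberately tiny stand-in for MurmurHash-3 so that a collision is easy to produce.

```python
BLOCK_SIZE = 4096  # assumed block size for this illustration


def weak_hash(block: bytes) -> int:
    """Stand-in signature (NOT MurmurHash-3): only 256 possible values."""
    return sum(block) % 256


class ToyDedupStore:
    """Hypothetical model of index advice plus byte-by-byte verification."""

    def __init__(self):
        self.physical = []     # physical block number -> block data
        self.index = {}        # signature -> physical block number (advice)
        self.block_map = {}    # logical block number -> physical block number
        self.verify_reads = 0  # extra reads issued to check advice
        self.stale_advice = 0  # advice that did not match the data

    def write(self, lbn: int, data: bytes) -> int:
        sig = weak_hash(data)
        advised = self.index.get(sig)
        if advised is not None:
            # The index only says "maybe a duplicate", so read the candidate
            # block back and compare every byte.
            self.verify_reads += 1
            if self.physical[advised] == data:
                self.block_map[lbn] = advised  # true duplicate: share it
                return advised
            self.stale_advice += 1  # same signature, different contents
        # No advice, or advice was wrong: keep a new physical block.
        self.physical.append(data)
        pbn = len(self.physical) - 1
        self.index[sig] = pbn
        self.block_map[lbn] = pbn
        return pbn


if __name__ == "__main__":
    store = ToyDedupStore()
    a = bytes([1, 2]) + bytes(BLOCK_SIZE - 2)
    b = bytes([2, 1]) + bytes(BLOCK_SIZE - 2)  # same weak_hash as a, different data
    print(store.write(0, a), store.write(1, a), store.write(2, b))
    print("verify reads:", store.verify_reads, "stale:", store.stale_advice)
```

Without the comparison, the third write would have been mapped onto the first block and silently returned the wrong data on read.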
Hello @zhanglx2018,
@Klaas- is correct. MurmurHash-3 is not a cryptographic hash, and therefore should not be trusted to guarantee that the same hash means that the data contains the same information.
You are also correct in your understanding of the added read operations. For each written block for which the index returns advice, VDO will read the suggested block to confirm whether it is actually a duplicate. When a block does not match the advice provided by the index, the "dedupe advice stale" counter increases.
There are situations where the additional reads are not performed, such as when there are large amounts of duplicate blocks in flight. I will try to explain this, though I may be mangling some of the details. In highly concurrent deduplication situations, VDO will read the physical block once and hold on to a "physical block lock" for that block until all concurrent writes that received advice pointing to that same block have completed. Note that the "physical block lock" can be released once there are no outstanding I/Os referring to that physical block, in which case the block has to be read again for the next write that gets the same advice.
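As a rough illustration of the behavior described above, and only that (this is not how kvdo is actually structured), here is a small Python sketch. `PhysicalBlockLocks`, `acquire`, and `release` are hypothetical names; the idea is simply that the block is read once while at least one writer holds the lock and has to be read again after the last holder lets go.

```python
class PhysicalBlockLocks:
    """Hypothetical model of a shared per-block lock that caches one read."""

    def __init__(self, read_block):
        self._read_block = read_block  # callable: pbn -> bytes (the "disk")
        self._locks = {}               # pbn -> [cached data, holder count]
        self.reads = 0                 # how many reads reached the disk

    def acquire(self, pbn: int) -> bytes:
        entry = self._locks.get(pbn)
        if entry is None:
            # First writer interested in this block: do the one real read.
            self.reads += 1
            entry = [self._read_block(pbn), 0]
            self._locks[pbn] = entry
        entry[1] += 1
        return entry[0]

    def release(self, pbn: int) -> None:
        entry = self._locks[pbn]
        entry[1] -= 1
        if entry[1] == 0:
            # No outstanding I/O refers to the block: drop the lock and the
            # cached data, so the next writer has to read again.
            del self._locks[pbn]


if __name__ == "__main__":
    disk = {7: b"x" * 4096}
    locks = PhysicalBlockLocks(disk.__getitem__)
    for _ in range(3):          # three concurrent writes with the same advice
        locks.acquire(7)
    print("reads while lock is held:", locks.reads)
    for _ in range(3):
        locks.release(7)
    locks.acquire(7)            # a later write after the lock was released
    print("reads after release:", locks.reads)
```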
Hopefully this helps. If you need further clarification, please feel free to ask.
@rhawalsh but even with a cryptographic hash you'd need to check the real data, right? Hashes in general have collisions because that's what a hash does -- otherwise it would be compressing and not hashing. Cryptographic in the hash sense just means they are hard to reverse if I remember CS classes correctly.