Composite identifiers

Question

Composite identifiers

kirillt opened this issue 3 years ago · 0 comments

Right now, during indexing ids are assigned to each resource (initially known only by path). Tags, in its turn, are assigned to resources using these ids in order to be path-independent. We can't use random ids (like UUID) since for any resource it's id must be computable on other devices (see content-addressing).

Unfortunately, cryptographic hash functions consume significantly more resources than CRC32, which we are using at the moment due to performance reasons. Using weak hash functions can lead to potential conflicts of identifiers (situation when several resources have the same id). Probability of it is open question, but in practice I suspect that several resources out of 5K+ files collection do conflict by id.

These collisions aren't harmful so far. Even in case of destructive operations being applied to an id which has several resources attached to it, only one of them is deleted (the resource which is visible to the user, see #140). This state of things can change in future.

One of potential solutions would be usage of composite identifiers: a stack of functions, where every next function is used when previous function produces a conflict in some given collection. This way, we could use our CRC32 for fast indexing and, in case of detected conflicts, falling back to a stronger function like SHA256 or MD5 applying it only for particular resources involved in the conflict.

Example:

Resource A has id crc32: 1.
Resource B has id crc32: 2.
Resource C has id crc32: 1.

Now we index A and B additionally with SHA-256.
Resource C isn't needed to re-index here (but after implementation of #23 we might put such a task into the queue).

Result:

Resource A has id crc32: 1, sha256: 17.
Resource B has id crc32: 2, sha256: 5.
Resource C has id crc32: 1.

Maybe we can replace old id with the newer one, but for this we need to update tags storage with newer id.

If such feature is possible, we can go further and also significantly increase indexing speed by using file size as first (the weakest) hash function in the stack and calculating CRC-32 only for resources of the same size.