stingle/stingle-photos-android

File deduplication

Intent:
We want to deduplicate files: detect whether the same file (image or video) is already present in the library and, if so, skip importing it again to save space and time. Furthermore, we want deduplication to be global for the user. For that we need to sync deduplication data through the Stingle cloud server, so that a file imported from device A is recognized as a duplicate on device B.

The problem:
The obvious way to organize deduplication is to hash file contents, keep a table of these hashes, and sync that table to the cloud. This is a very bad solution: it doesn't leak the contents of the files, but it leaks serious metadata. Someone who has access to the server can easily hash a well-known photo or video and search the database to see who has that file in their account.
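
To make the leak concrete, here is a minimal sketch of the naive scheme. The `server_table` layout, the user names, and the byte strings are made up for illustration; only the idea (plain SHA-256 of the contents, synced to the server) comes from the description above.

```python
import hashlib


def naive_hash(contents: bytes) -> str:
    """Plain content hash, as in the naive scheme."""
    return hashlib.sha256(contents).hexdigest()


# Hypothetical server-side table: user id -> set of synced file hashes.
server_table = {
    "alice": {naive_hash(b"well-known photo bytes"), naive_hash(b"private photo")},
    "bob": {naive_hash(b"another private photo")},
}


def users_holding(known_file: bytes) -> list:
    """What anyone with database access can do: hash a known file and look it up."""
    target = naive_hash(known_file)
    return sorted(user for user, hashes in server_table.items() if target in hashes)


print(users_holding(b"well-known photo bytes"))  # ['alice']
```

No key or password is needed for the lookup, which is exactly the metadata leak described above.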

Solution:
To solve this, we still need to detect that a particular file is already in your Stingle account and skip it, but we need to do it securely, without leaking any metadata.

How can we achieve this?
We need the user's private key, which is well protected, to be involved in the hashing process. However, we can't use the private key directly, because auto-importing usually happens while the app is locked, and in that state the app doesn't have access to the private key.

Here are the steps:
1. Generate a random 256-bit nonce (N) and keep it on the server. This does not have to be a secret.
2. Create a file hashing nonce (FN) by computing FN = SHA256(PrivateKey + N) and keep it saved locally. Don't send it to the server! It can easily be regenerated on every login, from any new or old device.
3. Before importing the actual file, calculate:
   FileHash = SHA256(SHA256(FileContents) + FN)
4. Search for FileHash in the database. If it is found, don't import the file. If it is not found, import the file, then add this hash to the database and resync the database.
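
The steps above can be sketched roughly as follows. This is only an illustration under a few assumptions that the description does not spell out: `+` is taken to mean byte concatenation, hashes are stored as hex strings, and the database is stood in for by an in-memory set. All function names are made up for the example.

```python
import hashlib
import secrets


def new_nonce() -> bytes:
    """Step 1: random 256-bit nonce N, stored on the server (not secret)."""
    return secrets.token_bytes(32)


def derive_fn(private_key: bytes, nonce: bytes) -> bytes:
    """Step 2: FN = SHA256(PrivateKey + N); kept on the device only."""
    return hashlib.sha256(private_key + nonce).digest()


def file_hash(contents: bytes, fn: bytes) -> str:
    """Step 3: FileHash = SHA256(SHA256(FileContents) + FN)."""
    inner = hashlib.sha256(contents).digest()
    return hashlib.sha256(inner + fn).hexdigest()


def import_if_new(contents: bytes, fn: bytes, known_hashes: set) -> bool:
    """Step 4: skip the file if its hash is known, otherwise record it.

    Returns True when the file counts as newly imported.
    """
    h = file_hash(contents, fn)
    if h in known_hashes:
        return False
    known_hashes.add(h)  # a real app would also import the file and resync here
    return True


# Usage with placeholder values.
private_key = secrets.token_bytes(32)  # stand-in for the user's private key
fn = derive_fn(private_key, new_nonce())
db = set()
print(import_if_new(b"photo bytes", fn, db))  # True
print(import_if_new(b"photo bytes", fn, db))  # False
```

Because FN is derived once at login and stored locally, the locked app can compute FileHash during auto-import without touching the private key. For large videos the inner hash would normally be computed incrementally with `hashlib.sha256().update(...)` rather than by loading the whole file into memory.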

This construction makes it computationally infeasible for someone with access to the server database to interpret, search, or meaningfully manipulate the FileHashes database. To search for a well-known file in the database one has to know FN, which in turn requires the user's private key, so the metadata leak is eliminated.
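
A small sketch of why the lookup attack no longer works: the same file produces unrelated hashes under different private keys, and neither matches the plain SHA-256 an attacker can compute. The key and nonce values are placeholders, and `+` is again assumed to mean byte concatenation.

```python
import hashlib


def keyed_file_hash(contents: bytes, private_key: bytes, nonce: bytes) -> str:
    """FileHash = SHA256(SHA256(FileContents) + FN), with FN = SHA256(PrivateKey + N)."""
    fn = hashlib.sha256(private_key + nonce).digest()
    return hashlib.sha256(hashlib.sha256(contents).digest() + fn).hexdigest()


known_photo = b"well-known photo bytes"
nonce = b"\x01" * 32  # visible to the server, not secret

alice = keyed_file_hash(known_photo, b"alice-private-key", nonce)
bob = keyed_file_hash(known_photo, b"bob-private-key", nonce)
attacker_guess = hashlib.sha256(known_photo).hexdigest()  # all the attacker can compute

print(alice == bob)             # False
print(attacker_guess == alice)  # False
```

Even when the attacker knows both the file and the nonce, the missing private key keeps them from reproducing any user's FileHash.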

Just a quick idea: how about we just create a hash for all images, and when two images have the same hash, one of them is deleted? This way we don't have to worry about privacy, and it also solves image deduplication.

No, plain hashing is not a good idea. It lets anyone with access to the server's database hash known images and search the database to see who has those images.
Please see the updated description of this issue for how we are planning to implement it.