zkemail/archive.prove.email

Idea: Register all failed searches and later guess selectors from a list

Closed this issue · 19 comments

When a user searches for a domain which is not present in the database,
then add that domain to a special list.
We can then run a recurring batch job to brute-force guess selectors in the same way as #66 and #69
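For the batch job itself, the guessing loop could be as simple as the following sketch, assuming the dnspython package; the selector list and function name are illustrative, not the real ones:

```python
# Rough sketch of the recurring batch job (assumes dnspython;
# COMMON_SELECTORS is an illustrative list, not the real one).
import dns.resolver

COMMON_SELECTORS = ["default", "google", "selector1", "selector2", "s1", "k1"]

def guess_selectors(domain: str) -> list[str]:
    """Return the candidate selectors that have a TXT record
    at <selector>._domainkey.<domain> that looks like a DKIM key."""
    found = []
    for selector in COMMON_SELECTORS:
        name = f"{selector}._domainkey.{domain}"
        try:
            answers = dns.resolver.resolve(name, "TXT")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer, dns.resolver.NoNameservers):
            continue
        txt = b"".join(b"".join(r.strings) for r in answers)
        if b"p=" in txt:  # loose check for a DKIM key record
            found.append(selector)
    return found
```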

Can we also use the client-side browser to help do this? Or can we do this in real time and show a loading bar to the user and also stream the found results to the front end?

Both may be possible. I don't know if a DNS lookup can be done in the browser, but I'll investigate and see what options there are.

Best approach here is to keep track of whether there's another email with the same selector and run our RSA GCD script on it.

Three ideas in one issue. So I just want to check which parts we should implement. I tried to break it down and provide a rough time estimate below, so we can weigh cost versus benefit.

1:

When a user searches for a domain which is not present in the database, then add that domain to a special list. We can then run a recurring batch job to brute-force guess selectors in the same way as #66 and #69.

This is the fastest to implement, but it doesn't add any new UX.
Approx amount of work: a few days

2:

Can we also use the client-side browser to help do this? Or can we do this in real time and show a loading bar to the user and also stream the found results to the front end?

It's not possible to do a DNS lookup directly from the browser, but as you say, we can do it on the server, show a loading bar to the user, and then stream the results (a rough sketch of the streaming part follows below).
Approx amount of work: >~1 week
(update: created a new issue for this: #90)
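For the record, here is a rough sketch of how the streaming part could look with server-sent events. Flask and the endpoint are just illustrative, not our actual stack, and guess_selectors_iter is a hypothetical generator yielding selectors as they are found:

```python
# Illustrative only: stream selectors to the frontend as they are found,
# via server-sent events, so the frontend can show a loading bar plus results.
import json
from flask import Flask, Response

app = Flask(__name__)

@app.route("/api/guess/<domain>")
def guess(domain: str):
    def events():
        for selector in guess_selectors_iter(domain):  # hypothetical generator
            yield f"data: {json.dumps({'selector': selector})}\n\n"
        yield 'data: {"done": true}\n\n'
    return Response(events(), mimetype="text/event-stream")
```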

3:

Best approach here is to keep track of whether there's another email with the same selector and run our RSA GCD script on it.

For this, I think we would need to run the RSA GCD (sketched below) on some dedicated machine. The calculation takes 10-40 seconds, so it's just about too slow to do on-the-fly and return the result to the frontend. We would also need to keep the plaintext data of all the email headers on this machine.
Approx amount of work: >~1 week
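For context, the core of the GCD trick behind sigs2rsa is roughly the following sketch, which assumes public exponent e = 65537 and that each padded hash is the full PKCS#1 v1.5 encoding of the header hash:

```python
# Sketch of the RSA-GCD key recovery idea (the real implementation is
# https://github.com/foolo/sigs2rsa); assumes e = 65537 and that each
# padded_hash is the full PKCS#1 v1.5 encoding of the header hash.
from gmpy2 import mpz, gcd

E = 65537

def recover_modulus(sig1: int, padded_hash1: int, sig2: int, padded_hash2: int) -> int:
    # For a valid RSA signature, sig^e = padded_hash (mod N),
    # so N divides sig^e - padded_hash for both messages.
    a = mpz(sig1) ** E - padded_hash1
    b = mpz(sig2) ** E - padded_hash2
    # gcd(a, b) is N itself, possibly times a small spurious factor;
    # computing it over these huge integers is what takes tens of seconds.
    return int(gcd(a, b))
```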

We would also need to keep the plaintext data of all the email headers on this machine.

I'd prefer not to -- better if we could just keep the domain, selector, timestamp (to the nearest day), header hash, and DKIM signature, so that we have no personally identifiable information but still enough to reconstruct the public key. We should then use the closest timestamps before and after for the same selector when running GCD.

We should also do this pairwise search over consecutive emails in a user's Gmail when doing the Gmail selector scrape.

on-the-fly

We can maybe use Modal again, queried from the server, or AWS Lambda, to run such one-off query jobs.

But the timeline seems fine to me. Can you also add a time estimate for adding the pairwise GCD check as a parallel background task during the Gmail scrapes, plus as an optional flag during the PST/MBOX scrapes?

Best approach here is to keep track of whether there's another email with the same selector and run our RSA GCD script on it.

Hmm, I just realized that I don't understand what you mean by either "another email" or "the same selector" 😀
In this case, the user just searches for a domain in the search box, a domain for which we don't have any existing selectors. So there is no email message to check against, and no selector either. I'm very confused!

Yeah sorry, this is like issue part 4, where we do this check during email upload to get old public keys and store the sig data, not during site query.

Okay, but should we still do the original issue, i.e. "Register all failed searches and later guess selectors from a list"? Or is this now a completely different task?

Yeah good point, it's a different task. I guess I mention it here since both that and issue 3 need a way of storing old DKIM signatures and hashes indefinitely. Can you make a high-priority issue for that 4th issue instead, then?

Okay, thanks for clarifying! But I'm afraid I still haven't understood which of the parts 1, 2, 3 (and 4?) should be done, and which should not.

Thoughts on axing 1 and 3 and doing 2 and 4? You'll effectively get 1 via 2 through the queuing mechanism, and 3 doesn't make sense to do if we have 4. I think it would be cool to get a few lines on how you're thinking of implementing 4 before you do it, since that one has a lot of nuance.

Sure! The more thought I gave this (part 4), the bigger I realized it is 😀. I said 1 week, but as you can guess from the steps below, it will take longer. So here come "a few lines" 😀
I think we should break it down into smaller issues, for better time estimation and to reduce misunderstandings about what was expected. (Same with part 2.)

Regarding cost vs. benefit: it's hard to tell how many new keys we will find. If a user's email message is fairly new, then the key will probably be available in DNS anyway. For other messages, the only clue I have is that in the public key batch job we did, we found 122 new keys from 38k signed messages. There may of course be other benefits to this, like gaining knowledge or demonstrating the concept. As for the cost, I guess it's the obvious cost of the upfront work, plus maintenance, and perhaps cloud computing costs for the GCD.

But I have an alternative suggestion:
We could do only part A below, which is much smaller, and then run the GCD solver from time to time instead (either manually, or started automatically once a day or so). The difference would be that the user won't get any extra feedback during upload; the upload would look just like it does now. But given that the probability of finding new keys with this method is probably low, I don't think such feedback would add that much user value.

Anyway, here's a breakdown of what I have thought of so far:

Part A:

  • Create a new database table "EmailSignatures", with columns: Domain, Selector, HeaderHash, DkimSignature. (Edit: also need a Timestamp column, as per feedback below) - Done: 350e709
  • Implement/find a tool for canonicalization of email headers during upload. Maybe we have code for this in some of the other zkemail projects. (See the sketch after this list.)
  • When a user uploads an email via /upload_gmail: canonicalize the header fields and calculate the hash. Put the hash in the EmailMessages table. Done: 350e709
  • A small modification of the current GCD solver (https://github.com/foolo/sigs2rsa/blob/main/sigs2rsa.py) to simply use those stored hashes and signatures instead of calculating them from the plaintext. Done: foolo/sigs2rsa@3c1574b
  • Update the privacy policy with new info about data usage. (Do we need a re-verification with Google for that?) Done: 350e709
  • (If we choose to implement only part A: Create a script that runs the GCD solver for new records in EmailSignatures table, and adds any new keys to the db.)
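Regarding the canonicalization step above: in case we don't find existing code to reuse, RFC 6376 "relaxed" header canonicalization is small enough to sketch (illustrative only):

```python
# Sketch of RFC 6376 "relaxed" header canonicalization and header hashing
# (illustrative; we may reuse code from another zkemail project instead).
import hashlib
import re

def relaxed_canonicalize_header(name: str, value: str) -> bytes:
    # Lowercase the field name, unfold the value, collapse whitespace runs
    # to a single space, and strip leading/trailing whitespace.
    value = re.sub(r"[ \t\r\n]+", " ", value).strip()
    return f"{name.lower()}:{value}\r\n".encode()

def header_hash(signed_headers: list[tuple[str, str]], dkim_sig: tuple[str, str]) -> bytes:
    # Hash the headers listed in the DKIM h= tag, in order, followed by the
    # DKIM-Signature header itself with b= emptied, which per the RFC is
    # included without its trailing CRLF.
    data = b"".join(relaxed_canonicalize_header(n, v) for n, v in signed_headers)
    data += relaxed_canonicalize_header(*dkim_sig).rstrip(b"\r\n")
    return hashlib.sha256(data).digest()
```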

Part B, needed for added user feedback during upload:
(update: moved this to new issue: #91 )

  • Set up a "GCD solver server" that will run on Modal/AWS/etc. Either the web server or the solver server needs a queuing/blocking mechanism if the capacity of the solver is limited. Also needs authorization between the web server and the solver server.
  • If we are to use modal.com, we probably need to modify the GCD solver to move away from Docker/SageMath (which most likely won't work on Modal) and instead use, for example, gmpy2, which seems to have about the same performance. Done: foolo/sigs2rsa@f432e78
  • During user email upload, if we don't already have the public key for the domain+selector and cannot find it via DNS, then search for that domain+selector in the EmailSignatures table, and if found, call the GCD solver server with the hash+signature of the newly contributed message and the message from EmailSignatures.
  • (*) Implement the mechanism to send the requests from the web server to the GCD solver server, and then handle the results asynchronously and put them in the database.
  • (*) Propagate those results back to the frontend, to display to the user while uploading.

(*) These steps are the ones that are least clear at the moment.

And a note about on-the-fly solving and parallelism: we can run many GCD calculations in parallel, but the GCD method itself is single-threaded, so each GCD call will still take 10-40 seconds, depending on key size.
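A minimal sketch of that fan-out, reusing the recover_modulus helper from the earlier sketch (the module name here is hypothetical):

```python
# Fan out independent GCD jobs across processes; each job is still
# single-threaded internally, so a process pool only shortens wall-clock
# time when several candidate pairs exist.
from multiprocessing import Pool

from gcd_sketch import recover_modulus  # hypothetical module holding the earlier sketch

def solve_pairs(pairs: list[tuple[int, int, int, int]]) -> list[int]:
    """pairs: (sig1, padded_hash1, sig2, padded_hash2) tuples."""
    with Pool() as pool:
        return pool.starmap(recover_modulus, pairs)
```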

We should store the timestamp as well. Then,

search for that domain+selector in the EmailSignatures table, and if found, call the GCD solver server with the hash+signature of the newly contributed message and the message from EmailSignatures.

if many match, we can just do GCD on the signatures immediately backwards and forwards.

Docker works just fine on Modal! I recommend giving that a shot, as it'll be fastest since you already have that, right?
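For reference, a rough sketch of running the existing Docker image on Modal. The image name and the sigs2rsa module layout are placeholders, and the SDK calls are per Modal's current Python API, so this should be double-checked against their docs:

```python
# Hypothetical sketch of running the GCD solver on Modal from its existing
# Docker image; the image name and in-image import are placeholders.
import modal

app = modal.App("gcd-solver")
image = modal.Image.from_registry("ourregistry/sigs2rsa:latest")  # placeholder

@app.function(image=image, timeout=300)
def run_gcd(sig1: int, h1: int, sig2: int, h2: int) -> int:
    from sigs2rsa import recover_modulus  # assumed to exist inside the image
    return recover_modulus(sig1, h1, sig2, h2)
```

From the web server, this could then be fired asynchronously with run_gcd.spawn(...) and the result fetched later with .get(), which would cover the (*) steps marked above.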

if many match, we can just do GCD on the signatures immediately backwards and forwards.

I didn't understand what you meant by this one (many uploaded emails? many DB records?), and I'm a bit confused about "immediately backwards and forwards".

Ah sorry, let me clarify. If many other DKIM signature-message pairs exist for that domain and selector, then when we see a new email we can simply run GCD against the ones with the closest timestamps before and after it, to reverse engineer possible keys. This is an optimization that can be implemented later, but we need to ensure we are saving the timestamp as well now!

Then I understand, thanks! Yeah, I will add the timestamp.
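For reference, the "closest before and after" lookup described above could be as simple as this sketch (SQLite syntax for concreteness, against an EmailSignatures table with the columns discussed earlier):

```python
# Sketch: fetch the stored signatures nearest in time to a new email,
# one before and one after, for the same domain and selector.
import sqlite3

def nearest_neighbors(conn: sqlite3.Connection, domain: str, selector: str, ts: str):
    before = conn.execute(
        "SELECT HeaderHash, DkimSignature FROM EmailSignatures"
        " WHERE Domain = ? AND Selector = ? AND Timestamp <= ?"
        " ORDER BY Timestamp DESC LIMIT 1",
        (domain, selector, ts),
    ).fetchone()
    after = conn.execute(
        "SELECT HeaderHash, DkimSignature FROM EmailSignatures"
        " WHERE Domain = ? AND Selector = ? AND Timestamp > ?"
        " ORDER BY Timestamp ASC LIMIT 1",
        (domain, selector, ts),
    ).fetchone()
    return before, after  # run GCD against each non-None neighbor
```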

Docker works just fine on Modal! I recommend giving that a shot, as it'll be fastest since you already have that, right?

Oh okay. I actually already ported it to gmpy2 a few days ago, so that was a bit of unnecessary work, but luckily it didn't take that long. At least we got rid of a big dependency.

Closing this in favor of partial issues: #90 #91 #92 #93