zkemail/archive.prove.email

One-time DNS lookup using 1M domain list and large selector list

Closed this issue · 11 comments

Yush G, [2024-03-26 07:02]
I'm also wondering if it would be a good idea to do a one-time DNS lookup over the Alexa top million: https://github.com/vavkamil/dkimsc4n/blob/master/dkim_selectors.lst

Yush G, [2024-03-26 07:02]
Then the selectors that match can be cached and carried over to the cron job
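
For context: checking one domain/selector pair is a single TXT query against `{selector}._domainkey.{domain}`. A minimal sketch using dnspython (the helper name is illustrative, not from the thread):

```python
# One DNS lookup per domain/selector pair: the DKIM public key, if any,
# lives in a TXT record at {selector}._domainkey.{domain}.
import dns.resolver
import dns.exception

def has_dkim_key(domain: str, selector: str) -> bool:
    try:
        answers = dns.resolver.resolve(f"{selector}._domainkey.{domain}", "TXT")
    except dns.exception.DNSException:
        return False
    # Real DKIM records carry a "p=" public-key tag.
    txt = b"".join(b"".join(rdata.strings) for rdata in answers)
    return b"p=" in txt
```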

Yush G, [2024-03-26 07:03]
I'm just worried that we might get selectors that aren't actually used

Yush G, [2024-03-26 07:03]
But that way we can get a ton of domains

Yush G, [2024-03-26 07:03]
We can also maybe expand the list with our own selectors

Olof, [2024-03-26 07:05]
Might be an idea; it will be ~2 billion calls though :) And yes, you're right, it may also give us many selectors that aren't actually used

Yush G, [2024-03-26 07:11]
Well, it'll be a one-time job; it can be distributed over a week or whatever lol

Yush G, [2024-03-26 07:12]
I wonder if we should benchmark it with the domains we do have?

Yush G, [2024-03-26 07:12]
And check what the diff would be

Olof, [2024-03-26 07:14]
Yep, that should be quite easy.
We can also use just a random subset of the domains, or the selectors, or both, and extrapolate an estimate.

OK, I have an idea: what if we mark all of these found ones as "?" or something, meaning that we don't know whether emails are actively being sent with them, while the ones currently in the database that were added via Gmail stay fully verified?

Yes, that's a good idea; then we have the option to handle them separately in the future.

I'm doing some tests now. On average I'm finding about one new, unseen DKIM selector per domain from the 1M list (using the "majestic_million.csv" list from https://github.com/PeterDaveHello/top-1m-domains?tab=readme-ov-file ; the Alexa backup seems to have been taken down).
At a speed of about 10 DNS lookups per second, the whole thing would take 3 years :)
So it would need some optimization (e.g. narrowing down the lists, running parallel jobs).
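
A sanity check on that figure: the total is (number of domains) × (number of selectors in the list). The selector list size isn't stated here; ~1000 selectors is an assumption that makes the quoted 3-year number work out:

```python
# Back-of-envelope runtime estimate; the ~1000-selector list size is an
# assumption chosen to match the 3-year figure quoted above.
domains = 1_000_000
selectors = 1_000
rate = 10  # DNS lookups per second
seconds = domains * selectors / rate
print(seconds / 86_400 / 365)  # ~3.2 years
```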

LMFAO, well, we definitely don't have time for 3 years. I'd recommend writing it in Python and using modal.com to distribute it; it should fit under the free plan and only take a few extra lines.
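
For reference, the fan-out shape such a script takes on Modal looks roughly like this. This is a sketch only; the helper names and the tiny selector list are placeholders, and the actual script is linked in the next comment:

```python
# Minimal sketch of distributing the DNS scan with Modal's .map() fan-out.
import modal

app = modal.App("dkim-onetime-scan")
image = modal.Image.debian_slim().pip_install("dnspython")

@app.function(image=image)
def check_domain(domain: str) -> list[str]:
    # Return the selectors from a fixed list that resolve for this domain.
    # dnspython is imported inside the function since it lives in the image.
    import dns.resolver, dns.exception
    found = []
    for selector in ["default", "google", "selector1", "k1"]:  # placeholder list
        try:
            dns.resolver.resolve(f"{selector}._domainkey.{domain}", "TXT")
            found.append(selector)
        except dns.exception.DNSException:
            pass
    return found

@app.local_entrypoint()
def main():
    domains = ["example.com", "example.org"]  # would be the 1M list
    for domain, selectors in zip(domains, check_domain.map(domains)):
        print(domain, selectors)
```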

Hehe, yes, 3 years is a bit long :) But I have now moved to Python and modal.com: https://github.com/zkemail/registry.prove.email/blob/onetime-dns/util/dsp_onetime_batch.py

So far, with modal.com, I have managed to speed it up to around 6000 DNS lookups per second; at that rate the whole process would take about 3 days, which is a bit more acceptable.

An estimate of the expected number of new domain/selector pairs: I found 8019 pairs for the first 5433 domains in the list. If that frequency holds for the rest of the list, that would be almost 1.5 million new pairs in the database.
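
The arithmetic behind both figures (my own sanity check from the numbers quoted above; the total lookup count is an assumption consistent with the 3-day figure):

```python
# Extrapolating the pair count from the quoted prefix sample:
pairs, sampled, total = 8019, 5433, 1_000_000
print(pairs / sampled * total)  # ~1.48 million pairs

# Throughput check: ~1.5e9 total lookups at 6000/s.
print(1.5e9 / 6000 / 86_400)  # ~2.9 days
```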

Sounds good. What is the projected cost? If it's more than the $100 of free credit or whatever, let me know and I'll give you the keys to my account.

> Sounds good. What is the projected cost?

I tried a sparse run with every 1000th domain in the list, which consumed $0.36 of the credits, giving a projected cost of about $360 for the full list, so it's not super cheap. But it might be possible to lower it further; I'll keep working on it.

> If it's more than the $100 of free credit or whatever, let me know and I'll give you the keys to my account.

I got $30 of free credits when signing up, so yes we probably need more credits.

Any opinion on which domain list we should use? Right now I'm using the "Majestic" list from https://github.com/PeterDaveHello/top-1m-domains?tab=readme-ov-file

And an update on the estimated number of new domain/selector pairs: 0.5 million is a better estimate, based on the run with every 1000th domain. (The 1.5 million figure came from running only from the top of the list, where domains likely use more selectors.)

Found this one with some common selectors and selector name patterns:
https://github.com/ryancdotorg/dkimscan/blob/master/dkimscan.pl#L310
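
For a sense of what "selector name patterns" means in practice: tools like dkimscan expand templates into candidate selectors. A sketch in the same spirit (the patterns below are illustrative examples, not copied from dkimscan.pl):

```python
# Generate candidate selectors from common naming patterns.
from itertools import product

def candidate_selectors():
    yield from ["default", "dkim", "mail", "google", "selector1", "selector2", "k1"]
    # Numbered variants: s1..s4, key1..key4, ...
    for base, n in product(["s", "key", "sel", "mta"], range(1, 5)):
        yield f"{base}{n}"
    # Date-based variants: 2018, 201801, ..., common for rotated keys.
    for year in range(2018, 2025):
        yield str(year)
        for month in range(1, 13):
            yield f"{year}{month:02d}"

print(len(list(candidate_selectors())))
```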

@Divide-By-0 Here are some statistics

There are 2686 unique selectors in our database. Here they are listed by frequency: selector_frequencies.txt

The overlap between our unique selectors and dkim_selectors.lst is quite small: only 112 selectors in common

My suggestion is to merge the following three:

into one static file that we use for the batch job. This file will contain ~2300 selectors (15% more than dkim_selectors.lst).
I tried this merged file in a sparse run and got about 12% more domain/selector pairs than with dkim_selectors.lst alone, and more than double the number of unique selectors in the results, which means the merged list is much better at also finding rare selectors.
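
The merge itself is just a deduplicating set union; a sketch (the filenames are placeholders, since the three source lists aren't spelled out above):

```python
# Merge several selector lists into one deduplicated static file.
def load(path: str) -> set[str]:
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

merged = load("dkim_selectors.lst") | load("our_db_selectors.txt") | load("pattern_selectors.txt")
with open("merged_selectors.lst", "w") as f:
    f.write("\n".join(sorted(merged)) + "\n")
print(f"{len(merged)} selectors")  # expected to land around ~2300 per the note above
```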

Some more statistics if you're interested:
selector_statistics_summary.txt
unique_selectors_excluding_hashes.txt (with all the selectors that look like random hashes, e.g. vw3ejzzpua3uqxtc6dcglg6uraqtxxdd, filtered out)
intersection between dkim_selectors.lst and unique_selectors_excluding_hashes.txt
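
One possible heuristic for that hash filtering, since "look similar to" is doing a lot of work there (the thresholds below are my guesses, not the rule actually used):

```python
# Flag long, single-case alphanumeric selectors with no recognizable words
# as hash-like, e.g. "vw3ejzzpua3uqxtc6dcglg6uraqtxxdd".
import re

HASH_LIKE = re.compile(r"^[a-z0-9]{20,}$")

def looks_like_hash(selector: str) -> bool:
    return bool(HASH_LIKE.match(selector)) and not any(
        word in selector for word in ("mail", "dkim", "selector", "google")
    )

print(looks_like_hash("vw3ejzzpua3uqxtc6dcglg6uraqtxxdd"))  # True
print(looks_like_hash("selector1"))                         # False
```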

Tools added here: 50f18c2