Recurrence by email address / username
dennisbmoore opened this issue · 6 comments
Did you capture usernames / email addresses in your data set? Can you determine uniqueness or lack thereof by email addresses? For example, what fraction of the passwords associated with a specific username (email address if relevant) are unique, and how does that vary with the number of duplicates of the username (i.e., reuse of passwords vs # of times the username is matched in the data set). Thanks!
Hello!
Yes, I did capture username/email tuples in my data.
It is a great idea; however, a large-scale analysis on both username and password is extremely time-consuming, because it requires a join operation on 1 billion rows.
But it is not as impactful as you might think.
- Average number of times each email was found is 1.889.
- 196,250,369 emails were found only once.
- A few email addresses are responsible for raising the average.
mail.ru@hotmail.com was the most common email address, found 90549 times, along with gmail.com@hotmail.com (85k times), password@gmail.com (38k times), info@yahoo.com (31k times) and so on.
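For anyone who wants to reproduce per-email figures like the ones above on their own data, they only need a single counting pass rather than a join. This is a minimal sketch, not the exact code used for this data set: the function name `email_frequency_stats` is hypothetical, and it assumes the emails fit in memory as an iterable of strings and that lowercasing and trimming is an acceptable normalization.

```python
from collections import Counter


def email_frequency_stats(emails, top_n=3):
    """Return (mean occurrences per email, number of emails seen once, top_n most common).

    Hypothetical helper: normalizes by stripping whitespace and lowercasing,
    which is an assumption about how addresses should be compared.
    """
    counts = Counter(e.strip().lower() for e in emails)
    if not counts:
        return 0.0, 0, []
    total = sum(counts.values())
    mean = total / len(counts)  # average number of times each distinct email appears
    singletons = sum(1 for c in counts.values() if c == 1)
    return mean, singletons, counts.most_common(top_n)


if __name__ == "__main__":
    sample = ["a@x.com", "a@x.com", "b@x.com", "A@x.com "]
    print(email_frequency_stats(sample))
```

At a billion rows an in-memory `Counter` may not fit, so the same counting would likely have to be done with an external sort or a database `GROUP BY`.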
So, I've decided not to process that metric, because it would be too computationally heavy for minimal impact.
If you disagree, please feel free to say so!
Cheers!
Interesting. For the emails used many thousands of times, I wonder if those should be blacklisted (along with any accounts created using those as secondary accounts) - probably fraud related.
What if you limited it to accounts that appeared within a smaller range of occurrences, say 10 to 500 times? This could substantially reduce the computational cost and would still seem to provide important information about password reuse.
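To make the suggestion concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption on my part: the function name `password_uniqueness` is made up, the input is taken to be an iterable of (username, password) pairs, and "fraction unique" is defined as distinct passwords divided by the number of occurrences of that username, which is only one possible definition.

```python
from collections import defaultdict


def password_uniqueness(records, min_count=10, max_count=500):
    """Map each username seen min_count..max_count times to
    (occurrences, distinct passwords / occurrences).

    Hypothetical helper; the 10-500 window is the range suggested in this
    comment, not a value taken from the data set.
    """
    passwords_by_user = defaultdict(list)
    for username, password in records:
        passwords_by_user[username].append(password)

    result = {}
    for username, passwords in passwords_by_user.items():
        n = len(passwords)
        if min_count <= n <= max_count:
            # A ratio near 1.0 means little reuse; near 0 means heavy reuse.
            result[username] = (n, len(set(passwords)) / n)
    return result


if __name__ == "__main__":
    demo = [("u", "p1")] * 6 + [("u", "p2")] * 4 + [("v", "x")] * 3
    print(password_uniqueness(demo))
```

Grouping in memory like this would not scale to the full data set as is, but filtering by occurrence count first should shrink the set of usernames that need the expensive step.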
Thanks for doing the important work you do!
I've filtered accounts which have appeared more than once in a dump (just because I don't think a regular user can register with the same email more than once on a website).
If there were 25 (username, password) tuples with the same username and password in a single dump, they were counted only once.
This had two possible outcomes: either the accounts repeating 90k times also shared the password and were not processed 90k times, or they had random passwords and did not influence the most common passwords list.
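The per-dump deduplication described above can be sketched roughly as follows. This is an illustration, not the actual processing code: the function name `count_passwords` is hypothetical, and it assumes each dump is available as an iterable of (username, password) tuples.

```python
from collections import Counter


def count_passwords(dumps):
    """Count password occurrences, treating identical (username, password)
    pairs inside a single dump as one occurrence.

    Hypothetical helper: `dumps` is assumed to be an iterable of dumps,
    each an iterable of (username, password) tuples.
    """
    counts = Counter()
    for dump in dumps:
        unique_pairs = set(dump)  # collapse exact duplicates within this dump only
        counts.update(password for _, password in unique_pairs)
    return counts


if __name__ == "__main__":
    dump_a = [("spam@x", "123456")] * 25 + [("bob@x", "123456")]
    dump_b = [("spam@x", "123456")]
    print(count_passwords([dump_a, dump_b]))
```

In this sketch the same pair appearing in two different dumps is still counted twice, since the deduplication is applied per dump.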
Interesting point though: these spam accounts appear in all kinds of lists, and they have very natural-looking passwords, so I don't think these accounts skewed the statistics beyond the most common passwords list either.
I've been checking passwords from mystery lists frantically, and I was really excited that there was something to possibly explain them, but it looks like just a fraction of these passwords are from these spam accounts.
I need the commands for this. How do I search for passwords?