Using deep learning and NLP to analyze a large corpus of clear-text passwords.
Objectives:
- Train a generative model.
- Understand how people change their passwords over time: hello123 -> h@llo123 -> h@llo!23.
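One way to quantify the password changes in the second objective is an edit distance between consecutive passwords. The following is a minimal sketch, assuming Levenshtein distance is a suitable measure; the function name `levenshtein` and the `history` list are illustrative and not taken from the project code.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


# Hypothetical password history for one account, taken from the example above.
history = ["hello123", "h@llo123", "h@llo!23"]
distances = [levenshtein(x, y) for x, y in zip(history, history[1:])]
print(distances)  # one substitution per step: [1, 1]
```

A small distance between consecutive passwords suggests the user made a minor tweak rather than choosing a new password.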
Disclaimer: for research purposes only.
- 1.4 Billion Clear Text Credentials Discovered in a Single Database
- Collection of 1.4 Billion Plain-Text Leaked Passwords Found Circulating Online
- Archive of 1.4 BEEELLION credentials in clear text found in dark web archive
- Forbes
- Download any Torrent client.
- Here is a magnet link you can find on Reddit:
- magnet:?xt=urn:btih:7ffbcd8cee06aba2ce6561688cf68ce2addca0a3&dn=BreachCompilation&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fglotorrents.pw%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337
- Stay tuned!
Generate the JSON files that map each email to its list of passwords. The output folder is ~/BreachCompilationAnalysis.
python3 read.py --breach_compilation_folder ~/BreachCompilation
- Make sure you have enough free memory (8GB should be enough).
- It took 1h30m to run on an Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (on a single thread).
- Uncompressed output is 13 GB.
Output is of the form:
> less ReducePasswordsOnSimilarEmailsCallback-z-b.json # emails starting with zb.
{
"zb-email1@yahoo.com": [
"pass1",
"pass2"
],
"zb-email2@yahoo.com": [
"pass1",
"pass2",
"pass3"
],
[...]
}
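A file in this shape can be read with the standard `json` module. The sketch below assumes each output file is a single JSON object mapping an email to a list of passwords, as in the excerpt above; the helper names `load_output_file` and `password_count_histogram` are hypothetical, and the inline sample stands in for a real file.

```python
import json
from collections import Counter


def load_output_file(path):
    """Load one output file (assumed to be an email -> list of passwords object)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def password_count_histogram(mapping):
    """Count how many emails have 1, 2, 3, ... passwords."""
    return Counter(len(passwords) for passwords in mapping.values())


# Stand-in for load_output_file("ReducePasswordsOnSimilarEmailsCallback-z-b.json").
sample = json.loads("""
{
  "zb-email1@yahoo.com": ["pass1", "pass2"],
  "zb-email2@yahoo.com": ["pass1", "pass2", "pass3"]
}
""")

histogram = password_count_histogram(sample)
for count, emails in sorted(histogram.items()):
    print(f"{emails} email(s) with {count} password(s)")
```

Emails with two or more passwords are the ones that matter for studying how passwords change over time.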