Collections 1, Collections 2-5 and AntiPublic Parser

Command-line tools to manipulate the data from these multi-billion-password collections.

The full processing takes a couple of days and generates a file structure that can be queried in almost O(1).

$ <query> john.applessed@apple.com
john.applessed@apple.com:toto123
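
The exact on-disk layout isn't documented here beyond the a/b/c hint in breach split, but a lookup over such a structure is conceptually simple. Below is a minimal Python sketch, assuming records are sharded into data/<c1>/<c2>/<c3> files keyed by the first three characters of the email, one sorted email:password record per line (the shard naming is an assumption, not necessarily the tool's actual layout):

import os

def lookup(email: str, root: str = "data") -> list[str]:
    # Hypothetical layout: data/j/o/h holds every record whose email
    # starts with "joh". Shard files are small and sorted, so even a
    # linear scan keeps each query near-instantaneous.
    shard = os.path.join(root, *email[:3])
    if not os.path.isfile(shard):
        return []
    with open(shard, encoding="utf-8", errors="ignore") as f:
        return [line.rstrip("\n") for line in f if line.startswith(email + ":")]

print(lookup("john.applessed@apple.com"))
# ['john.applessed@apple.com:toto123']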

The total number of unique records in the final dataset (Collections 1 to 5 + AntiPublic + BreachCompilation) is around 3.37 billion (3,372,591,561 to be precise).

Setup

Create a virtual environment and install the package.

virtualenv -p python3 venv
source venv/bin/activate
make install
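
If the install went through, the breach entry point should now be on your PATH. A quick smoke test is to print the manual (documented in the Usage section below):

breach dumphelp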

Extraction

The extraction one-liners below are adapted from this Super User answer: https://superuser.com/questions/1308374/how-to-extract-all-tar-gz-files-present-in-multiple-folders-at-a-time-to-another

find "$1" -name '*.tar.gz' -execdir tar -xzvf '{}' -C extracted \;
find . -name "*.rar" -exec unrar x -o+ {} \;
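
If you prefer to stay in Python, a rough standard-library equivalent for the .tar.gz part looks like this (a sketch with placeholder paths; the .rar archives still need unrar or a third-party package such as rarfile):

import pathlib
import tarfile

src = pathlib.Path("/path/to/collections")  # placeholder: folder holding the archives
dst = pathlib.Path("extracted")
dst.mkdir(exist_ok=True)

for archive in src.rglob("*.tar.gz"):
    # Caution: extractall on untrusted archives can write outside dst
    # (path traversal), and these dumps are untrusted by definition.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dst)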

Processing

Processing Collection 1 is much faster than processing Collections 2-5. The estimates below are for Collections 2-5.

The parsing took around 20 hours on my server (i7-8700K CPU, 32GB of RAM). I didn't have an SSD large enough to hold all the temporary files, so everything ran on a standard HDD; a faster disk will certainly speed up the processing.

The sorting/deduplication step took 15 hours in total.

Splitting into the smaller files (the structure that makes every query almost instantaneous) took a couple of hours at most.

In total, expect around two days to process Collections 2-5.

breach parse --path /path/to/extracted --success_file success.txt --failure_file failure.txt --cython_acceleration
rm -rf tmp && mkdir tmp  # the sort needs roughly 750GB of temporary space; the default /tmp/ is usually too small for this!
cat success.txt | pv -cN in | sort -T tmp -u -S 90% --parallel=12 | pv -cN out > success_sorted.txt  # global sort + dedup
breach split --file success_sorted.txt --out data  # shard into the query-friendly layout
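
The parsing logic itself isn't shown in this README. Conceptually it scans every extracted file for email:password pairs and routes valid ones to SUCCESS_FILE. The sketch below is a simplified illustration of that idea, not the actual implementation (which also handles more separators and encodings):

import re

# Deliberately loose email check: one @, a dot in the domain,
# no whitespace or separator characters.
EMAIL_RE = re.compile(r"^[^@\s:;]+@[^@\s:;]+\.[^@\s:;]+$")

def parse_line(line: str):
    """Return a normalized 'email:password' record, or None (-> failure file)."""
    s = line.strip()
    for sep in (":", ";"):
        email, found, password = s.partition(sep)
        if found and password and EMAIL_RE.match(email):
            return f"{email.lower()}:{password}"
    return None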

Converting BreachCompilation to the new format

The dataset is available here: https://github.com/philipperemy/tensorflow-1.4-billion-password-analysis

It's easy to convert the large BreachCompilation dataset to this format by running the commands below.

Expect them to take some time (less than a day).

find /path/to/BreachCompilation/ -type f -exec cat {} + > breach_compilation.txt  # concatenate every file into one
rm -rf tmp && mkdir tmp  # again, the default /tmp/ is usually too small for the sort!
cat breach_compilation.txt | pv -cN in | sort -T tmp -u -S 90% --parallel=12 | pv -cN out > breach_compilation_sorted.txt
breach split --file breach_compilation_sorted.txt --out data_breach_compilation_sorted

From there, a simple breach merge is enough to fold it into the Collections 1 and 2-5 data (see the next section).

Merging all the datasets together

Run Collection 1 and Collections 2-5 through the processing step described above.

You will have two directories: /path/to/collections1_data and /path/to/collections2_5_data.

Additionally, if you have the BreachCompilation dataset, you will have a third directory, /path/to/data_breach_compilation_sorted, generated by the step above.

The merge is destructive, so start by copying one dataset to serve as the output, then merge the others into it:

cp -rf /path/to/collections1_data /path/to/big_dataset
breach merge --src /path/to/collections2_5_data --dest /path/to/big_dataset
breach merge --src /path/to/data_breach_compilation_sorted --dest /path/to/big_dataset
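
The merge semantics aren't spelled out above, but since both trees share the same layout and every shard file is sorted, a shard-by-shard merge is enough. Here is a minimal sketch under those assumptions (heapq.merge keeps the output sorted; the duplicate check mirrors sort -u):

import heapq
import os
import shutil

def merge_shard(src_file: str, dest_file: str) -> None:
    # Merge two sorted shard files into dest_file, dropping exact duplicates.
    tmp = dest_file + ".tmp"
    with open(src_file) as a, open(dest_file) as b, open(tmp, "w") as out:
        last = None
        for line in heapq.merge(a, b):
            if line != last:
                out.write(line)
                last = line
    os.replace(tmp, dest_file)  # atomic swap; dest stays valid on failure

def merge(src_root: str, dest_root: str) -> None:
    for dirpath, _, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            dest = os.path.join(dest_root, os.path.relpath(src, src_root))
            if os.path.exists(dest):
                merge_shard(src, dest)
            else:
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.copy(src, dest)  # new shard, just copy it over

Once merged, breach evaluate --old /path/to/data_breach_compilation_sorted --new /path/to/big_dataset can check that the old records made it into the merged dataset (see the evaluate command below).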

Usage

The manual of the command-line tool can be printed by running breach dumphelp.

Usage: cli [OPTIONS] COMMAND [ARGS]...

Options:
  --debug / --no-debug
  --help                Show this message and exit.

Commands:
  chunk     chunk large TXT files into smaller files.
  clean     Cleans a query friendly folder PATH. Move incorrect records and
            sort the files.
  dumphelp
  evaluate  Evaluates some metrics such as precision/recall (e.g. is OLD into
            NEW).
  merge     Merges dataset SRC into dataset DEST.
  parse     Parses an unstructured folder PATH of many files and generates two
            files: SUCCESS_FILE and FAILURE_FILE. All valid email:password
            will go to SUCCESS_FILE.
  sort      Sorts a query friendly folder PATH. Target is itself.
  split     Converts a large FILE to a query friendly folder OUT (e.g. a/b/c).
            Use RESTART_FROM to resume from the i-th line.
  test      Infers passwords of a list of emails defined in FILE with a query
            friendly folder DATASET.

Usage: cli dumphelp [OPTIONS]

Options:
  --help  Show this message and exit.

Usage: cli split [OPTIONS]

Options:
  --file FILE             [required]
  --out DIRECTORY         [required]
  --restart_from INTEGER  [default: 0]
  --help                  Show this message and exit.

Usage: cli chunk [OPTIONS]

Options:
  --path DIRECTORY  [required]
  --size INTEGER    [default: 50]
  --help            Show this message and exit.

Usage: cli sort [OPTIONS]

Options:
  --path DIRECTORY  [required]
  --help            Show this message and exit.

Usage: cli clean [OPTIONS]

Options:
  --path DIRECTORY  [required]
  --help            Show this message and exit.

Usage: cli test [OPTIONS]

Options:
  --file FILE                     [required]
  --dataset [breach_compilation|collections_1|collections_2_5|all]
                                  [required]
  --help                          Show this message and exit.

Usage: cli parse [OPTIONS]

Options:
  --path DIRECTORY                [required]
  --success_file FILE             [required]
  --failure_file FILE             [required]
  --cython_acceleration / --no-cython_acceleration
  --help                          Show this message and exit.

Usage: cli merge [OPTIONS]

Options:
  --src DIRECTORY   [required]
  --dest DIRECTORY  [required]
  --help            Show this message and exit.

Usage: cli evaluate [OPTIONS]

Options:
  --old DIRECTORY  [required]
  --new DIRECTORY  [required]
  --help           Show this message and exit.
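
For example, to infer the passwords of a list of emails against everything (emails.txt is a placeholder for your own file, presumably one email per line):

breach test --file emails.txt --dataset all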