Paper
-----

Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks.
W. Melicher, Blase Ur, Sean M. Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor.
USENIX Security 2016.
https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/melicher

Bugs
----

This is software used and maintained by students for a research project and likely will have many bugs and issues.

Setup using Docker
------------------

Make sure you have installed the NVIDIA driver (https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#how-do-i-install-the-nvidia-driver) and Docker (https://docs.docker.com/install). For GPU support, additionally install nvidia-docker (https://github.com/NVIDIA/nvidia-docker).

Build a CPU-only container and start an interactive bash session within it:

    ./deploy.py build-cpu
    ./deploy.py run-cpu

Build a GPU-supported container and start an interactive bash session within it:

    ./deploy.py build-gpu
    ./deploy.py run-gpu

Note: You may need to specify python3 when executing python scripts within the Docker container, e.g. `python3 pwd_guess_unit.py`.

Setup (Manual)
--------------

Requirements:

+ python3
+ python packages: listed in requirements.txt. Install with

      python3 -m pip install -r requirements.txt

  - requirements-tensorflow-1.4.txt - dependencies for working with tensorflow v1.4
  - requirements-tensorflow-cpu-1.4.txt - dependencies for working with tensorflow v1.4 and using CPU only (no GPU).

Compiling:

    python3 setup.py build_ext --inplace

Set up:

- Cuda must be in path and library path. Look at the tensorflow documentation for how to set it up.

Tests
-----

Run automated tests by:

    python pwd_guess_unit.py

Running all tests takes roughly 15 minutes on my machine. It may take more depending on the GPU you are using.
or to run only specific tests:

    python -m unittest pwd_guess_unit.<specific unit test>

Help
----

    python3 pwd_guess.py --help

    usage: pwd_guess.py [-h] [--pwd-file PWD_FILE [PWD_FILE ...]]
                        [--arch-file ARCH_FILE] [--weight-file WEIGHT_FILE]
                        [--pwd-format {trie,tsv,list,im_trie} [{trie,tsv,list,im_trie} ...]]
                        [--enumerate-ofile ENUMERATE_OFILE] [--retrain]
                        [--config CONFIG] [--args ARGS] [--profile PROFILE]
                        [--log-file LOG_FILE]
                        [--log-level {debug,info,warning,error}] [--version]
                        [--pre-processing-only] [--stats-only]
                        [--config-args CONFIG_ARGS]
                        [--forked {guesser,random_walker}]
                        [--calc-probability-only] [--train-secondary-only]

    Neural Network with passwords. This program uses a neural network to guess
    passwords. This happens in two phases, training and enumeration. Either --pwd-
    file or --enumerate-ofile are required. --pwd-file will give a password file
    as training data. --enumerate-ofile will guess passwords based on an existing
    model. Version <version number>

    optional arguments:
      -h, --help            show this help message and exit
      --pwd-file PWD_FILE [PWD_FILE ...]
                            Input file name.
      --arch-file ARCH_FILE
                            Output file for the model architecture.
      --weight-file WEIGHT_FILE
                            Output file for the weights of the model.
      --pwd-format {trie,tsv,list,im_trie} [{trie,tsv,list,im_trie} ...]
                            Format of pwd-file input. "list" format is onepassword
                            per line. "tsv" format is tab separated values: first
                            column is the password, second is the frequency in
                            floating hex. "trie" is a custom binary format created
                            by another step of this tool.
      --enumerate-ofile ENUMERATE_OFILE
                            Enumerate guesses output file
      --retrain             Instead of training a new model, begin training the
                            model in the weight-file and arch-file arguments.
      --config CONFIG       Config file in json.
      --args ARGS           Argument file in json.
      --profile PROFILE     Profile execution and save to the given file.
      --log-file LOG_FILE
      --log-level {debug,info,warning,error}
      --version             Print version number and exit
      --pre-processing-only
                            Only perform the preprocessing step.
      --stats-only          Quit after reading in passwords and saving stats.
      --config-args CONFIG_ARGS
                            File with both configuration and arguments.
      --forked {guesser,random_walker}
                            Internal use only.
      --calc-probability-only
                            Only output password probabilities
      --train-secondary-only
                            Only train on secondary data.

Pretrained Network Usage
------------------------

Enumerating passwords

Edit guess_len8_config.json to replace "g1_len8.tsv" in the "enumerate_ofile" key with the output file you would like. If you want to guess more passwords, change the value of "lower_probability_threshold" to something lower, e.g. 1e-8. Passwords are not sorted, so if you want guesses in order, sort the output file by descending probability:

    sort -gr -k2 -t$'\t' [OUTPUT_FILE] -o [SORTED_OUTPUT_FILE]

Monte Carlo Simulation

Edit guess_len8_config.json to replace "g1_len8.tsv" in the "enumerate_ofile" key with the output file you would like. Edit "<input_file>" in the "password_test_fname" key to set the password input file. This key should point to a line-delimited password file where each line is one password.

Command:

    python3 <path_to_root>/pwd_guess.py --config-args <config_file.json>

e.g.:

    python3 ../pwd_guess.py --config-args guess_len8_config.json

Version
-------

    python pwd_guess.py --version

Output format
-------------

delamico_random_walk - This output format performs a Monte Carlo estimation of the guess number, the strength of a password.
The output file is a TSV where each line has 7 fields: the password, the probability of that password, the estimated output guess number (the strength of the password), the standard deviation of the randomized trial for this password (in units of number of guesses), the number of measurements for this password, the estimated confidence interval for the guess number (in units of number of guesses).

human - This output format enumerates guesses and stores the list of passwords guessed to the output file. The guesses are not in order of probability. The output file is a TSV with each line having two fields: the password, and the probability. You can sort the passwords by probability using the unix sort command.

calculator - This output format calculates the exact number of guesses for a test set of passwords by enumerating guesses. The output file is a TSV with 3 fields: the password, the probability for that password, and the guess number.

generate_random - This output format generates random passwords and stores them to disk. The output is a TSV with 2 fields: the random password and its probability.

Config files
------------

Configuration information for guessing and training. Can be read from a file in json format.

# Files Configuration Options:

intermediate_fname - File name to store intermediate information about processing relative to the current directory. A value of ':memory:' will store all values in memory. Default is ':memory:'. Setting a file name is necessary if enumeration and training happen at different times.

Neural network Model Configuration:

char_bag - Alphabet of characters over which to guess. By default this includes all keyboard keys (e.g., alphanumeric characters and some symbols).

model_type - Type of model. Should be LSTM or GRU or JZS{1,2,3} (JZS1,2,3 are only supported in earlier versions of the Keras library).

hidden_size - Size of each hidden recurrent layer.

dense_layers - Number of additional dense layers.

dense_hidden_size - Size of dense layer.
layers - Number of hidden layers.

embedding_layer - Whether to use a character embedding layer as the first layer.

embedding_size - The size of the character embedding layer. embedding_layer has to be set to true along with this option.

max_len - Maximum length of any password in training data. This can be larger than all passwords in the data and the network may output guesses that are this many characters long.

min_len - Minimum length of any password that will be guessed.

model_optimizer - Model optimizer. Default is 'adam'. Read about optimizer values in the Keras documentation: http://keras.io/optimizers/.

context_length - Number of context characters to use. Lower means less time to train; more could potentially increase accuracy.

generations - More generations means training takes longer but is more accurate. Default is 20.

dropouts - Use dropout in the neural network. If true, this can help prevent overfitting.

dropout_ratio - Ratio of dropouts.

train_backwards - If true, train on passwords backwards: e.g., guessing d from 'rowssap' instead of guessing d from 'passwor'.

bidirectional_rnn - Only supported for some versions of Keras. If true, then use a Bidirectional version of the neural network model.

deep_model - If true, then train a deeper NN model. Set this to true if you use more than one layer in the 'layers' argument.

padding_character - If true, then use a padding character. This should generally be false, but is included for backward compatibility. Models trained before version 275 include a padding character.

# Training Configuration Options:

freq_format - Can be 'hex' or 'decimal'. This defines the format of frequency integers in the training sets. Only applicable when using TSV format for input.

secondary_training - If true, use a secondary training set after the primary training set.

secondary_train_sets - Json dictionary in this format:

    "secondary_train_sets" : {
        "pwd_file" : [ "<pwd_file>" ],
        "pwd_format" : [ "list" ]
    }

pwd_file is a list of files.
pwd_format is a list of formats corresponding to each file. Accepts the same options as the --pwd-format argument.

freeze_feature_layers_during_secondary_training - If true, then during secondary training, the feature layers will be frozen. This is useful for avoiding overfitting to the secondary training set, especially if the secondary training set is significantly smaller than the primary set.

secondary_training_save_freqs - If true, then use the secondary training set for post-processing frequencies instead of the primary set.

training_chunk - A smaller training chunk means less memory consumed on the GPU; a larger training chunk means more GPU memory consumed. Ideally, this value would be as large as possible without running out of memory on the GPU. Large values could in principle lower training quality, but I have not observed this in practice.

chunk_print_interval - Interval over which to print info to the log. This value is also used as the number of previous batches over which the moving average loss is calculated for early stopping decisions.

train_test_ratio - Ratio of training data to holdout testing data. A value of 20 means using one out of every 20 passwords for holdout testing. These passwords are only used to print accuracy statistics in the log data and for early-quit statistics. The logged accuracy statistics are only for diagnostic and debugging purposes and should not be used in a real test. To perform a real test, you should not give any test passwords during training.

training_accuracy_threshold - If the accuracy is not improving by this amount each generation, then quit. Set to -1 to never quit early.

rare_character_optimization - Default false. If you specify a list of characters to treat as rare, then it will model those characters with a rare character. This will increase performance at the expense of accuracy.

rare_character_lowest_threshold - Default 20.
The characters with the lowest frequency in the training data will be modeled as special characters. This number indicates how many to drop. A value of 20 means treating the 20 least frequent characters in the training set as rare characters.

uppercase_character_optimization - Default false. If true, uppercase characters will be treated the same as lowercase characters. Uppercase characters will be predicted via post-processing output according to the frequency of uppercase characters in the training data.

no_end_word_cache - When rare_character_optimization or uppercase_character_optimization is used, it uses different post-processing percents for the first and last character. If no_end_word_cache is true, then only the first character has different post-processing values. The intuition for this is that uppercase characters are likely more probable as the first character and special characters more likely as the last character.

simulated_frequency_optimization - Default false. Only for TSV files. If set to true, then multiple instances of the same password are simulated. This can improve performance at the expense of accuracy.

save_always - Boolean. Default true. If false, then only the networks which perform best on verification data will be saved to disk.

save_model_versioned - Boolean. When saving the model, save each generation of the model using a different file name. You can use this to measure the effect of more generations on models. The first generation is saved as <model_file>.1, the second generation is saved in the file <model_file>.2, where <model_file> is the model file name given in the arguments.

randomize_training_order - If true, will randomize the passwords training order.

compute_stats - Compute pre-processing step and exit without training a neural network.

tokenize_words - If true, create a tokenized model.

most_common_token_count - If tokenize_words is true, then this is the number of tokens to simulate.
E.g., 2000 will simulate the most common 2000 tokens in the training set.

tensorboard - Boolean. If true, will create training visualizations with training statistics.

tensorboard_dir - The directory where the tensorboard data should be saved. Defaults to current working directory.

early_stopping - Boolean. If true, will enable early stopping logic to save weights and stop training when the accuracy fails to improve. Training will wait up to early_stopping_patience batches for the loss to decrease before it stops.

early_stopping_patience - Integer. The early stopping algorithm will wait for the number of batches specified by this parameter before stopping the training.

# Guessing Configuration Options:

lower_probability_threshold - This controls how many passwords to output during generation. Lower threshold means more passwords. A value of 1e-7 will output all passwords with probability above 1e-7.

relevel_not_matching_passwords - If true, then passwords that do not match the filter policy will have their probability equal to zero and that probability will be redistributed to other passwords. Recommended true.

guess_serialization_method - Default is 'human' which enumerates all passwords above the lower_probability_threshold cutoff. 'delamico_random_walk' means calculate password guess numbers using Monte Carlo simulations. 'generate_random' means generate random passwords. 'calculator' enumerates all passwords, but does not save the enumerated passwords to disk; instead it calculates the guess number of the test set of passwords.

parallel_guessing - Boolean. If true, then use multiple cores to generate passwords.

fork_length - The prefix length to fork on when parallel_guessing is true. If this value is 2, then prefixes of length 2 will be assigned to different cores. For example, one core will generate passwords that start with 'aa', another with 'ab', etc.
guesser_intermediate_directory - Directory to store intermediate files used in parallel guessing.

cleanup_guesser_files - If true, then delete files in the guesser_intermediate_directory after completion.

password_test_fname - File name containing test passwords. Each password should be on one line.

chunk_size_guesser - Number of passwords to send to the GPU in one chunk. More increases performance but could run out of GPU memory.

max_gpu_prediction_size - Maximum number of password fragments to send to the GPU in one chunk. More increases performance but could run out of GPU memory.

gpu_fork_bias - Ratio to decrease the chunk size when using multiple processes. Parallel guessing takes up more fixed memory on the GPU so can lead to running out of GPU memory more easily. This value controls how much to decrease memory by when forking.

cpu_limit - Number of processes to fork when using parallel guessing.

tokenize_guessing - If true, and if tokenize_words is true, then perform tokenization during guessing.

probability_striation - If non-zero, then instead of enumerating probabilities for specific passwords, enumerate the guess numbers at certain probability cutoffs. This is useful for exporting a pre-computation of the probability to guess number mapping.

prob_striation_step - If probability_striation is non-zero, then it will calculate guess numbers at the probabilities 10^(-j * prob_striation_step) for j in 1..probability_striation. So for example, for prob_striation_step = 1 and probability_striation = 10, it would calculate the guess number at the following probabilities: 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9, 1e-10.

enforced_policy - Will not generate guesses that do not match the policy. Currently supported policies are:

    'complex' - requires 8 characters and 4 classes.
    'basic' - no requirements
    '1class8' - requires 8 characters
    'basic_long' - requires 16 characters
    'complex_lowercase' - requires 8 characters and 3 character classes insensitive to case.
    'complex_long' - requires 16 characters and 3 character classes
    'complex_long_lowercase' - requires 16 characters and 2 character classes insensitive to case.
    'semi_complex' - requires 12 characters and 3 character classes
    'semi_complex_lowercase' - requires 12 characters and 2 character classes insensitive to case.
    '3class12' - Same as semi_complex
    '2class12_all_lowercase' - Same as semi_complex_lowercase
    'one_uppercase' - Requires at least one uppercase character

*_lowercase policies are insensitive to case: case is ignored. These are useful when preparing a train set using the policyfilterer.py utility, but not useful for training or guessing with a neural network.

# Monte Carlo Methods Configuration Options:

random_walk_seed_num - Number of passwords to keep in main memory in one chunk. More increases memory requirements.

random_walk_confidence_bound_z_value - Confidence bound coefficient. This should correspond to the coefficient for a confidence interval. E.g., 95% means a value of 1.96, 99% means a value of 2.58 [https://en.wikipedia.org/wiki/Confidence_interval]. Default is 1.96.

random_walk_confidence_percent - Confidence percent for the random_walk guesser. A value of 5 will mean that the simulation will continue until all passwords have a confidence interval less than 5% of the estimated guess number.

random_walk_upper_bound - Upper bound on the number of rounds to continue simulation.

pwd_list_weights - Weighting to give different training sets. This should be a json dictionary mapping file names to a ratio:

    "pwd_list_weights" : {
        "file1" : 1,
        "file2" : 2
    }

This will weight passwords in file1 as being twice as important as file2.

# Deprecated Configuration Options related to Trie preprocessing. Don't use these:

trie_serializer_encoding - default is 'utf8'.

trie_serializer_type - 'reg' or 'fuzzy'.

trie_implementation - Trie implementation. 'trie' for custom implementation. None for no trie optimization.
trie_fname - File name for storing trie.

trie_intermediate_storage - File for storing intermediate trie.

preprocess_trie_on_disk

preprocess_trie_on_disk_buff_size

toc_chunk_size

use_mmap

fuzzy_training_smoothing

scheduled_sampling

final_schedule_ratio

Example Configuration File
--------------------------

You can also see the pre_built_networks/ directory for examples of configuration files. Here are some starting configuration files that you should modify to suit your needs.

Combined arguments and configuration file for generic training.

    {
        "args" : {
            "arch_file" : "arch.json",
            "weight_file" : "weight.h5",
            "log_file" : "train_log.txt",
            "pwd_file" : [ "[INPUT_FILE]" ],
            "pwd_format" : [ "list" ]
        },
        "config" : {
            "training_chunk" : 1000,
            "training_main_memory_chunk": 10000000,
            "min_len" : 8,
            "max_len" : 30,
            "context_length" : 10,
            "chunk_print_interval" : 100,
            "layers" : 2,
            "hidden_size" : 1000,
            "generations" : 5,
            "training_accuracy_threshold" : -1,
            "train_test_ratio" : 20,
            "model_type" : "LSTM",
            "train_backwards" : true,
            "dense_layers" : 1,
            "dense_hidden_size" : 512,
            "secondary_training" : true,
            "secondary_train_sets" : {
                "pwd_file" : [ "[SECONDARY_INPUT_OPTIONAL]" ],
                "pwd_format" : [ "list" ]
            },
            "simulated_frequency_optimization" : false,
            "randomize_training_order" : true,
            "uppercase_character_optimization" : true,
            "rare_character_optimization" : true,
            "rare_character_optimization_guessing" : true,
            "parallel_guessing" : false,
            "chunk_size_guesser" : 40000,
            "random_walk_seed_num" : 100000,
            "max_gpu_prediction_size" : 10000,
            "random_walk_seed_iterations" : 1,
            "no_end_word_cache" : true,
            "intermediate_fname" : "intermediate_data.sqlite",
            "save_model_versioned" : true
        }
    }

Example config of enumerating passwords:

    {
        "args" : {
            "arch_file" : "arch.json",
            "weight_file" : "nn_len8.h5",
            "log_file" : "guess_log.txt",
            "enumerate_ofile" : "g1_enumerate.tsv"
        },
        "config" : {
            "training_chunk" : 10000,
            "min_len" : 8,
            "max_len" : 30,
            "context_length" : 10,
            "chunk_print_interval" : 100,
"layers" : 2, "hidden_size" : 1000, "model_type" : "JZS2", "simulated_frequency_optimization" : true, "intermediate_fname" : "intermediate_data.sqlite", "randomize_training_order" : true, "uppercase_character_optimization" : true, "rare_character_optimization" : true, "rare_character_optimization_guessing" : true, "parallel_guessing" : false, "lower_probability_threshold" : 1e-6, "padding_character" : true, "chunk_size_guesser" : 20000, "guess_serialization_method" : "human", "random_walk_seed_num" : 100000, "max_gpu_prediction_size" : 20000, "random_walk_seed_iterations" : 1, "no_end_word_cache" : true } } Combined arguments and configuration file for guessing using Monte Carlo simulations: { "args" : { "arch_file" : "arch.json", "weight_file" : "all_trained.h5.3", "log_file" : "guess_log.txt", "enumerate_ofile": "g3_long.tsv" }, "config" : { "training_chunk" : 1000, "training_main_memory_chunk": 10000000, "min_len" : 16, "max_len" : 30, "context_length" : 10, "chunk_print_interval" : 100, "layers" : 2, "hidden_size" : 1000, "generations" : 3, "training_accuracy_threshold" : -1, "train_test_ratio" : 20, "model_type" : "JZS2", "tokenize_words" : false, "most_common_token_count" : 2000, "bidirectional_rnn" : false, "train_backwards" : true, "dense_layers" : 1, "dense_hidden_size" : 512, "secondary_training" : true, "secondary_train_sets" : { "pwd_file" : [ "../leaks/all_combined_long_v2.txt" ], "pwd_format" : [ "list" ] }, "simulated_frequency_optimization" : false, "randomize_training_order" : true, "uppercase_character_optimization" : true, "rare_character_optimization" : true, "rare_character_optimization_guessing" : true, "parallel_guessing" : false, "lower_probability_threshold" : 1e-7, "chunk_size_guesser" : 40000, "guess_serialization_method" : "delamico_random_walk", "password_test_fname" : "../leaks/basic16.txt", "random_walk_seed_num" : 100000, "max_gpu_prediction_size" : 10000, "random_walk_seed_iterations" : 50, "no_end_word_cache" : true, 
"intermediate_fname" : "intermediate_data.sqlite", "save_model_versioned" : true } } Example guessing configuration for a complex policy. { "args" : { "arch_file" : "arch.json", "weight_file" : "all_trained_cmplx.h5.3", "log_file" : "guess_log.txt", "enumerate_ofile": "g1_complex.tsv" }, "config" : { "training_chunk" : 1000, "training_main_memory_chunk": 10000000, "min_len" : 8, "max_len" : 30, "context_length" : 10, "chunk_print_interval" : 100, "layers" : 2, "hidden_size" : 1000, "generations" : 3, "training_accuracy_threshold" : -1, "train_test_ratio" : 20, "model_type" : "JZS2", "tokenize_words" : false, "most_common_token_count" : 2000, "enforced_policy" : "complex", "bidirectional_rnn" : false, "train_backwards" : true, "dense_layers" : 1, "dense_hidden_size" : 512, "secondary_training" : true, "secondary_train_sets" : { "pwd_file" : [ "../leaks/all_combined_long_v2.txt" ], "pwd_format" : [ "list" ] }, "simulated_frequency_optimization" : false, "randomize_training_order" : true, "uppercase_character_optimization" : true, "rare_character_optimization" : true, "rare_character_optimization_guessing" : true, "parallel_guessing" : false, "lower_probability_threshold" : 1e-7, "chunk_size_guesser" : 40000, "guess_serialization_method" : "delamico_random_walk", "password_test_fname" : "../leaks/complex/andrew8.txt", "random_walk_seed_num" : 100000, "max_gpu_prediction_size" : 10000, "random_walk_seed_iterations" : 1, "no_end_word_cache" : true, "intermediate_fname" : "intermediate_data.sqlite", "save_model_versioned" : true } }