openwall/john-tests

Do we need to add --restore tests?

frank-dittrich opened this issue · 60 comments

Idea: run a test with --max-run-time=1, restore repeatedly until finished, then check if all has been cracked as expected.
I think we need just 2 tests (may be one saltless plus one fast salted format).

Or may be some more, for standalone --mask and for --wordlist [--rules] --mask.

(The --fork can be added as -passthru.)

Definitely. I thought I added an issue similar to this already but perhaps I didn't. This can be made with just one format (any), because we are testing a mode, not a format.

But may be we shouldn't use the fastes format, otherwise, --max-run-time=1 might complete the task without the need to restore, especially if you also use --fork.
So, completing the task without the need to restore should cause a warning.

jfoug commented

I am building this now. I am using the pw-new.dic, with the 50 at tail appended a few times. I did it for md5-crypt. It seems to take long enough to have multiple runs. We may have to add a number to the -restore switch (it is just a boolean now), so that you can specify number of seconds per run. Right now, I do a 10s run. But there may be fast enough machines that 10s does not cause enough retries.

This will be a self contained mode, just like internal. It runs, and the function does not return, it exits the script returning proper error code.

jfoug commented

Can we do this test for core??? I am not sure. I am using --max-run-time which I am 99% sure is a jumbo feature. I am not sure that I could do this for core.

I guess we could thread this somehow (or even just using & to put process in background), and then signal it. Speaking of running that way, that might be a good 2nd test even for jumbo. john likely may behave differently with a signal restart vs a -max-run-time restart. I do not know the restore (or save session) code that well, so I am not 100% sure just what to test.

do you two have any thoughts here? I can probably get a POC done and checked in shortly. I guess once we have it working, we can expand upon it if we need to.

core doesn't have --max-run-time, but Iguess if jumbo passes this test we can safely assume core will pass as well.
You could use ulimit -t 10 ./john ... (and ignore $? when john gets killed for exceeding ulimit), but I don't think it is necessary.

jfoug commented

$? already is 256 (1) when we hit the time limit. When john completes prior to time limit, return is 0. That is my 'current' check. I am also trying to get the ETA lines to output (as a how it is going), but perl is being a pisser for me.

jfoug commented

openwall/john#1859 (comment)

  • wordlist
  • single
  • wordlist - rules (hybrid)
  • mask
  • wordlist - mask (hybrid) Method finds BUGGY behavior in john.
  • wordlist - rules - mask (hybrid)
  • markov
  • markov + mask
  • prince (can't be done unless we do much longer runs)
  • rexgen (No resume, done but commented out)
  • rexgen $0 (wordlist) (only the wordlist resumes)

For the wordlist + other modes, we need to make sure that the test is long enough, and with enough breaks, that multiple rules are tested, along with different stopping points within the word file. So we may want a longer wordlist and fewer rules, and a shorter wordlist with LOTS of rules.

Also, we may want to perform some tests using -fork and restore (with the above modes also)

We should probably use one of the slowest formats for this, as we need to crack every single generated candidate in order to know we didn't miss one at restore. Perhaps some PBKDF2 with 10000 iterations. We should also use at least 10 salts I guess (so we don't miss catching a bug where not all salts got tested after a stop-restore cycle). Or maybe 100 salts and fewer iterations. We could pick any custom number of iterations, targeting a convenient speed.

Isn't the current implementation producing 3886 candidates while only having 1994 hashes to crack? There's a significant risk we pass a test even though several candidates were missed.

Ah, no, it rejects input longer than 15 chars so we only have 1994 candidates too. At least for SIMD builds, it's fine.

jfoug commented

Added wordlist+rules. Using korelogic ^d^d^d rules. I did a shuf | head 4500, then created the wordlist with 38 byte passwords (3200 of them). Then salted-sha1 was used. Takes a couple minutes to run, and I have 10s runtimes. Should be a pretty valid test.

e313cc7

jfoug commented

Having issues running this on my VM (64 bit). Had to remove the 'extra' passwords added, and increase time that the exits happen at. Not sure why ....

Do we have a method that will exit john after X number of candidates are found?

$ ../run/john -w=pw-new.dic -rules=PrependNumNumNum -stdout=39 -v:1
7770000p 0:00:00:01 100.00% (2015-11-04 19:33) 6815Kp/s 999ko}GpppVeLj

You test 7,770,000 candidates against 3,200 hashes? If there are a few missed candidates at resume, how would you be sure they are among these 3,200?

jfoug commented

Well look at it this way. I have a random 'mix' of 3200 hashes created out of those 7M possible data. This is no different than any other run of the TS. There are candidates which are not part of the 'found' set (or actually ANY password attack). All that matters is when things are all said and done, that all 3200 were found.

jfoug commented

how would you list to 'test' this ? I guess we could do a super slow hash, with a 'packed' list of data.

jfoug commented

If you can come up with test data, and a set of john tests runs (well, just 1 "run', and then multiple -rest=xxx runs), I will certainly put it into the TS.

The only safe way to test it is to have a hash list that need all the generated candidates in order to fully crack. I believe the first (wordlist only) test is fine as-is but this last one has only a 1/2500 chance of catching a missed candidate. We have to be sure any one missed candidate is catched or this will be pretty much useless.

Maybe we could simply use the default wordlist: ../run/john -w -ru:appendNumNum - that would total about 35000 candidates. So we create salted-sha1 hashes for them, possibly with just eg. 1000 different salts (trying to keep the speed up). We can modulate speed with number of unique salts but in the end, every single candidate should crack exactly one hash. And we probably want to ensure it gets resumed at least 10 times so we may have to aim for ~15 seconds (if ran without --max-run-time) on a very fast system.

If that ends up fine, we can add an almost identical test for wordlist + mask: ../run/john -w -mask:?w?d?d, using the same hash list.

Note that all we need to test here is resume. We don't have to involve any over-length candidates or things like that.

I can experiment a little with it later unless you beat me to it. I'm currently working on truecrypt-opencl.

Do we have a method that will exit john after X number of candidates are found?

Using an external filter, we can abort after X number of candidates are tried. See [List.External:AutoAbort]

jfoug commented

I created a 35k test, and it was done in 20s (35k of salts). I then changed the rules to do this:

$[0-9]$[1-2] and ended up with 70k. This file took 80s, which is 'about' right for doing 8s max run times. It should take 10-12 restarts to get it all (depends upon how much 'replay' there is. Since it is only 20 rules, there may be a 'lot'.

I'd prefer a mask '?w?l?d' or '?w?l?s' over '?w?d?d', because it is more likely that the number of candidates generated per input word is not a multiple of max. keys per crypt (or vice versa).

Hmm the batch size of salted sha1 (avx) for 8 threads is 32K, that's too large for this little test. Maybe we should run it with OMP_NUM_THREADS=1 and --mkpc=4093 or some other prime number. That would mitigate @frank-dittrich's worries too.

Then you have to disallow --mkpc= for -passthru for this specific test, or replace 4093 with the value given on the command line.

jfoug commented

Using salted-sha1 seemed to be 'hard' to get it to work right all the time. It would get in a loop easily, where it would restart far enough back, that it would find no candidates.

So I made a wordlist of 20 'garbage' words, and used appendnumnum rule (100 rules). I then did a pass_gen creation of bitcoin (which has OMP scale of 1), and set the rounds really low (2000).

Now running on a 10s max time, I get about 3-4% gain each 10s loop (shows up using -v)

I think this is a better test. I will get it cleaned up, and checked in.

jfoug commented

--rules=appendNumNum and -mask=?w?d?d work fine with same input file. I will get things checked in. Wasn't mask-hybrid the problem recently? If so, can we make sure the TS fails against the broken version.

NOTE, it is safer to run in the -v mode, so that there is some screen interaction, and so you can watch to make sure the sessions do not get 'locked up'

WOOT WOOT. Shows the bug just fine.

$ ./jtrts.pl -restore
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.13, Dec 21, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper 1.8.0.6-jumbo-1-1562-ga7456a8+ OMP [cygwin 64-bit AVX-ac]
--------------------------------------------------------------------------------

John Jumbo build detected.
Running JTRTS in -restore mode (against a 2000 candidate bitcoin input file -rules=appendNumNum)
  ** NOTE: may take a couple minutes to run

Done with run

Results (should be 2000) : 2000

Running JTRTS in -restore mode (against a 2000 candidate bitcoin input file -mask=?w?d?d)
  ** NOTE: may take a couple minutes to run

Done with run

Results (should be 2000) : 1154
$ ./jtrts.pl -restore
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.13, Dec 21, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper 1.8.0.6-jumbo-1-1562-ga7456a8+ OMP [cygwin 64-bit AVX-ac]
--------------------------------------------------------------------------------

John Jumbo build detected.

Running JTRTS in -restore mode (against a 2000 candidate bitcoin input file )
  ** NOTE: may take a couple minutes to run
Done with run.  Results (should be 2000) : 2000     PASS

Running JTRTS in -restore mode (against a 2000 candidate bitcoin input file -rules=appendNumNum)
  ** NOTE: may take a couple minutes to run
Done with run.  Results (should be 2000) : 2000     PASS

Running JTRTS in -restore mode (against a 2000 candidate bitcoin input file -mask=?w?d?d)
  ** NOTE: may take a couple minutes to run
Done with run.  Results (should be 2000) : 1268     FAIL!

As we can see, the count for the mask resume fluctuates. That is exactly like we would expect.

jfoug commented

546139a

This now has 3 restore types> NOTE the wordlist+mask shows problems in JtR.

👍 perfect!

jfoug commented

I have several other 'modes' done, I just have to make sure I have them right.

I have single, pure-mask, pure-rexgen and hybrid-rexgen 'done'. I only had to make one other dictionary (for single). In that dict, I kicked the iteration count up much higher for bitcoin. I did re-create the base dictionary file. The dictionary is now:

111110 ..... 111129 (20 numbers). And I use -rules=appendnumnum.

So it is easy to use pure mask, and pure rexgen, along with pure single will find them all (the prior dictionary had a ; and a space char in a couple words, and thus single was not as friendly).

Single mode is 2k hashes, with 177864 iterations (bitcoin). It is much slower, but I think the timing is about right. It might be a 'bit' slow, but that should not really hurt anything. This is just a testing tool. Once we get all the various tests in, and running fine, then we should not have to perform these checks very often. Probably would be GOOD to check before each real release.

I wouldn't be surprised if we found some more needs for fixes, minor or not. These new tests are very good Q.C.

jfoug commented

Nice thing, is that they are somewhat static. We get them fixed, and there will infrequelty be new test types added, and the amount of code change really is not that often.

Does -stoponerror work for this? When debugging I'm pretty sure I'll need to see exactly what was missed, and/or run the same tests manually.

jfoug commented

each test will abort if the proper results are not found.

NOTE, for now, use -v The timings are static, and may not be right for specific CPU and or -fork and or OMP settings -v lets you see things as they chip away, so you can make sure the thing is still finding records. If the resume goes backwards TOO far, then it will get stuck, not being able to work through a single buffer/block in the time specified. The bitcoin is pretty minimal. It is OMP_SCALE=1 so it deals with 'few' candidates.

jfoug commented

This is the 'outer' function:

sub doRestoreMode {
    if ($core_only == 1) {
        ScreenOut("John CORE build detected.\n The -max-run-time mode ONLY works for jumbo build of john.\n");
        exit 1;
    }
    `rm -f tst-*`;

    # grow the pw-new.dic file:
    my $cmd = "$JOHN_EXE -rules=appendNumNum --stdout --w=bitcoin_restart_rules_tst.dic > tst-pw-new.dic 2>/dev/null";
    $cmd = `$cmd`;
    doOneRestore("-w=tst-pw-new.dic", "bitcoin_restart_rules_tst.in", 2000, 20, "bitcoin", "");
    unlink("tst-pw-new.dic");

    # now test wordlist + rules.
    doOneRestore("-w=bitcoin_restart_rules_tst.dic", "bitcoin_restart_rules_tst.in", 2000, 20, "bitcoin", "-rules=appendNumNum");
    # now test wordlist + mask.
    doOneRestore("-w=bitcoin_restart_rules_tst.dic", "bitcoin_restart_rules_tst.in", 2000, 20, "bitcoin", "-mask=?w?d?d");
    # now test single mode.
    doOneRestore("", "bitcoin_restart_single_tst.in", 2000, 20, "bitcoin", "");
    # now test pure mode.
    doOneRestore("", "bitcoin_restart_rules_tst.in", 1000, 20, "bitcoin", "-mask=11111?d?d?d");
    doOneRestore("", "bitcoin_restart_rules_tst.in", 1000, 20, "bitcoin", "-mask=11112?d?d?d");
    # now test pure rexgen mode.
    doOneRestore("", "bitcoin_restart_rules_tst.in", 2000, 20, "bitcoin", "-rexgen=1111[1-2][0-9][0-9][0-9]");
    # now test wordlist + rexgen mode.
    doOneRestore("-w=bitcoin_restart_rules_tst.dic", "bitcoin_restart_rules_tst.in", 2000, 20, "bitcoin", "-rexgen=/0[0-9][0-9]");

    exit 0;
}

And here is the worker function that does each test.

sub doOneRestore {
    my ($dic, $hashes, $cnt, $runtime, $form, $exargs) = @_;

    ScreenOutSemi("\nRunning JTRTS in -restore mode (against a $cnt candidate $form input file $exargs)\n  ** NOTE: may take a couple minutes to run\n");
    my $cmd = "$JOHN_EXE -ses=tst- $pass_thru $exargs $dic $hashes -pot=tst-.pot -max-run=$runtime -form=$form 2>&1";
    ScreenOutV("Running 1st retore command, command line\n\n$cmd\n\n");
    my $results = `$cmd`;
    my $ret = $?;
    ScreenOutVV("Results of this run are: $results\n return code [".($ret>>8)."]\n\n");
    if ($verbosity > 2) { show_eta($results); }
    $cmd = "$JOHN_EXE -res=tst- 2>&1";
    while ( ($ret>>8) == 1) {
        ScreenOutV("Running continuing retore commands.  Cmd line: $cmd\n");
        $results = `$cmd`;
        $ret = $?;
        `stty echo >/dev/null 2>/dev/null`;
        ScreenOutVV("Results of this run are: $results\n return code [".($ret>>8)."]\n\n");
        if ($verbosity > 2) { show_eta($results); }
    }
    # now compute if we got them all.
    $results = `LC_ALL='C' sort tst-.pot | LC_ALL='C' uniq | LC_ALL='C' wc -l`;
    chomp $results;
    print ("Done with run.  Results (should be $cnt) : $results   ");
    cleanup();

    if ($results != $cnt) {
        print "FAIL!\n";
        exit 1;
    }
    if ($error_cnt+$error_cnt_pot+$ret_val_non_zero_cnt != 0) {
        print "Some error status.\n";
        exit 1;
    }
    print "PASS\n";
    `rm -f tst-*`;
}

s/retore/restore/g

jfoug commented

Damn nit picker ;) That was simply an easter-egg for -v mode, lol

wordlist and wordlist+rules passed just fine here. And hybrid mask failed. Will look into it. The fact we can separate the basewords from the digits should mean I can see pretty much in what way it fails.

Oh, but it deleted the pot file...

jfoug commented

I thopught that happened AFTER the exit. I will look.

ahh, there is a call to 'cleanup' in there.

I'll run it manually with a -max-run that does it all in just one resume. Should be pretty clear then.

jfoug commented

-single mode seems to work.
-pure mask seems to work. (only first 1000 checked so far, however).

jfoug commented

I will have to let you get rexgen working. I no longer spend time to get that boat anchor code working. I will check in what I have. I think it is ready to roll.

jfoug commented

Please see if rexgen is setup right. NOTE, you will have to comment out the mask test, to be able to continue the testing (or we could move that test to the end, knowing it fails).

bd0ac12

jfoug commented

b7a79ed puts mask+wordlist to the last test.

Hmm the "single" test is actually running batch mode. We'll see how it goes. I assume it should explicitly run -single, right?

Ah, but -single is the first mode of batch mode. This will work fine except if it actually does miss anything. If it does, it will continue with wordlist and then incremental, and "never" end, lol.

Or are you limiting the number of resumes? Doesn't look so in the code above.

jfoug commented

No, there is no 'limit'. It runs until the return says jtr is done.

So in this case, the -single SHOULD be added. BUT if there are any missed, then it still is a problem, and will run a LONG time.

Fortunately, single 'seems' to resume properly.

Luckily single/batch did pass.

Not sure why you do two pure mask runs. It could be one if using -mask=1111[12]?d?d?d

And oops the rexgen test went into an infinite loop! I need to quit now, will look at this tomorrow.

Made some fixes. Two main problems: It's --regex and not --rexgen. Also, pure rexgen actually doesn't support resuming (it does warn about it) so that is commented out from TS now. But wordlist+rexgen should work (only saving parent mode's state). It doesn't work very good though - I think the problems is it doesn't have enough time to get ahead so it's stuck at the initial 10%. If that's it we could simply bump --max-run-time for that test to 20s but I'm ignoring this problem for now and will try to fix hybrid mask.

We're not looking for "capability rexgen" are we? We need to add that so we don't try to run regex mode when there is none.

You' need a special mode to test --regex, so that just a few combinations per candidate will be generated.
Otherwise, you'll never proceed to the next block of input words.
Plus, keep the run time slow. (For very fast hashes, --max-run-time just 1 or 2, otherwise, the memleak issues will cause trouble.

jfoug commented

If that's it we could simply bump --max-run-time for that test to 20s

That has been my 'solution' so far. Also, if doing non-omp and -passthru=-fork=xx then often the times have to be adjusted also. So in general, the code 'works', but needs tweaked, depending upon the build/CPU etc.

I will deal with rexgen. It's totally busted right now regardless.

Note to self

$ ../run/john -inc:lower -mask:?w?d?d -stdout | ../run/pass_gen.pl bitcoin -loops=2000 -count=2000 > inc_hybrid_tst.in 

@jfoug that's how you did them, right?

bah, it didn't honor the -loops option, I'm fixing that right now.

jfoug commented

No, I made a 20 word dictionary (bitcoin_restart_rules_tst.dic), and then did:

../run/john -w=bitcoin_restart_rules_tst.dic -rules:appendnumnum -stdout | ../run/pass_gen.pl bitcoin -count=2000 > inc_hybrid_tst.in

I then edited pass_gen.pl to first set the loop to be 800 (I think), then set it to be 177k for 'single' mode. Single had to be much slower. If I used the original input file, it completes in less than 1s.

There are a lot of formats which do not honor -loops. That should be fixed for all.

jfoug commented

Loops has been 'normalized' now. There is a getter function. In one case we were even using the wrong variable. --loops handling is now added to all formats (all I think) which have a iterations variable within the hash string. One that is 'missed' is phpass hash. Also, I think scrypt might have something, and possibly a couple others, which have a letter or number that ends up being (1<<x) for iterations. But right now, -loops does work more consistantly.

jfoug commented

Added wordlist+rules+mask. This should probably be enhanced some. NOTE, it 'passed' even though wordlist+mask fails. Now, in this mode, there is only 1 mask character (hence the 'should improve' clause.

We probably should do this as a single word in the wordlist: 11111 and then rules of:

[12][0-9] and mask of [0-9][0-9] so that both of them have at least 2 characters. But I have not done that yet.

jfoug commented
doOneRestore("Wordlist+Rules+Mask", "-w=bitcoin_restart_rules_tst.dic", "bitcoin_restart_rules_tst.in", 2000, 20, "bitcoin", "-rules=appendNum -mask=?w?d");
`echo "1111"> tst-pw-new.dic`;
doOneRestore("Wordlist+Rules+Mask #2", "-w=tst-pw-new.dic", "bitcoin_restart_rules_tst.in", 2000, 20, "bitcoin", "-rules=append12Num -mask=?w?d?d");
unlink("tst-pw-new.dic");

Both tests for -w+-ru+-mask pass. The first is rule to append 1 number and mask to append 1 number to each normal input word. 2nd test is 1 input word. rule does [12][?d] and mask is [?d][?d]

They both 'pass'. But pure Wordlist+mask fails ?!!!?!

We probably should do this as a single word in the wordlist: 11111 and then rules of:

I think we need at least three words in the base wordlist.

I'm working with markov tests. Prince can't be tested, it has far too bug chunks (for performance reasons) so won't store any progress in this short runs. We simply rely on Atom's original resume code.