warner/wireguard-vanity-address

faster search algorithm

warner opened this issue · 2 comments

At RWC this year, we (maybe @gtank or @tarcieri?) talked about a faster approach to finding suitable keypairs. The current scheme performs a large number of independent trials (in parallel on all available CPU cores), where each one picks a random secret key (a scalar), performs the (expensive) scalarmult operation to transform it into a public key (a point), then examines the base64-encoded public key for a match against the desired prefix.

The new scheme would take advantage of the fact that point addition is much much faster than scalar multiplication. The initialization step looks like:

  • set s to a randomly-selected scalar, clamped as usual
  • set scalar_offset to 8 (i.e. the Ed25519 group's cofactor)
  • set p to scalarmult(s, BASEPOINT)
  • set point_offset to scalarmult(scalar_offset, BASEPOINT)

Then each step of the loop looks like:

  • base64-encode p and test against the desired prefix. If it matches, print the result and start over again from initialization
  • else, set s = s + scalar_offset and p = p + point_offset
  • repeat

The scalar addition is done modulo the group order, and the point addition follows the usual rules of point addition.

The speed is bounded by the point addition, and we only do one scalarmult per output keypair. Much much faster. For any sensible scalar (large enough to have wrapped around at least once, which is ensured when the clamping step sets the 2^254 bit, I think), adding point_offset makes the point jump to an entirely new portion of the numberspace, so the prefix won't be very correlated, and we won't lose much yield to correlation. (if our prefix comes from the high-order bits, and the point-addition only changed the low-order bits, then the prefix would change very slowly).

Some details to figure out before implementing this:

  • clamping: scalar_offset must be a multiple of the cofactor for this to work at all (assuming s is a valid+safe private key, s + scalar_offset must also be one)
  • wrapping: if clamp(s + scalar_offset) != s + scalar_offset, then we'll also have problems. We could either test this at each cycle of the loop, or rely upon the fact that the group order is so large (about 2^252) that we'll never be unlucky enough for this to ever happen.
  • we'd reset the process each time we find a match because otherwise using two different keypairs from the same run would be mutually-unsafe. If someone compromised the first machine and learned the private key, they could search linearly outwards from that scalar to find the other one very quickly. This wasn't a problem when randomly selecting a scalar each time, but the speedup provided by using related scalars means we must avoid using any starting point more than once.

This is related to #1 because you could give each worker the starting point p (and they'd all use the same point_offset) and then just ask them to tell you how many steps they took before getting to a suitable pubkey. Then on the leader machine (which is the only one to know the private scalar) you just add their step count to the private scalar to get the resulting private key.

This will require using curve25519-dalek in addition to the DH-centric x25519-dalek (which treats private keys as opaque objects without + defined on them).

The 12-count-scalars branch has a functioning implementation of this. Looks like a 4.8x speedup on all my (fairly modern) machines. The slowest part is the conversion of public-key points from Edwards form to Montgomery form (necessary because x25519, Wireguard's key-agreement algorithm, uses Montgomery-form keys to generate the base64 encoding that we have to examine).

test b1_point_generation              ... bench:         190 ns/iter (+/- 8)
test b2a_point_to_montgomery          ... bench:       3,154 ns/iter (+/- 228)
test b2b_montgomery_to_bytes          ... bench:           0 ns/iter (+/- 0)
test b2c_bytes_to_base64              ... bench:          99 ns/iter (+/- 3)
test b2d_base64_contains              ... bench:          88 ns/iter (+/- 17)
test b2e_total_point_checking         ... bench:       3,391 ns/iter (+/- 187)
test b3_point_generation_and_checking ... bench:       3,678 ns/iter (+/- 172)

Not clean enough to land just yet, but it's looking quite promising.