ulif/diceware

make the EFF word list the default?

anarcat opened this issue · 22 comments

at first glance, it sure looks like the EFF word list is superior to the other alternatives. is there any reason why it's not the default list?

ulif commented

The other lists were included before the EFF list. The current default, securedrop was provided by the securedrop project and they manually crafted their list for diceware. Therefore I feel a bit indebted to them.

But in the long run the EFF list would certainly make a better default. I will talk to the securedrop people if they can live with it.

In fact I just waited for someone else to ask for EFF as default, so, thanks for this!

Oh hey @ulif. I'm get guy who made that list for SecureDrop. SD moved away from the list I made to the EFF's list with this pr: freedomofpress/securedrop#1452. On those grounds (among others), I'd say it's probably worth it to use the EFF's list over the one I made.

ulif commented

Hey @heartsucker , great to hear from you! Thank you for the feedback! If you can live with it, I would say we switch to the EFF list as default as well.

FWIW I started another little project to check existing wordlists for diceware related flaws. Yet it is not really usable:

https://github.com/ulif/diceware-list

This Python project should generate two CLI scripts: a wordlist creator (diceware-list) and a wordlist checker (wlflakes). Yet not really usable but maybe helpful in the long run.

@ulif Yeah, by all means. Make the switch.

ulif commented

Ooh, just spotted: the EFF list is len 7776, not 8192. Nice for dice, bad for computers. That changes the game. IIRC there are no 2^n list available from EFF. Hm, have to reconsider.

Well either way, don't let any indebtedness influence your decision. :)

ulif commented

Thank you very much, @heartsucker , well, I feel indebted to you anyway.

hmm... this seems like a math problem we should be able to solve now... ie. how to map a X bit entropy source to a Y bit space... why can't we just mod(7776) the entropy source?

@anarcat I think just using mod would change the distribution (because the lower numbers would get the ones above 7776 also, changing the probabilities, right?)
I don't trust my math for security issues, though....

@ulif You could make the EFF list default when using real dice, and keep the previous one for urandom. That would barely increase complexity and be much safer, right?

@ulif @anarcat

Just an FYI, there is less than 0.5 bits of entropy difference between a list of size 7776 and 8192 at 6 words and less than 1 at 8 words.

>>> import math
>>> math.log(7776**6, 2)
77.54887502163469
>>> math.log(8192**6, 2)
78.0000000000000

While @ulif is right that 8192 is better, it is only just barely better for human use. The actual SecureDrop list is even shorter that 7776 because we removed "scary" words from the list in this PR: freedomofpress/securedrop#1546. Even this was considered to be ok since the EFF list was considered more memorizable and still secure.

@anarcat I think just using mod would change the distribution (because the lower numbers would get the ones above 7776 also, changing the probabilities, right?)
I don't trust my math for security issues, though....

well, when i mean modulo, i really mean to discard the reminder of a modulo division, or whatever... of course if you shuffle those discarded numbers back at the top you screw up the distribution, but if you discard the upper (say) 9 bits, you end up with a subset of the list that is 7680 words long and should be uniform.

but i'm not a certified cryptographer either (and certainly not a stats geek :p) so take that with a grain of salt.

ulif commented

@anarcat, I think @clawoflight is right. Oh, and you are right too, I think, if you discard bits. In fact that means you should simply discard values from random generators which are out of your scope (here: 1..7776).

As random generators will normally give you 1 or 2 or N bits, you can create numbers in the scope 0..2^N-1. But I think you cannot "map" or "move" "unused" bits to the next number, "reuse" them or similar (without making the result "unfair" or less equally distributed). I must confess, @anarcat, I did not get how you got the number 7680 (but I am neither a certified cryptographer).

The current implementation, however, copes with shorter wordlists already. So, you can already use the EFF list with "system" random. I should only check, whether this is done correctly (i.e. no mod(), instead discard out-of-scope values).

Also @heartsucker is right: the entropy difference is not too much, about 0.0752 bits per word. It is just: I simply hate wasting entropy ;-)

Thank you very much @anarcat, @clawoflight, and @heartsucker ! That was really helpful :D I actually tend to make EFF the default list and will close this issue with a respective commit later today.

here are my numbers:

  • securedrop wordlist length: 2^13 = 8192
  • original diceware word list length: 7776
  • 8 bits: 2^8 = 256
  • 9 bits: 2^9 = 512
  • stripping upper 8 bits: (2^13) - (2^8) = 7936
  • stripping upper 9 bits: (2^13) - (2^9) = 7680

Stripping 8 bits still gives us too many combinations. Stripping 9 bits, not enough. But we just need to reduce the length of the list to 7680, ie. remove 96 words, and we are spot on.

Hopefully someone will confirm my math :)

In the source code here, the call is delegated to SystemRandom and we're calling it via rnd.choice(some_list). This will give a uniform random distribution for any size list. There doesn't need to be any mucking about with doing modular arithmetic or discarding values.

right, that was my impression as well - i would also assume this is secure, but i'd be curious to see if the distribution is actually uniform...

The docs here say that SystemRandom uses os.urandom() which is crytpographically secure. This means that it is uniformly distributed. If we're at the point where we are asking if /dev/urandom is secure or not for random numbers, we are way outside the scope of the original question which is "give a list of N elements, how can we select M in a cryptographically secure way?"

The question is not whether SystemRandom is secure, but whether Random.choice() maintains a sufficiently uniform distribution for cryptographic purposes.

I think it's safe to say that the devs of the python core library have ensured that SystemRandom does a secure job of randomly selecting elements from a list. Straying from this implementation would get into the very dangerous territory of "rolling your own crypto."

well, here's the source in any case... random.choice calls random._randbelow and, if i read this right, this ends up calling SystemRandom.getrandbits() which does the right thing:

    def getrandbits(self, k):
        """getrandbits(k) -> x.  Generates an int with k random bits."""
        if k <= 0:
            raise ValueError('number of bits must be greater than zero')
        if k != int(k):
            raise TypeError('number of bits should be an integer')
        numbytes = (k + 7) // 8                       # bits / 8 and rounded up
        x = int.from_bytes(_urandom(numbytes), 'big')
        return x >> (numbytes * 8 - k)                # trim excess bits

I'm not absolutely certain randbelow yields a uniform distribution, however:

    def _randbelow(self, n, int=int, maxsize=1<<BPF, type=type,
                   Method=_MethodType, BuiltinMethod=_BuiltinMethodType):
        "Return a random int in the range [0,n).  Raises ValueError if n==0."

        random = self.random
        getrandbits = self.getrandbits
        # Only call self.getrandbits if the original random() builtin method
        # has not been overridden or if a new getrandbits() was supplied.
        if type(random) is BuiltinMethod or type(getrandbits) is Method:
            k = n.bit_length()  # don't use (n-1) here because n can be 1
            r = getrandbits(k)          # 0 <= r < 2**k
            while r >= n:
                r = getrandbits(k)
            return r

is that retry loop (while >= n) correct? i guess it just discards stuff that's too large so that should be okay...

then i'll let you wonder if this 3.6 implementation was the same in Python 2.7 (or earlier!!). ;)

ulif commented

Wow, this is escalating... in a nice way :)

Overall it looks like SystemRandom.choice() works correctly for our purposes, right? Certainly both parts are important: SystemRandom should deliver nearly-unpredictable bits, which choice() should distribute equally over the words of our lists. This seems indeed to be the case. Thank you @clawoflight and @heartsucker for your comments!

Thank you, @anarcat for looking into the sources! One should really do that from time to time.

The py2.7 sources BTW, have a different choice() implementation:

    def choice(self, seq):
        """Choose a random element from a non-empty sequence."""
        return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty

Here self.random() is a C function if I understand it correctly, that gives a float in range [0, 1]. If that works correctly, then the 2.7 implementation should work correctly as well. But as I have my problems with float representations in C (well, frankly, even in Python), I cannot judge the correctness or fitness of the function for our purposes.

At least I could not find critical flaws when generating respective numbers for some time tonight with the py27 choice() method. Of course that might be examined better with dieharder or similar tools, but for now I am satisfied with my little ad-hoc tests.

Overall, I am glad they look okay for our purposes. Yes, @heartsucker, rolling own crypto would be really dangerous.

Thank you @heartsucker, @clawoflight, and @anarcat for your valuable input! I will make the EFF list now the default.

i wouldn't bet my life on "self.random()", it sure looks like it's taking stuff from a (python) internal RNG, and not from the system 'getrandom' CRNG.... but i have invested too much time in this already... i do strongly encourage you to test choice for flaws, especially in 2.7.

ulif commented

Thanks @anarcat for your time! I learned new tricks, we all got a new default wordlist and I looked into the C-code of Python random implementations for the first time :)

I was wrong with the C implementation of random() in py2.7. In fact we use SystemRandom which provides an own Python method for random() in py2.7:

    def random(self):
        """Get the next random number in the range [0.0, 1.0)."""
        return (long(_hexlify(_urandom(7)), 16) >> 3) * RECIP_BPF

This one apparently gets random bits from /dev/urandom. I am not sure about the distribution, though.