uuidjs/uuid

Count test on uniformity of occurrence in uuid (hex)

emclab opened this issue · 3 comments

Is your feature request related to a problem? Please describe.

Here reported is a test result. The purpose of the count test is to see whether the uuid algorithm favors or dislikes certain letters or numbers, by counting the occurrence of each hex character in the UUIDs generated (uuid 8.3.2). A total of 10000 UUIDs were generated and counted. Here is the code, run in reactjs (17.0.2)/nodejs 14.17.0, for the test:

import { v4 as uuidv4 } from 'uuid';

let round = 10000, temp1, objuuid = {};
for (let i = 0; i < round; i++) {
    console.log("ith round : ", i);
    temp1 = uuidv4(); // uuid generated
    for (let j = 0; j < temp1.length; j++) {
        let a1 = temp1.charAt(j);
        if (a1 !== "-") { // don't count "-"
            // increment this character's tally, starting at 0 if unseen
            objuuid[a1] = (objuuid[a1] || 0) + 1;
        }
    }
}

Here is output of objuuid in 3 run:

1st one:

0: 18934
1: 18723
2: 18789
3: 18712
4: 28840
5: 18491
6: 18628
7: 18732
8: 21498
9: 21241
a: 21238
b: 21223
c: 19003
d: 18820
e: 18492
f: 18636

2nd one:

0: 18768
1: 18837
2: 18572
3: 18945
4: 28557
5: 18908
6: 18922
7: 18734
8: 21265
9: 21044
a: 21274
b: 21344
c: 18681
d: 18604
e: 18775
f: 18770

3rd one:

0: 18676
1: 18743
2: 18712
3: 18741
4: 28722
5: 18947
6: 18747
7: 18736
8: 21316
9: 21253
a: 21111
b: 21413
c: 18795
d: 18704
e: 18737
f: 18647

From the results above (from more than 3 runs), it seems that "4", "8", "9", "a", and "b" are favored over the others. Will this "favoring" negatively impact the security of the UUIDs generated? Is the bias related to the PRNG used in reactjs/nodejs (crypto.randomBytes?)?


This is expected. The 13th character is the version field and will always be "4" for version 4 uuids. The 17th character contains the variant field and will always be "8", "9", "a", or "b". See https://datatracker.ietf.org/doc/html/rfc4122#section-4.1.2 for details.

Hi broofa, does this make the generated UUIDs a bit easier to guess?

In theory, yes. 122 bits of randomness is easier to guess than 128 bits.
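For scale, the gap between 128 and 122 bits is a factor of 2^6, which BigInt arithmetic confirms:

```javascript
// Fixing the version and variant fields removes 6 bits of randomness,
// shrinking the search space by a factor of 2^6 = 64.
const factor = 2n ** 128n / 2n ** 122n;
console.log(factor); // 64n
```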

In practice, there are two things you need to know. The first is that UUIDs are not meant to be unguessable. If you're really concerned about security, use technology designed to actually be secure. UUIDs aren't that.

The second is that UUIDs are actually pretty hard to guess.

"How hard", you ask? Well, let's try a fun little thought exercise ...


I've just handed you a 1,000 TB SSD filled with UUIDs, i.e., more than you are ever likely to generate, by many orders of magnitude. How long would it take to guess just one of the UUIDs on this drive?

For your first attempt you figure that's a lot of UUIDs. Surely it can't be that hard, right? So let's just guess one and see if it's on the drive.

Hmm, this is taking a while. Like, a long while. Why is it taking so long? Oh, right, SSDs can only read at a rate of 550MB/second. So just checking a single UUID is going to take... [monkey mashes on calculator]... 21 days. Wow, we really do have a lot of UUIDs here, don't we?

Okay, fuck that approach. This is what databases and db indexes are for. SQLite to the rescue! It can do 5,000 checks/second on indexed data, so let's give that a try. (Ignoring for the moment that SQLite's theoretical max db size caps out at a paltry 281 TB.) So slurp all the UUIDs into the DB, build your index and start guessing!

Hmm... this is still taking a while. Like, we've been going at this for weeks now and still no luck. Why isn't this working? Is there something wrong? How many guesses is this going to take???

Maybe we should think about this a bit. <grabs pen and paper> Okay, let's say we want a 1-in-a-million chance of guessing just one of the UUIDs out of this now-annoyingly-large dataset we've been given. That's a pretty modest goal, right? How many guesses are we talking about? Let's see, if we do a little math we get an answer of... (wait for it)... 10^19.9 guesses.

Wow. That's kind of a lot. What are we talking about with this SQLite setup (that is suddenly looking very, very underpowered for this)? Take that number, divide by 5000 times the number of seconds in a year, carry the one, round to the nearest millisecond and we get... 504M years.

Crap. We would have had to start this around the time multi-cellular life first appeared on earth. Well, at least we know why we haven't been able to guess one of the UUIDs yet. Clearly we need a different setup.

Sigh... okay, what's this gonna take? Google currently does about 63,000 queries/second but we know that's nowhere near enough. Let's throw everything we have at this and get to where we can do our checks in the time it takes to read a single UUID out of RAM, or about 3×10^9 checks/second. That's probably about as optimistic a goal as we should expect given modern technology.

I'm not sure how you build such a system, by the way. You'd need your dataset of UUIDs kept in-memory of course. You'd want to make it hardware-scalable so you'll need some sort of map-reduce architecture spread across dozens or hundreds of hosts. But your guess-data is going over network interfaces to get to the hosts, and that data is not small. In fact... huh... it's actually enormous. 127 exabytes. That's, like, 1-2 months worth of Internet traffic. Like, all traffic everywhere on the entire Internet.

Good thing AWS doesn't charge for data ingress on EC2 instances. Suckers.

Okay, so maybe this is doable, but keep in mind this isn't a trivial system. We'll be running queries roughly 50,000X faster than Google does search. Even if there are no big gotchas, there's probably a nice little PhD in comp-sci awaiting whoever designs this. But this type of performance is at that bleeding edge where you really can't say how much it'll cost or how long it'll take.

But whatever. "Details, mere details", we shout!

So after spending however much time and $$$ it takes to build our Handy UUID Guessing Engine™ (aka H.U.G.E), what have we accomplished? Remember, our goal is a modest 1-in-a-million chance of finding a match. So how long will it take to process our 10^19.9 guesses?

Answer: 812 years

...

I'm sorry, what was your question? Something about needing 6 more bits of randomness to make this problem harder...?

(My apologies if this comes across as snarky. It's not intended that way. I understand the motivation for your question, and it's a good question to be asking. But the numbers involved here are so large that our intuitive grasp of the "guessability" of things breaks down completely.)