getguesstimate/guesstimate-app

Strange behaviour when adding custom distributions

Ghoughpteighbteau opened this issue · 3 comments

I have an example here: https://www.getguesstimate.com/models/12144

A brief description: I'm adding a pair of 6-sided dice together and expecting a correct distribution of results. However, Guesstimate seems to be adding them together like parallel arrays.

When I change one of the dice to a 7-sided die, it works as expected.

It seems like there are two behaviors: one for data sets of the same size and one for data sets of different sizes.

Hey, I decided to have a little look at this and I can see what's going on a little better now. There's actually a pretty important bug here, and I hit it when I was doing estimates on how much RAM my servers would need!

Take a look at

[{ text: '3*AK*BA', inputs: {AK: [3,8], BA: [5,8]}}, 2],

Line 14 is saying that if you have [3,8] and [5,8], you need only 2 samples. It should be 4.

You must take the cross product of the two samples, sampling (3,5), (8,5), (3,8), and (8,8).

(3,5) and (8,8) alone are not enough. Even worse, if a user (like me) happens to enter data that has been sorted, then elements [0] and elements [1] will have a spurious correlation, which will totally mess up the sensitivity analysis and produce far larger variations than you would expect.
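
To make that concrete, here's a tiny sketch (plain JavaScript, not the actual sampler code) contrasting the parallel-array pairing with the full cross product for [3,8] and [5,8]:

// What happens today: equal-length inputs get zipped like parallel arrays,
// so only 2 combinations are ever produced.
const AK = [3, 8];
const BA = [5, 8];
const pairwise = AK.map((a, i) => [a, BA[i]]);
// => [[3, 5], [8, 8]]

// What the cross product would give: all 4 combinations.
const crossProduct = AK.flatMap(a => BA.map(b => [a, b]));
// => [[3, 5], [3, 8], [8, 5], [8, 8]]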

The docs say that Guesstimate just samples from the data randomly 5000 times like anything else, and that's actually how I worked around this problem, using pickRandom(). But I think that's what you guys should actually do.
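
Roughly, the workaround looked like this (pickRandom here is just a stand-in helper I wrote, not a particular library call):

// Ignore element order entirely and draw each sample independently
// at random from every custom data set.
function pickRandom(arr) {
  return arr[Math.floor(Math.random() * arr.length)];
}

function sampleSum(die1, die2, n = 5000) {
  const samples = [];
  for (let i = 0; i < n; i++) {
    samples.push(pickRandom(die1) + pickRandom(die2));
  }
  return samples;
}

const die = [1, 2, 3, 4, 5, 6];
const sums = sampleSum(die, die); // 2d6 distribution, no spurious correlation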

OAGr commented

Apologies for going through this only now, and greater apologies if that was confusing! These bits can be quite tricky, especially because multiple properties compete with each other.

What's going on is this: there were several cases where organizations had data whose samples all came from the same place, so sampling them sequentially was important; they want the correlation. The system guesses that inputs are sequential if they have the same length.
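
As a rough sketch (not the code in the repo), the heuristic boils down to something like:

// If every custom input has the same number of samples, treat them as
// aligned rows of one data set and keep the correlation; otherwise recombine.
function looksSequential(inputs) {
  const lengths = inputs.map(arr => arr.length);
  return lengths.every(len => len === lengths[0]);
}

looksSequential([[3, 8], [5, 8]]);    // true  -> sampled pairwise
looksSequential([[3, 8], [5, 8, 2]]); // false -> recombined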

There should probably be some other way to handle this, perhaps a setting or similar. It's definitely unintuitive.

One thing you can do if you do want randomness here is to use at least one input with a different length. In the case of dice, you could give one of them twice the samples. I realize this is messy, though.
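
For the dice example, that workaround would look something like this:

// Same distribution for both dice, but different lengths, so the inputs
// are no longer treated as one aligned data set.
const die1 = [1, 2, 3, 4, 5, 6];
const die2 = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6];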

Guesstimate samples 5000 times in the cases where the data is newly generated, like for standard distributions; for custom data that didn't seem appropriate. It would be nice to add settings or similar later.

The way it chooses how many samples to take is a bit messy; you can see the full code here:
https://github.com/getguesstimate/guesstimate-app/blob/2096d6c161acce49d28be7a17e7ca93eff91692c/src/lib/guesstimator/samplers/Simulator.js

It could probably be rethought a bit.

We generally tried to be conservative when it came to sample counts. The full cross product could get pretty large as a model grows, and we do want to keep it from ever going over 10k samples or so. Maybe it would be reasonable to use the cross product in small cases and the lowest common multiple in other cases. Would something like that make sense?
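
Something along these lines, as a rough sketch of that idea (the names and constants are placeholders):

const MAX_SAMPLES = 10000;

function gcd(a, b) { return b === 0 ? a : gcd(b, a % b); }
function lcm(a, b) { return (a * b) / gcd(a, b); }

function chooseSampleCount(lengths) {
  const crossProduct = lengths.reduce((acc, len) => acc * len, 1);
  if (crossProduct <= MAX_SAMPLES) return crossProduct; // small: enumerate everything
  const commonMultiple = lengths.reduce(lcm, 1);        // otherwise: lowest common multiple,
  return Math.min(commonMultiple, MAX_SAMPLES);         // capped at ~10k
}

chooseSampleCount([2, 2]);       // 4 (full cross product)
chooseSampleCount([500, 300]);   // 1500 (lowest common multiple)
chooseSampleCount([5000, 3000]); // 10000 (capped)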

Alternatively, we could have some per-model settings, but that could be tricky.

Regarding size, I would expect Guesstimate to always take the cross product unless the cross product would be larger than 5K; if it reaches 5K, then it should just sample randomly. That would be consistent with Guesstimate's MO, and I would find that behavior unsurprising.
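
Concretely, I'd expect something like this (just a sketch with made-up names, not a patch):

const SAMPLE_LIMIT = 5000;

function combine(inputs, fn) {
  const crossProductSize = inputs.reduce((acc, arr) => acc * arr.length, 1);

  if (crossProductSize <= SAMPLE_LIMIT) {
    // Small enough: enumerate every combination exactly.
    return inputs
      .reduce((combos, arr) => combos.flatMap(combo => arr.map(x => [...combo, x])), [[]])
      .map(combo => fn(...combo));
  }

  // Too big: fall back to independent random draws.
  const samples = [];
  for (let i = 0; i < SAMPLE_LIMIT; i++) {
    const draw = inputs.map(arr => arr[Math.floor(Math.random() * arr.length)]);
    samples.push(fn(...draw));
  }
  return samples;
}

// 2d6: only 36 combinations, so it enumerates them exactly.
const die = [1, 2, 3, 4, 5, 6];
const sums = combine([die, die], (a, b) => a + b);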

Regarding the intention of correlated values: now that I'm thinking of this as a feature, I can see how useful it could be. However, this feature needs to be done intentionally, because as it is now it's a footgun, and a footgun that actually blew my foot off, no less.

I have an idea as to how you might present and work with a correlated data set in Guesstimate, if you're interested.