'Choose' with non-uniform distribution
rserizel opened this issue · 6 comments
Hi,
Would it be possible to add support for 'choose' with a non-uniform distribution? For example, by passing a list of probabilities together with the list of options?
That would be great!
@pseeth let's jam this out super quick.
Starting with the API, I'm thinking we could either
- update "choose" so it can optionally accept a series of probabilities
- create a new distribution tuple called "chooseweighted" or something like that
I think option 1 will complicate validation a little, but will be more elegant than having two versions of choose. It will roughly match the API of numpy.random.choice (https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html), which we can use to implement it under the hood.
In terms of validation, the tuple would look something like ("choose", items, probs), where len(items) must match len(probs), and items[i] has probability probs[i] (that is, the matching between items and probs is implicit based on list index).
Finally, we probably need to check/enforce that probs is a vector of probabilities, i.e. 0 <= probs[n] <= 1 and probs.sum() == 1.
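The checks described above could be sketched roughly as follows. The function name validate_weighted_choice is hypothetical and this is not scaper's actual validation code. The sum is compared with a tolerance (an assumption on my part), since floating-point probabilities rarely add up to exactly 1.

```python
import numpy as np


def validate_weighted_choice(items, probs):
    """Hypothetical validator for an (items, probs) pair; not scaper's code."""
    if len(items) == 0:
        raise ValueError("items must not be empty")
    # items[i] is paired with probs[i], so the lengths must agree.
    if len(items) != len(probs):
        raise ValueError("items and probs must have the same length")
    probs = np.asarray(probs, dtype=float)
    # Each entry must be a valid probability.
    if np.any(probs < 0) or np.any(probs > 1):
        raise ValueError("every probability must lie in [0, 1]")
    # Tolerance-based comparison instead of exact equality (assumption).
    if not np.isclose(probs.sum(), 1.0):
        raise ValueError("probabilities must sum to 1")
    return True
```

For example, validate_weighted_choice(["siren", "car_horn"], [0.7, 0.3]) passes, while a length mismatch or probabilities summing to 0.8 would raise a ValueError.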
Let me know your thoughts. Once we finalize the API we can get started on the implementation.
I think perhaps a new chooseweighted could be cleanest, actually. Looking at the implementation of choose, it looks like we actually use randint to decide which label to apply. We also do a list -> set -> list conversion to remove duplicates. This logic is used both in the "choose label" as well as the "choose source file" code paths. I believe randint is used, not choice, here:
Lines 245 to 271 in 8cbf94b
This might be for efficiency reasons, as calling choice on a list with a million items might be a big performance regression. I think we should make a new distribution for this.
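For readers without the referenced lines at hand, the uniform selection described in this comment can be sketched as below. This is an illustrative reconstruction from the description only, not the code in the repository; the helper name choose_uniform and the use of numpy's RandomState are assumptions.

```python
import numpy as np


def choose_uniform(items, random_state):
    """Pick one item uniformly after removing duplicates (illustrative only)."""
    # list -> set -> list drops duplicates, so repeated entries in the
    # input do not make an item more likely to be picked.
    unique_items = list(set(items))
    # RandomState.randint excludes the upper bound, so the result is
    # always a valid index into unique_items.
    index = random_state.randint(0, len(unique_items))
    return unique_items[index]


rs = np.random.RandomState(0)
label = choose_uniform(["siren", "siren", "car_horn"], rs)
```

Because of the deduplication step, "siren" and "car_horn" are equally likely here even though "siren" appears twice in the input.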
Per offline discussion with @pseeth, we have agreed to keep choose and choose_weighted separate - this will make implementation and validation easier and cleaner. We'll use numpy.random.choice to implement the new feature.
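A minimal sketch of what weighted selection on top of numpy's choice might look like. The helper name choose_weighted_sketch is hypothetical and this is not scaper's actual implementation. It samples an index rather than the items themselves, so the items keep their original Python types.

```python
import numpy as np


def choose_weighted_sketch(items, probs, random_state):
    """Draw items[i] with probability probs[i] (illustrative only)."""
    # choice(n, p=...) samples an integer in [0, n) according to p;
    # numpy itself raises ValueError if p does not sum to 1.
    index = random_state.choice(len(items), p=probs)
    return items[index]


rs = np.random.RandomState(0)
label = choose_weighted_sketch(["siren", "car_horn"], [0.7, 0.3], rs)
```

Unlike the uniform case, duplicates are not removed here, since each position in items is tied to the probability at the same position in probs.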
I'll work on the implementation and ping @pseeth for CR as soon as it's ready.
@rserizel great news, we've implemented choose_weighted (#144), which is now available in the latest version of scaper, v1.6.5rc0.
To install this version you can call: pip install -U scaper==1.6.5rc0
The implementation should be solid, but let us know if you run into any issues and we'll address them promptly.
Cheers!
Works great!
Thanks a lot @justinsalamon and @pseeth !
@rserizel FYI we just released the official 1.6.5, in case you want to update your dependencies from scaper==1.6.5rc0 to scaper>=1.6.5. Cheers!