justinsalamon/scaper

'Choose' with non uniform distribution

rserizel opened this issue · 6 comments

Hi,
Would it be possible to add a support for 'choose' with a non-uniform distribution? For example by passing together with the list of options, a list of probabilities?
That would be great!

@pseeth let's jam this out super quick.

Starting with the API, I'm thinking we could either

  1. update "choose" so it can optionally accept a series of probabilities
  2. create a new distribution tuple called "chooseweighted" or something like that

I think option 1 will complicate validation a little, but will be more elegant compared to having two versions of choose. It will sort of match the api of: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html, which we can use to implement it under the hood.

In terms of validation, the tuple would look something like ("choose", items, probs), where len(items) must match len(probs), and item[i] has probability[i] (that is, the matching between items and probs is implicit based on list index).

Finally, we probably need to check/enforce that probs is a vector of probabilities, i.e. 0 <= probs[n] <= 1 and probs.sum() == 1.

Let me know your thoughts, once we finalize the API we can get started on the implementation.

I think perhaps a new chooseweighted could be cleanest, actually. Looking at the implementation of choose, it looks we actually use randint to decide which label to apply. We also do a list -> set -> list conversion to remove duplicates. This logic is used both in the "choose label" as well as the "choose source file" code paths. I believe randint is used, not choice here:

scaper/scaper/util.py

Lines 245 to 271 in 8cbf94b

def _sample_choose(list_of_options, random_state):
'''
Return a random item from ```list_of_options```, using random_state.
If there are duplicates in ```list_of_options```, we remove them from the
list before sampling an item from the list.
Parameters
----------
list_of_options : list
List of items to choose from.
random_state : mtrand.RandomState
RandomState object used to sample from this distribution.
Returns
-------
value : any
A random item chosen from ```list_of_options```.
'''
new_list_of_options = sorted(list(set(list_of_options)))
if len(new_list_of_options) < len(list_of_options):
warnings.warn(
'Removed duplicates from choose list. List length changed '
'from {:d} to {:d}'.format(len(list_of_options), len(new_list_of_options)),
ScaperWarning)
index = random_state.randint(len(new_list_of_options))
return new_list_of_options[index]

This might be for efficiency reasons, as doing choice on a million long list might be a big performance regression. I think we should make a new distribution for this.

Per offline discussion with @pseeth , we have agreed to keep choose and choose_weighted separate - this will make implementation and validation easier and cleaner. We'll use numpy.random.choice to implement the new feature.

I'll work on the implementation and ping @pseeth for CR as soon as it's ready.

@rserizel great news, we've implemented choose_weighted (#144), which is now available of the latest version of scaper, v1.6.5rc0.

To install this version you can call: pip install -U scaper==1.6.5rc0

The implementation should be solid, but let us know how if you run into any issues and we'll address them expediently.

Cheers!

Works great!
Thanks a lot @justinsalamon and @pseeth !

@rserizel FYI we just release the official 1.6.5, in case you want to update your dependencies from scaper==1.6.5rc0 to scaper>=1.6.5. Cheers!