'Choose' with non uniform distribution

Hi,
Would it be possible to add a support for 'choose' with a non-uniform distribution? For example by passing together with the list of options, a list of probabilities?
That would be great!

@pseeth let's jam this out super quick.

Starting with the API, I'm thinking we could either

update "choose" so it can optionally accept a series of probabilities
create a new distribution tuple called "chooseweighted" or something like that

I think option 1 will complicate validation a little, but will be more elegant compared to having two versions of choose. It will sort of match the api of: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html, which we can use to implement it under the hood.

In terms of validation, the tuple would look something like ("choose", items, probs), where len(items) must match len(probs), and item[i] has probability[i] (that is, the matching between items and probs is implicit based on list index).

Finally, we probably need to check/enforce that probs is a vector of probabilities, i.e. 0 <= probs[n] <= 1 and probs.sum() == 1.

Let me know your thoughts, once we finalize the API we can get started on the implementation.

I think perhaps a new chooseweighted could be cleanest, actually. Looking at the implementation of choose, it looks we actually use randint to decide which label to apply. We also do a list -> set -> list conversion to remove duplicates. This logic is used both in the "choose label" as well as the "choose source file" code paths. I believe randint is used, not choice here:

scaper/scaper/util.py

Lines 245 to 271 in 8cbf94b

    
           def _sample_choose(list_of_options, random_state): 
        
               ''' 
        
               Return a random item from ```list_of_options```, using random_state. 
        
               If there are duplicates in ```list_of_options```, we remove them from the  
        
               list before sampling an item from the list. 
        
               Parameters 
        
               ---------- 
        
               list_of_options : list 
        
                   List of items to choose from. 
        
               random_state : mtrand.RandomState 
        
                   RandomState object used to sample from this distribution. 
        
               Returns 
        
               ------- 
        
               value : any 
        
                   A random item chosen from ```list_of_options```. 
        
               ''' 
        
               new_list_of_options = sorted(list(set(list_of_options))) 
        
               if len(new_list_of_options) < len(list_of_options): 
        
                   warnings.warn( 
        
                       'Removed duplicates from choose list. List length changed ' 
        
                       'from {:d} to {:d}'.format(len(list_of_options), len(new_list_of_options)), 
        
                       ScaperWarning) 
        
               index = random_state.randint(len(new_list_of_options)) 
        
               return new_list_of_options[index]

This might be for efficiency reasons, as doing choice on a million long list might be a big performance regression. I think we should make a new distribution for this.

Per offline discussion with @pseeth , we have agreed to keep choose and choose_weighted separate - this will make implementation and validation easier and cleaner. We'll use numpy.random.choice to implement the new feature.

I'll work on the implementation and ping @pseeth for CR as soon as it's ready.

@rserizel great news, we've implemented choose_weighted (#144), which is now available of the latest version of scaper, v1.6.5rc0.

To install this version you can call: pip install -U scaper==1.6.5rc0

The implementation should be solid, but let us know how if you run into any issues and we'll address them expediently.

Cheers!

Works great!
Thanks a lot @justinsalamon and @pseeth !

@rserizel FYI we just release the official 1.6.5, in case you want to update your dependencies from scaper==1.6.5rc0 to scaper>=1.6.5. Cheers!

	def _sample_choose(list_of_options, random_state):
	'''
	Return a random item from ```list_of_options```, using random_state.
	If there are duplicates in ```list_of_options```, we remove them from the
	list before sampling an item from the list.

	Parameters
	----------
	list_of_options : list
	List of items to choose from.
	random_state : mtrand.RandomState
	RandomState object used to sample from this distribution.

	Returns
	-------
	value : any
	A random item chosen from ```list_of_options```.

	'''
	new_list_of_options = sorted(list(set(list_of_options)))
	if len(new_list_of_options) < len(list_of_options):
	warnings.warn(
	'Removed duplicates from choose list. List length changed '
	'from {:d} to {:d}'.format(len(list_of_options), len(new_list_of_options)),
	ScaperWarning)
	index = random_state.randint(len(new_list_of_options))
	return new_list_of_options[index]