scikit-learn-contrib/metric-learn

How does limiting the constraint generation work?


Description

I am not sure how constraint creation works. If I limit the number of constraints, will the Supervised class remove examples from the similar pairs, from the negative pairs, or will it arbitrarily cut the part of the data that comes after num_constraints?

The pair generation process is as follows: we first sample a point x from the dataset X, then we sample another point of X, either from the same class as x (same y) for similar pairs or from a different class for negative pairs, and we repeat until we have reached the required number of constraints. Note that for now, Supervised classes sample n_constraints positive pairs and n_constraints negative pairs.
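
For illustration, here is a minimal sketch of that sampling scheme in plain numpy (a hypothetical helper, not the library's actual code; it also ignores the duplicate handling discussed below, and assumes every class has at least two points):

import numpy as np

def sample_pairs_sketch(X, y, n_constraints, rng=np.random):
    # for each constraint: pick an anchor point, then a partner from the
    # same class (positive pair) or from a different class (negative pair)
    positive, negative = [], []
    for _ in range(n_constraints):
        i = rng.randint(len(y))
        same = np.where(y == y[i])[0]
        same = same[same != i]  # never pair a point with itself
        diff = np.where(y != y[i])[0]
        positive.append((i, int(rng.choice(same))))
        negative.append((i, int(rng.choice(diff))))
    return positive, negative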

Thanks for your reply. So the upper limit for the number of constraints is 2 * min(n_positive_examples, n_negative_examples), with the numbers of positive and negative pairs being equal?

In fact I didn't mention it, but the _pairs function ensures that no duplicated pairs (pairs with the same order) are returned. It does this by not adding duplicated pairs to the result, doing at most max_iter passes through X to try to find n_constraints non-duplicated pairs. If even after max_iter passes there are fewer than n_constraints pairs, it returns the pairs it did find, with a warning.
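
To make the retry logic concrete, here is a rough sketch of that loop for positive pairs (again a hypothetical rewrite for illustration, not the actual _pairs code):

import warnings
import numpy as np

def positive_pairs_sketch(X, y, n_constraints, max_iter=10, rng=np.random):
    # a set silently drops duplicated (same-order) pairs; we keep sampling
    # for at most max_iter passes, then warn if we still fall short
    pairs = set()
    for _ in range(max_iter):
        if len(pairs) >= n_constraints:
            break
        for _ in range(n_constraints - len(pairs)):
            i = rng.randint(len(y))
            same = np.where(y == y[i])[0]
            same = same[same != i]
            if len(same):
                pairs.add((i, int(rng.choice(same))))
    if len(pairs) < n_constraints:
        warnings.warn('only %d pairs could be built instead of %d'
                      % (len(pairs), n_constraints))
    return list(pairs)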

In this case, the same_length argument forces positive_negative_pairs to return the same number of positive and negative pairs.

So to sum up: if no warning is thrown, Constraints.positive_negative_pairs has returned n_constraints positive pairs and n_constraints negative pairs.
If a warning is thrown, then either you have set same_length=True, and the method returns min(positive_pairs_built, negative_pairs_built) positive pairs and the same number of negative pairs; or same_length=False, and the method returns positive_pairs_built positive pairs and negative_pairs_built negative pairs, which may differ.
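
If you want to check this behaviour programmatically, something like the following should work (hedged: this assumes the metric-learn version discussed in this thread, where Constraints is built from the labels and positive_negative_pairs returns the positive pair indices (a, b) and negative pair indices (c, d)):

import warnings
import numpy as np
from metric_learn.constraints import Constraints

X = np.random.randn(30, 3)
y = np.random.randint(0, 2, 30)

constraints = Constraints(y)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # a, b index the positive pairs; c, d index the negative pairs
    a, b, c, d = constraints.positive_negative_pairs(num_constraints=500)
if caught:
    print('fell short: %d positive, %d negative pairs' % (len(a), len(c)))
else:
    print('got the requested 500 positive and 500 negative pairs')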

I agree that for now this is not very well documented; pairs construction is something we will definitely try to simplify and improve later on.

Thanks for your clarifications. I am using the MMC_Supervised class at this point, and I do not believe there is a way to set the same_length argument, is there? (It is False by default.)

Indeed, MMC_Supervised calls Constraints.positive_negative_pairs with its default argument same_length=False, so there is no way to set it to True from the MMC_Supervised interface for now.

Does MMC_Supervised throw a warning, though? If not, that means the numbers of positive and negative pairs built are the same and equal to n_constraints.

Yes, I am using the class with different datasets in a loop. I set num_constraints to the maximum I can handle given the amount of RAM I have. For larger datasets this is not a problem, but for some smaller datasets it throws a warning.

I see, that makes sense: for small datasets the algorithm cannot create a very large number of constraints without duplicates... If you want the same number of positive and negative constraints, I guess for now you could override the default by monkey-patching the method, something like this:

import copy
import numpy as np
from metric_learn.constraints import Constraints

# keep a reference to the original method before overriding it
new_func_bis = copy.copy(Constraints.positive_negative_pairs)

def new_func(self, num_constraints, random_state=np.random):
    # delegate to the original method, but force same_length=True
    return new_func_bis(self, num_constraints=num_constraints,
                        same_length=True, random_state=random_state)

Constraints.positive_negative_pairs = new_func
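
With that patch applied, a supervised learner should then build balanced pairs; for instance (assuming the num_constraints keyword of the version discussed in this thread):

import numpy as np
from metric_learn import MMC_Supervised

# X, y as in your own pipeline; the patched method is picked up here
X = np.random.randn(100, 4)
y = np.random.randint(0, 3, 100)
MMC_Supervised(num_constraints=50).fit(X, y)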

But this is kind of hacky... Alternatively, you could fork/clone the repo and change the default value of same_length in Constraints.positive_negative_pairs to True.

Let's leave this issue open as a reminder that, in the future, it could be good to allow setting same_length=True when creating a metric learner.

Thanks for laying out the alternatives. I feel that, in the long run, forking the repo will be the most sensible thing to do.