How does limiting the constraint generation work?
Description
I am not sure how the constraint creation works. Indeed, if I limit the number of constraints, will the Supervised class remove examples from the similar pairs, from the negative pairs, or will it arbitrarily cut off the part of the data that comes after num_constraints?
The pair generation process is as follows: we first sample one point x from the dataset X, then we sample another point of X from the same class as x (with the same y) for similar pairs, or from a different class for negative pairs, and repeat until we have reached the requested number of constraints. Note that for now, for Supervised classes we sample n_constraints positive pairs and n_constraints negative pairs.
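For intuition, here is a minimal sketch of that sampling loop (illustrative only, not the library's actual code; the helper sample_pairs and its signature are made up, and y is assumed to be a NumPy array of labels):

import numpy as np

def sample_pairs(X, y, n_constraints, same_class, random_state=np.random):
    # Draw index pairs (a, b) with y[a] == y[b] for similar pairs,
    # or y[a] != y[b] for negative pairs, until n_constraints are collected.
    pairs = []
    while len(pairs) < n_constraints:
        a = random_state.randint(len(X))
        mask = (y == y[a]) if same_class else (y != y[a])
        candidates = np.where(mask)[0]
        candidates = candidates[candidates != a]  # never pair a point with itself
        if len(candidates) > 0:
            b = random_state.choice(candidates)
            pairs.append((a, b))
    return pairs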
Thanks for your reply. So the upper limit for the number of constraints is now 2 * min(n_positive_examples, n_negative_examples), with the number of positive and negative examples being equal?
In fact I didn't say it, but the _pairs function ensures that no duplicated pairs (pairs with the same order) are returned. It does this by not adding duplicated pairs to the result, doing at most max_iter passes through X to try to find n_constraints non-duplicated pairs. But if even after max_iter passes there are fewer than n_constraints pairs to return, it will return the ones it found, with a warning.
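Roughly, the deduplication works like this (again just a sketch with made-up names, not the actual implementation; sampler stands for one sampling pass as sketched above):

import warnings

def build_unique_pairs(sampler, n_constraints, max_iter=10):
    # A set keeps only distinct (a, b) tuples, so duplicated pairs
    # (same pair in the same order) are dropped automatically.
    pairs = set()
    for _ in range(max_iter):
        pairs.update(sampler(n_constraints))
        if len(pairs) >= n_constraints:
            return list(pairs)[:n_constraints]
    warnings.warn('Only %d pairs could be built out of %d requested.'
                  % (len(pairs), n_constraints))
    return list(pairs)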
In this case, the same_length argument allows forcing positive_negative_pairs to return the same number of positive and negative pairs.
So to sum up: if you see no warning thrown, Constraints.positive_negative_pairs has returned n_constraints positive pairs and n_constraints negative pairs. But if a warning is thrown, either you have set the flag same_length=True and the method has returned min(positive_pairs_built, negative_pairs_built) positive pairs and as many negative pairs, or, if same_length=False, the method has returned positive_pairs_built positive pairs and negative_pairs_built negative pairs, with positive_pairs_built and negative_pairs_built potentially different. For instance, asking for 500 constraints when only 300 distinct positive pairs and 450 distinct negative pairs can be built yields 300 of each with same_length=True, but 300 positive and 450 negative pairs with same_length=False.
I agree that for now this is not very well documented; pair construction is something we will definitely try to simplify and improve later on.
Thanks for your clarifications. I am using the MMC_Supervised class at this point and I do not believe there is a way to set the same_length argument, is there (it is False by default)?
Indeed, MMC_Supervised calls Constraints.positive_negative_pairs with its default argument same_length=False, so there is no way to set it to True from the MMC_Supervised interface for now.
Does MMC_Supervised throw you a warning though? Because if not, this means that the numbers of positive and negative pairs built are the same and equal to n_constraints.
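If you want to detect that situation programmatically rather than by watching the console, you could record warnings around the fit, e.g. (a sketch; X and y stand for your data and labels, and num_constraints=200 is an arbitrary value):

import warnings
from metric_learn import MMC_Supervised

mmc = MMC_Supervised(num_constraints=200)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')  # make sure no warning is suppressed
    mmc.fit(X, y)
if caught:  # any warning raised during fit ends up here
    print('fewer pairs than requested:', caught[0].message)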
Yes, I am using the class with different datasets in a loop. I set num_constraints to the maximum I can handle given the amount of RAM I have. For larger datasets this is not a problem, but for some smaller datasets it throws a warning.
I see, it makes sense indeed, since for small datasets the algorithm cannot create a large number of constraints without duplicates... If you want to have the same number of positive and negative constraints, I guess for now you could modify the default by overwriting the method with something like this:
import copy
import numpy as np
from metric_learn.constraints import Constraints

# Keep a copy of the original method, then replace it with a wrapper
# that always forces same_length=True.
new_func_bis = copy.copy(Constraints.positive_negative_pairs)

def new_func(self, num_constraints, random_state=np.random):
    return new_func_bis(self, num_constraints=num_constraints,
                        same_length=True, random_state=random_state)

Constraints.positive_negative_pairs = new_func
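Note that Python looks positive_negative_pairs up on the Constraints class at call time, so it is enough to apply this patch before calling fit: every MMC_Supervised instance (and any other learner that goes through this method) fitted afterwards will build equal-sized positive and negative pair sets.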
But this is kind of hacky... Or you could fork/clone the repo and change the default value of same_length in Constraints.positive_negative_pairs to True.
Let's leave this issue open to remember that in the future it could be good to allow setting same_length=True when creating a metric learner.
Thanks for laying out the alternatives. I feel that, in the long run, forking the repo will be the most sensible thing to do.