Specifying multiple query strategies

Question

HannahKirk opened this issue 3 years ago · 3 comments

When initialising a PoolBasedActiveLearner as active_learner and then calling active_learner.query(num_samples=20), is it possible to specify more than one query strategy, i.e. select 5 examples by PredictionEntropy(), 5 by EmbeddingKMeans(), 5 by RandomSampling(), etc.?

I can initialise a new active learner object with a different query strategy for each sub-query but it would be great if you could specify multiple query strategies for the active learner.

Answer 1 · 2022-02-03T12:27:00.000Z

I am unsure if I understood this correctly. What is the desired outcome?

My understanding so far is that you want to build an "ensemble" strategy:
Starting from a query (active_learner.query(num_samples=20)) we delegate the task of drawing samples to several other strategies. In this case, for example, this works if you have exactly 4 strategies where each of them draws 5 samples. The first strategy would select 5 examples and move them to the labeled pool. This means that the second strategy operates on the reduced unlabeled pool, i.e. the order of the strategies would matter. Moreover, what should happen if the query_size is not divisible without remainder by the number of strategies?

Or is this an experiment setup that I do not get?

Answer 2 · 2022-02-03T13:20:12.000Z

Hello :)
I do indeed want to build an ensemble strategy. In my use case, the total number of queried samples will always be divisible by 4, so setting the sub-query size to int(num_samples/4) will work. I guess if it's not divisible by 4, the final query strategy would take num_samples - len(labelled_indices), or something similar, to query the remainder of the total query.
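One way to handle a query size that is not divisible by the number of strategies is to give every strategy the same base share and let the last one absorb the remainder. This is only a minimal sketch: the helper name split_query_size is hypothetical and not part of small-text.

```python
def split_query_size(num_samples, num_strategies):
    """Split num_samples into one sub-query size per strategy.

    Every strategy gets the same base share; the last one also takes
    whatever remainder is left over.
    """
    if num_strategies <= 0:
        raise ValueError('num_strategies must be positive')
    base, remainder = divmod(num_samples, num_strategies)
    sizes = [base] * num_strategies
    sizes[-1] += remainder
    return sizes


print(split_query_size(20, 4))  # [5, 5, 5, 5]
print(split_query_size(22, 4))  # [5, 5, 5, 7]
```

The sizes always sum to num_samples, so the total query size stays the same regardless of how many strategies are combined.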

In terms of the order mattering, I would like each query to select from the pool, and then for the selected examples to be removed, which I can do with:

q_indices = active_learner.query(num_samples=int(num_samples/4))
# Simulate user interaction here. Replace this for real-world usage.
y = train.y[q_indices]
# Return the label for the current query to the active learner.
active_learner.update(y)

What I am unsure about is how I would run this loop for multiple query strategies with the same classifier and the same initial pool. The pool is very large (1 million items), so each query strategy should still have plenty of unlabelled examples to pick from, even if it were, say, 4th in the query order.

Thanks so much for your help :)

Answer 3 · 2022-02-03T14:09:20.000Z

Thanks for the clarification :). Now I think I do understand.

You could just iterate over the strategies and call query() / update() for each (as in your code above), but then you would also alter the model, which is what you don't want, right?

In this case I would solve this at the query strategy level. Just encapsulate this into a new query strategy:

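import numpy as np
# Base class from small-text; the import path may differ between versions.
from small_text.query_strategies import QueryStrategy


class QueryStrategyEnsemble(QueryStrategy):

    def __init__(self, strategies):
        self.strategies = strategies

    def query(self, clf, x, x_indices_unlabeled, x_indices_labeled, y, n=10):

        if n % len(self.strategies) != 0:
            raise ValueError('Cannot evenly distribute the query between sub strategies')

        sub_query_size = n // len(self.strategies)

        # Work on copies so that the caller's index arrays stay untouched.
        x_indices_unlabeled_copy = np.copy(x_indices_unlabeled)
        x_indices_labeled_copy = np.copy(x_indices_labeled)

        indices = np.empty(shape=0, dtype=int)
        for strategy in self.strategies:
            # Assumption: each sub strategy returns indices relative to x, i.e. values
            # taken from x_indices_unlabeled_copy, not positions within that array.
            sub_query_indices = strategy.query(clf, x, x_indices_unlabeled_copy, x_indices_labeled_copy,
                                               y, n=sub_query_size)
            # Hide the picked samples from the remaining strategies.
            still_unlabeled = ~np.isin(x_indices_unlabeled_copy, sub_query_indices)
            x_indices_unlabeled_copy = x_indices_unlabeled_copy[still_unlabeled]
            x_indices_labeled_copy = np.append(x_indices_labeled_copy, sub_query_indices)
            indices = np.append(indices, sub_query_indices)

        return indices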

The critical thing here is that you have to manage the unlabeled (edit: and labeled) indices yourself, so that a (sub) query strategy does not "see" the samples which have already been picked by previous strategies.
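The bookkeeping itself can be illustrated with plain NumPy and without small-text. This is a hedged sketch: pick_first and pick_random are hypothetical toy stand-ins for query strategies, and they are assumed to return index values drawn from the unlabeled array rather than positions within it.

```python
import numpy as np

rng = np.random.default_rng(0)


def pick_first(unlabeled, n):
    # Toy stand-in for a query strategy: takes the first n unlabeled indices.
    return unlabeled[:n]


def pick_random(unlabeled, n):
    # Toy stand-in for a query strategy: draws n unlabeled indices at random.
    return rng.choice(unlabeled, size=n, replace=False)


unlabeled = np.arange(10)
labeled = np.empty(shape=0, dtype=int)
selected = np.empty(shape=0, dtype=int)

for pick in (pick_first, pick_random):
    picked = pick(unlabeled, 3)
    # Remove the picked indices by value so the next strategy cannot see them.
    unlabeled = unlabeled[~np.isin(unlabeled, picked)]
    labeled = np.append(labeled, picked)
    selected = np.append(selected, picked)

# No index is selected twice, and none of them remains in the unlabeled pool.
assert len(np.unique(selected)) == len(selected)
assert not np.isin(selected, unlabeled).any()
```

Removing by value (np.isin) rather than by position (np.delete) matters here, because np.delete interprets its second argument as positions in the array, which no longer coincide with the index values once the pool has shrunk.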

Disclaimer: Untested. I did something similar before with two strategies, which I quickly adapted to the code above. Feel free to use it if it works.