Is task-agnostic accuracy correct?
ashok-arjun opened this issue · 2 comments
Hi,
I'm not sure if the task-agnostic accuracy is correctly calculated.
I see that in network.py line 44, you are creating a new Linear layer with nn.Linear(self.out_size, num_outputs) for each task. This indicates that each task will have a separate Linear layer, each with equal capacity.
For task-agnostic accuracy, I see that the outputs of all heads are combined, then softmax is taken.
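As a rough sketch of how I understand that step (the tensors and sizes below are placeholders I made up, not values from the repository):

import torch

# Placeholder logits from two already-learned heads: 30 classes (task 1) and
# 20 classes (task 2), for a batch of 8 samples.
head1_logits = torch.rand(8, 30)
head2_logits = torch.rand(8, 20)

# Task-agnostic prediction: concatenate all heads and take the softmax/argmax
# over every class seen so far, without using the task identity.
all_logits = torch.cat((head1_logits, head2_logits), dim=1)  # shape (8, 50)
probs = torch.softmax(all_logits, dim=1)
pred = probs.argmax(dim=1)  # class ids in [0, 50)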
But many papers use only one head for all tasks when reporting task-agnostic accuracy.
I am aware that during training they only backpropagate the logits of the classes belonging to the current task, and that what is done here is equivalent to that in a way.
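For reference, this is the kind of single-head setup I mean (only a rough sketch with made-up sizes and class ranges, not code from any particular paper):

import torch
import torch.nn as nn

total_classes = 50                    # assumed known in advance for the single-head case
head = nn.Linear(64, total_classes)   # one head shared by all tasks

# While training task 2 (say, classes 30..49), only the logits of the current
# task's classes enter the loss, so only those rows of the head get nonzero gradients.
feats = torch.rand(8, 64)              # placeholder backbone features
targets = torch.randint(30, 50, (8,))  # labels from the current task
current_logits = head(feats)[:, 30:50]  # keep only current-task logits
loss = nn.functional.cross_entropy(current_logits, targets - 30)  # shift labels to [0, 20)
loss.backward()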
May I please know why that is not followed here?
Thank you.
Hi, thanks for your question!
Having a single nn.Linear with all the classes of all tasks and having multiple nn.Linear layers whose outputs are then concatenated are equivalent implementations:
import torch

fcbig = torch.nn.Linear(64, 50)
fcsmall1 = torch.nn.Linear(64, 30)
fcsmall2 = torch.nn.Linear(64, 20)

# Make weight and bias values equal for both cases
with torch.no_grad():
    fcsmall1.weight.copy_(fcbig.weight[:30, :])
    fcsmall1.bias.copy_(fcbig.bias[:30])
    fcsmall2.weight.copy_(fcbig.weight[30:, :])
    fcsmall2.bias.copy_(fcbig.bias[30:])

# Forward
x = torch.rand(64)
y1 = fcbig(x)
y2 = torch.cat((fcsmall1(x), fcsmall2(x)))
print((y1 == y2).all())
You should see tensor(True) as the output when running this code.
As you say, many papers use the single nn.Linear implementation for all tasks from the beginning. However, in order to do that, you would need to know the number of classes of each task beforehand, when starting task 1. In other words, you would have access to information from the "future", which breaks part of the idea of Continual Learning (although in practice both give the same results).
For this reason, we preferred the second implementation because it mimics a more realistic case: whenever a new task arrives, the final layer is extended to allow learning the new classes, without caring about how many tasks or classes the model will see in the future.
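For instance, the growth could look roughly like this (a hypothetical sketch, not the exact code in network.py):

import torch
import torch.nn as nn

heads = nn.ModuleList()  # starts empty: nothing about future tasks is assumed

def add_head(num_outputs, out_size=64):
    # Whenever a new task arrives, extend the final layer with a new nn.Linear
    # sized for exactly the classes that task brings.
    heads.append(nn.Linear(out_size, num_outputs))

add_head(30)  # task 1 arrives with 30 classes
add_head(20)  # task 2 arrives with 20 classes

feats = torch.rand(8, 64)
all_logits = torch.cat([h(feats) for h in heads], dim=1)  # (8, 50) after two tasks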
I hope I have made myself clear!
Thanks a lot for the detailed reply @mkmenta.
That definitely makes sense, and I understand why this implementation is better and more scalable.