Is task-agnostic accuracy correct?
ashok-arjun opened this issue · 2 comments
Hi,
I'm not sure if the task-agnostic accuracy is correctly calculated.
I see that in network.py line 44, you are creating a new Linear layer with nn.Linear(self.out_size, num_outputs) for each task. This indicates that each task will have a separate Linear layer, each with equal capacity.
For task-agnostic accuracy, I see that the outputs of all heads are combined, then softmax is taken.
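As a rough sketch of how I understand that step (the tensors and sizes below are placeholders I made up, not values from the repository):

import torch

# Placeholder logits from two already-learned heads: 30 classes (task 1) and
# 20 classes (task 2), for a batch of 8 samples.
head1_logits = torch.rand(8, 30)
head2_logits = torch.rand(8, 20)

# Task-agnostic prediction: concatenate all heads and take the softmax/argmax
# over every class seen so far, without using the task identity.
all_logits = torch.cat((head1_logits, head2_logits), dim=1)  # shape (8, 50)
probs = torch.softmax(all_logits, dim=1)
pred = probs.argmax(dim=1)  # class ids in [0, 50)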
But many papers use only one head for all tasks when reporting task-agnostic accuracy.
I am aware that during training they only backpropagate the logits of the classes belonging to the current task, and that what is done here is equivalent to that in a way.
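For reference, this is the kind of single-head setup I mean (only a rough sketch with made-up sizes and class ranges, not code from any particular paper):

import torch
import torch.nn as nn

total_classes = 50                    # assumed known in advance for the single-head case
head = nn.Linear(64, total_classes)   # one head shared by all tasks

# While training task 2 (say, classes 30..49), only the logits of the current
# task's classes enter the loss, so only those rows of the head get nonzero gradients.
feats = torch.rand(8, 64)              # placeholder backbone features
targets = torch.randint(30, 50, (8,))  # labels from the current task
current_logits = head(feats)[:, 30:50]  # keep only current-task logits
loss = nn.functional.cross_entropy(current_logits, targets - 30)  # shift labels to [0, 20)
loss.backward()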
May I please know why that is not followed here?
Thank you.
Hi, thanks for your question!
Having a single nn.Linear with all the classes of all tasks and having multiple nn.Linear layers whose outputs are then concatenated are equivalent implementations:
import torch

fcbig = torch.nn.Linear(64, 50)
fcsmall1 = torch.nn.Linear(64, 30)
fcsmall2 = torch.nn.Linear(64, 20)

# Make weight and bias values equal for both cases
with torch.no_grad():
    fcsmall1.weight.copy_(fcbig.weight[:30, :])
    fcsmall1.bias.copy_(fcbig.bias[:30])
    fcsmall2.weight.copy_(fcbig.weight[30:, :])
    fcsmall2.bias.copy_(fcbig.bias[30:])

# Forward
x = torch.rand(64)
y1 = fcbig(x)
y2 = torch.cat((fcsmall1(x), fcsmall2(x)))
print((y1 == y2).all())
You should see tensor(True) as the output when running this code.
As you say, many papers use the single nn.Linear implementation for all tasks from the beginning. However, in order to do that, you would need to know the number of classes of each task beforehand, when starting task 1. In other words, you would have access to information from the "future", which breaks part of the idea of Continual Learning (although in practice both give the same results).
For this reason, we preferred the second implementation because it mimics a more realistic case: whenever a new task arrives, the final layer is extended to allow learning the new classes, without caring about how many tasks or classes the model will see in the future.
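For instance, the growth could look roughly like this (a hypothetical sketch, not the exact code in network.py):

import torch
import torch.nn as nn

heads = nn.ModuleList()  # starts empty: nothing about future tasks is assumed

def add_head(num_outputs, out_size=64):
    # Whenever a new task arrives, extend the final layer with a new nn.Linear
    # sized for exactly the classes that task brings.
    heads.append(nn.Linear(out_size, num_outputs))

add_head(30)  # task 1 arrives with 30 classes
add_head(20)  # task 2 arrives with 20 classes

feats = torch.rand(8, 64)
all_logits = torch.cat([h(feats) for h in heads], dim=1)  # (8, 50) after two tasks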
I hope I have made myself clear!
Thanks a lot for the detailed reply @mkmenta.
That definitely makes sense, and I understand why this implementation is better and more scalable.