Couple questions about classification loss
zerodecoder1 opened this issue · 9 comments
Hi @rohitgirdhar,
Thanks for your great work -- I found it very interesting and plan to use it in my own work! I was hoping to clear up exactly how the loss functions interact with the feature decoding, since I was a little confused:
The decoder at each timestep from 1..t outputs features (in a causal manner), which are then passed through a linear layer to obtain predicted frame features. Another linear layer on top of this then predicts a distribution over action classes, so we have t action predictions. Does the prediction for timestep 1 use the action label from timestep 2? As I understand it, the prediction at timestep t represents the action at timestep t+1 (the next action we want to anticipate). Based on the implementation, I was wondering whether the classification loss is also computed against the next frame's labels, so that the first frame's label is never used? I've sketched my understanding below. Sorry if this is confusing, hope you can help clear things up!
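Here is a rough sketch of how I picture the setup (all names and shapes here are my own guesses, not from the AVT code):

```python
import torch
import torch.nn as nn

B, T, D, C = 2, 8, 512, 100            # batch, timesteps, feature dim, classes
frame_feats = torch.randn(B, T, D)     # backbone features for frames 1..T

decoder_layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
to_frame_feat = nn.Linear(D, D)        # maps decoder output to a predicted frame feature
classifier = nn.Linear(D, C)           # predicts a distribution over action classes

# Causal mask: position t can only attend to positions <= t.
causal_mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
decoded = decoder_layer(frame_feats, src_mask=causal_mask)
pred_next_feats = to_frame_feat(decoded)     # t predicted frame features
pred_logits = classifier(pred_next_feats)    # t action predictions

# My question: is pred_logits[:, 0] supervised with the label of frame 2, and
# pred_logits[:, -1] with the (anticipated) label of frame T+1, so that the
# label of frame 1 is never used?
```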
Hi @zerodecoder1
Thanks for your interest and kind words!
Yes, the model at time "t" tries to predict the action at "t+1". The feature regression loss is set up here:
AVT/models/future_prediction.py
Lines 212 to 214 in 2d6781d
Regarding the classification loss, the model returns the past features here, which contain the initial features from the model and the predicted future:
AVT/models/future_prediction.py
Lines 249 to 250 in 2d6781d
which are marked as "past" and passed through a classifier:
Line 201 in 2d6781d
and then I incur the loss with the true labels for these frames.
So I don't use the predicted future features to predict class labels; however, since I incur a feature regression loss on the predicted future and classify the actual intermediate features, it effectively forces the predicted future features to also be classifiable into the intermediate action classes.
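Roughly, the two losses look like this (a simplified sketch in my own notation, not the exact code from future_prediction.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, D, C = 2, 8, 512, 100
classifier = nn.Linear(D, C)            # single shared classification layer

feats = torch.randn(B, T, D)            # actual intermediate ("past") features
pred_feats = torch.randn(B, T, D)       # decoder outputs; step t predicts step t+1
labels = torch.randint(0, C, (B, T))    # action labels for frames 1..T

# 1) Feature regression: the prediction at step t should match the true
#    feature at step t+1.
reg_loss = F.mse_loss(pred_feats[:, :-1], feats[:, 1:])

# 2) Classification: the actual "past" features are classified against their
#    own labels; the predicted futures are not directly classified.
cls_loss = F.cross_entropy(classifier(feats).flatten(0, 1), labels.flatten())
```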
Thanks so much for the detailed response @rohitgirdhar. This clears it up -- just to confirm my understanding, the classification loss is taken on the features output by AVT-B (which are also 'past')?
Also, I had another quick question regarding some experiments in your paper. Are all models with AVT-B trained end-to-end (e.g. in Table 4)? Do these models also use the loss function with the additional terms (feature regression loss / recognition loss)? From your code, I'm guessing the results with other backbones such as TSN/irCSN are obtained without end-to-end training?
Thanks so much!
Yes, that is correct. The "past" features are used to predict the "past" action classes. The future ones could also have been used to predict the future action classes (the past action classes right-shifted by 1 for correspondence), though I don't explore that in this work.
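For instance, reusing the variables from the sketch above, that hypothetical future-classification loss would just be a right shift of the labels (again, not something I use in the paper):

```python
# Hypothetical: supervise the prediction at step t with the label at t+1.
future_logits = classifier(pred_feats[:, :-1])   # predictions for steps 2..T
future_labels = labels[:, 1:]                    # labels right-shifted by 1
future_cls_loss = F.cross_entropy(
    future_logits.flatten(0, 1), future_labels.flatten())
```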
Yes, correct; AVT-B is trained end-to-end with all the losses (except in the ablations where I evaluate the effect of individual losses). The TSN/irCSN backbones are kept fixed and only the head is trained (similar to prior work like RULSTM, etc.).
Got it!
Are the AVT-H models trained with TSN/irCSN also trained with the other losses, and are all models trained with a 10-second past horizon? Thanks!
Yes correct.
Got it -- regarding the recognition loss for TSN, for example: since it is taken on the past features provided by the TSN backbone, and TSN is not trained end-to-end, does this loss have any effect on the model weights?
Right, it doesn't change the backbone; but the classifier is applied to the backbone features to get the distribution over the classes, and those classifier weights do get updated by that loss.
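In other words, something like this (an illustrative sketch with stand-in modules, not the AVT config):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Linear(2048, 512)   # stand-in for a fixed TSN/irCSN feature extractor
classifier = nn.Linear(512, 100)  # trainable classification head

for p in backbone.parameters():
    p.requires_grad = False       # backbone stays fixed

optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)

x = torch.randn(4, 2048)
labels = torch.randint(0, 100, (4,))
loss = F.cross_entropy(classifier(backbone(x)), labels)
loss.backward()                   # gradients reach the classifier, stop at the backbone
optimizer.step()
```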
Would this update the recognition classifier weights? Is this classifier different from the anticipation classification head, or is the same classifier head (MLP) used for both recognition and anticipation? Thanks!
It is the same classification layer that decodes any feature (past or future) into the classification logits:
Line 206 in 2d6781d