happyharrycn/actionformer_release

Possible to get rid of regression head?

bhosalems opened this issue · 4 comments

Bear with me for opening an issue to ask for suggestions; I didn't see a discussion tab in the repository.

For my task, I only need per-frame classification labels, but in your method the computation of the class label for each frame is tightly coupled with the segment ranges. For example, label_points_single_video() takes in gt_class (N) and gives back cls_targets (T * C), the classification labels for every FPN-level point, where N is the number of events/actions, T is the total number of points across all FPN levels, and C is the total number of classes.

Given ground truth without ranges, I was thinking of adding an arbitrary range, e.g. converting the gt class label c at time t from [c] to [c, t-delta, t+delta]. I would then get rid of the regression head and the regression loss. I am not certain, though, whether this would work correctly with all the handling of FPN levels in the ground truth for supervision, and later at inference. Would this be reasonable to do? What do you think?
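For concreteness, here is a minimal sketch of the [c] -> [c, t-delta, t+delta] conversion described above. It is not part of this repo: the function name point_labels_to_segments, the (time, class_id) input layout, and the "label"/"segment" output keys are hypothetical, chosen only for illustration. It also assumes the pseudo-segments should be clamped to the video duration.

```python
def point_labels_to_segments(point_labels, delta, duration):
    """Turn (time, class_id) point annotations into fixed-width pseudo-segments.

    Hypothetical helper for illustration only; not the repo's annotation format.
    """
    segments = []
    for t, class_id in point_labels:
        # Widen the point by delta on each side, clamped to [0, duration].
        start = max(0.0, t - delta)
        end = min(duration, t + delta)
        segments.append({"label": class_id, "segment": (start, end)})
    return segments


# Two point labels in a 10.5-second video, widened by one second per side.
example = point_labels_to_segments([(0.5, 3), (10.0, 1)], delta=1.0, duration=10.5)
```

Note that delta then acts as a hyperparameter: pseudo-segments from nearby points can overlap, and how they are assigned to FPN levels would depend on their chosen width.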

The task you have described (labeling every frame) is referred to as action segmentation. This is distinct from action localization (as addressed in this repo). If we take 2D images as an analogue, action segmentation is akin to semantic segmentation, whereas action localization is like object detection. The key difference between the two lies in the identification of individual instances.

These two tasks are indeed related, yet they employ different sets of methods. While it is possible to re-purpose this repo for action segmentation, I'd recommend using methods designed for that task.

Thanks for the input.
Isn't action segmentation just instance segmentation in the 2d image world?

This is not true. Let us construct the following example. Say the input video contains two actors, A and B. Actor A is performing action 1 from time step 1-4, actor B is performing the same action 1 from time step 2-5.

  • Action segmentation seeks to label time steps 1-5 as action 1 (think about semantic segmentation in 2D).
  • Action localization aims at identifying two instances of action 1, one from time step 1-4 and the other from 2-5 (think about instance segmentation in 2D).
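The two-actor example above can be sketched in a few lines of Python. This is only an illustration with hypothetical helper names (instances_to_frame_labels, frame_labels_to_runs), assuming 1-indexed inclusive time steps and class 0 as background. It shows that once instances are flattened into per-step labels, the two overlapping instances cannot be recovered.

```python
def instances_to_frame_labels(instances, num_steps, background=0):
    """Segmentation view: one label per time step.

    instances: list of (class_id, start, end), 1-indexed and inclusive.
    If instances of different classes overlap, the later one overwrites.
    """
    labels = [background] * num_steps
    for class_id, start, end in instances:
        for t in range(start, end + 1):
            labels[t - 1] = class_id
    return labels


def frame_labels_to_runs(labels, background=0):
    """Collapse per-step labels into maximal runs of (class_id, start, end)."""
    runs = []
    run_start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the sequence or on a label change.
        if i == len(labels) or labels[i] != labels[run_start]:
            if labels[run_start] != background:
                runs.append((labels[run_start], run_start + 1, i))
            run_start = i
    return runs


# Localization view: actor A does action 1 in steps 1-4, actor B in steps 2-5.
instances = [(1, 1, 4), (1, 2, 5)]

frame_labels = instances_to_frame_labels(instances, num_steps=6)
recovered = frame_labels_to_runs(frame_labels)
```

Here frame_labels is [1, 1, 1, 1, 1, 0] and recovered is [(1, 1, 5)]: a single run covering steps 1-5 rather than the two original instances.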

Got it. Thanks.