Couple questions (video length, HIERA)
smandava98 opened this issue · 6 comments
Hi @brjathu .
Great work! I had some questions regarding video input requirements/processing and Hiera:
- In the paper, it says you take a sequence of 12 frames for prediction (correct me if I am wrong). What is the fps of the training videos? I am confused about how the video processing works, as I want to see if it can take in long videos of people performing multiple actions and correlate them over time (~1k+ frames). Does LART have context for previous batches of frames, or does it attend to one batch at a time?
- I saw in a previous issue someone asking for the Hiera backbone. I see it is available here now: https://github.com/facebookresearch/hiera/tree/main
Will it be added to this repo soon as well?
Hi @smandava98, thanks for your interest in our work!
a) The context length of the transformer is 125 (see Line 116 at commit 829aaae).
b) Yes, they have released it as separate code, but our code requires it to be in the slowfast repo. I have contacted the authors and they said it will be released in the slowfast repo soon.
Hope this helps, feel free to ask if you have any questions.
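For intuition about what that context length covers in wall-clock time, here is a back-of-the-envelope sketch; the 30 fps is an assumed frame rate for illustration, not something fixed by the model:

```python
# Back-of-the-envelope arithmetic (30 fps is an assumed frame rate, not from the repo).
CONTEXT_LEN_FRAMES = 125   # transformer context length mentioned above
ASSUMED_FPS = 30           # typical video frame rate, purely illustrative

print(f"~{CONTEXT_LEN_FRAMES / ASSUMED_FPS:.1f} s of video per window")  # ~4.2 s
```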
Thanks for the response @brjathu! I have a couple of follow-up questions, as I am still struggling to understand how it takes in appearance features and poses together to predict actions:
Say I load in a video with an arbitrary number of frames (5 min YouTube video).
- Does LART load the full video into memory or just 125 frames at a time? I know PHALP does frame-by-frame tracking, but I am trying to wrap my head around how this handles a long video with multiple people performing dependent actions (e.g., a speaker talking to someone who is listening), which requires analyzing multiple frames together.
- When it does the pooling for classification based on the 12 frames, how exactly does it take into account the context from previous batches of frames (the last 125 frames, I am assuming)?
I read through your paper and I understand mathematically but don't quite understand from a code level perspective yet.
Thanks again!
At the code level, the model first runs tracking and stores the tracks. Then, for each track, it runs LART as a moving window; unfortunately, that code is in the PHALP repo, and I need to fix this! (https://github.com/brjathu/PHALP/blob/f34e5277e76a5f32aa4b826853fa5fba3830f7e7/phalp/models/predictor/pose_transformer_v2.py#L530)
The 12-frame pooling was used in the LART eval code but is not implemented in the demo code (https://github.com/brjathu/PHALP/blob/f34e5277e76a5f32aa4b826853fa5fba3830f7e7/phalp/models/predictor/pose_transformer_v2.py#L360). Right now, there is no context shared between two moving windows. The AVA actions are very atomic and only require a very local context of ~3 seconds, but ideally having a longer context length would be nice!
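For anyone trying to map this onto code, here is a minimal sketch of that moving-window idea (not the actual PHALP/LART implementation linked above; `lart_model`, the tensor shapes, and the per-track token layout are all assumptions for illustration):

```python
import torch

def predict_actions_for_track(lart_model, track_tokens, window=125, stride=125):
    """Run an action model over one stored track with a moving window.

    track_tokens: (T, D) per-frame features for a single person track.
    Windows are processed independently, i.e. no context is carried
    across them, matching the behavior described above.
    """
    per_frame_logits = []
    for start in range(0, track_tokens.shape[0], stride):
        chunk = track_tokens[start:start + window]            # (<=window, D)
        logits = lart_model(chunk.unsqueeze(0)).squeeze(0)    # (len(chunk), num_actions)
        per_frame_logits.append(logits)
    return torch.cat(per_frame_logits, dim=0)                 # (T, num_actions)
```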
Thank you @brjathu! It seems like with the context length and the AVA dataset, the frame rate is assumed to be very low (1-2 fps). What if the video is shot at a very high frame rate (so the action spans 50+ frames)? Could the frame pooling be easily extended beyond 12 frames, or would that quickly lead to an OOM issue?
The transformer attends over 120 frames (~4 sec) to capture the action. We just do average pooling on the tokens to remove some noise from a single-token prediction; this empirically gives a slight boost in performance (only ~0.2 mAP). Apart from that, every token attends to every other token (sequence length of 128 tokens).
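To make the pooling step concrete, here is a hedged sketch of averaging per-frame action logits over a short temporal window instead of relying on a single token; the 12-frame width comes from the discussion above, while the function name and shapes are illustrative only:

```python
import torch.nn.functional as F

def smooth_action_logits(per_frame_logits, pool=12):
    """Average per-frame action logits over a `pool`-frame neighborhood.

    per_frame_logits: (T, num_actions) token-level predictions.
    Returns a tensor of the same shape where each frame's logits are
    averaged with their neighbors, removing some single-token noise.
    """
    T = per_frame_logits.shape[0]
    x = per_frame_logits.t().unsqueeze(0)                     # (1, num_actions, T)
    x = F.avg_pool1d(x, kernel_size=pool, stride=1,
                     padding=pool // 2, count_include_pad=False)
    return x.squeeze(0).t()[:T]                               # back to (T, num_actions)
```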
closing due to inactivity, please reopen if you have any questions.