MCG-NJU/MeMOTR

Input format: Training on one frame of the video clip?

sawhney-medha opened this issue · 6 comments

Can you please elaborate on "The batch size is set to 1 per GPU, and each batch contains a video clip with multiple frames. Within each clip, video frames are sampled with random intervals from 1 to 10."

Does this mean the actual model is trained on one frame at a time randomly selected from the clip? I am trying to understand the actual input to the transformer encoder and decoder.

Also, what is the role of no_grad_frames?

Thank you!!

In our experiments, batch_size refers to the number of video clips (samples). So "the batch size is set to 1 per GPU" means we process one video clip (which contains multiple frames) on each GPU. Within each clip, the inter-frame interval is a random number from 1 to 10.
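For illustration, here is a minimal sketch of that sampling scheme, under one plausible reading (a single random interval per clip); the function and its details are hypothetical, not the actual dataset code in this repo:

```python
import random

def sample_clip(video_length: int, clip_length: int = 5, max_interval: int = 10):
    """Hypothetical sketch: sample `clip_length` frame indices with one
    random inter-frame interval drawn from [1, max_interval]."""
    assert video_length >= clip_length
    interval = random.randint(1, max_interval)
    span = (clip_length - 1) * interval
    if span >= video_length:   # clip would not fit: fall back to interval 1
        interval = 1
        span = clip_length - 1
    start = random.randint(0, video_length - 1 - span)
    return [start + i * interval for i in range(clip_length)]

# Example: sample_clip(800) might return [120, 127, 134, 141, 148] (interval 7).
```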

no_grad_frames means that those frames are forwarded in grad-free (no-gradient) mode:

MeMOTR/train_engine.py (lines 217 to 230 at f46ae3d):

```python
with torch.no_grad():
    frame = [fs[frame_idx] for fs in batch["imgs"]]
    for f in frame:
        f.requires_grad_(False)
    frame = tensor_list_to_nested_tensor(tensor_list=frame).to(device)
    res = model(frame=frame, tracks=tracks)
    previous_tracks, new_tracks, unmatched_dets = criterion.process_single_frame(
        model_outputs=res,
        tracked_instances=tracks,
        frame_idx=frame_idx
    )
    if frame_idx < len(batch["imgs"][0]) - 1:
        tracks = get_model(model).postprocess_single_frame(
            previous_tracks, new_tracks, unmatched_dets, no_augment=frame_idx < no_grad_frames - 1)
```
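The idea behind no_grad_frames can be sketched roughly like this (a simplified, hypothetical view, not the exact training loop; init_tracks and update_tracks are illustrative stand-ins):

```python
import torch

tracks = init_tracks()            # hypothetical: empty tracks at t = 0
T = len(batch["imgs"][0])         # clip length

for frame_idx in range(T):
    if frame_idx < no_grad_frames:
        # Warm-up frame: only builds track states, so its forward pass
        # does not need to store activations for backpropagation.
        with torch.no_grad():
            res = model(frame=frames[frame_idx], tracks=tracks)
    else:
        res = model(frame=frames[frame_idx], tracks=tracks)  # trained normally
    # In both cases the tracks are still updated and carried forward.
    tracks = update_tracks(res, tracks)  # hypothetical helper
```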

However, we deprecated this part in our experiments; I just have not deleted the code from this repo. My suggestion is not to pay attention to this process, since enabling it will not bring performance improvements.

Thank you for the prompt reply!! This is helpful.

The input to the model (backbone and encoder/decoder) is a single frame at a time, right? So the way we use the temporal information from the video clip is through the track information/embedding and memory. Am I understanding correctly?

Thank you again :)

Yes. We process only one frame at each time step. The track embedding will propagate the temporal information.
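For intuition, the state carried by each track between frames can be pictured roughly like this (a hypothetical simplification; the actual repo keeps richer per-track state, such as the embed and ref_pts mentioned below):

```python
from dataclasses import dataclass
import torch

@dataclass
class Track:
    """Hypothetical per-trajectory state propagated across time steps."""
    track_id: int          # persistent identity of the object
    embed: torch.Tensor    # track embedding carrying temporal information
    ref_pts: torch.Tensor  # reference points guiding the decoder queries
```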

The only difference is that during training, we process multiple time steps before calling optimizer.step(). In this way, the model can learn temporal modeling.

In other words, each training iteration processes T time steps.
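A simplified sketch of one training iteration under this description (init_tracks, update_tracks, and compute_frame_loss are hypothetical stand-ins for the repo's actual bookkeeping):

```python
optimizer.zero_grad()
tracks = init_tracks()             # hypothetical: no tracks before the first frame
losses = []

for frame_idx in range(T):         # T = number of frames in the clip
    res = model(frame=frames[frame_idx], tracks=tracks)
    losses.append(compute_frame_loss(res))   # hypothetical per-frame loss
    tracks = update_tracks(res, tracks)      # hypothetical track propagation

total_loss = sum(losses)
total_loss.backward()              # gradients flow back through all T frames
optimizer.step()                   # a single parameter update per clip
```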

Thank you so much! Can you also please explain how the process_single_frame function works? I want to understand how tracks are generated and how sub-clips are connected to each other during prediction. Thank you!!

Our model, as an online tracker, processes the image sequence frame by frame. So the function criterion.process_single_frame handles the criterion (matching and loss computation) for a single frame at a time. For example, as shown below:

```python
for frame_idx in range(len(batch["imgs"][0])):
```

We will call this function (criterion.process_single_frame) T times in each training iteration, where T is the sampling length for each video clip (from 2 to 5 in our setting on DanceTrack).

At the same time, the function criterion.process_single_frame also generates the track information (embed & ref_pts, etc.) for the next time step, as shown here:

MeMOTR/train_engine.py (lines 223 to 227 at f46ae3d):

```python
previous_tracks, new_tracks, unmatched_dets = criterion.process_single_frame(
    model_outputs=res,
    tracked_instances=tracks,
    frame_idx=frame_idx
)
```

It updates the tracked trajectories previous_tracks and the newborn trajectories new_tracks. Then, they are combined into the overall tracks here:

MeMOTR/train_engine.py (lines 229 to 230 at f46ae3d):

```python
tracks = get_model(model).postprocess_single_frame(
    previous_tracks, new_tracks, unmatched_dets, no_augment=frame_idx < no_grad_frames - 1)
```

Then, the tracks are fed into the processing of the next frame, like here:

```python
res = model(frame=frame, tracks=tracks)
```

which connects the frames in the video clip by propagating the trajectories frame by frame. Therefore, our model achieves a fully end-to-end training strategy and can backpropagate the gradients all the way to the beginning (the first frame).
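For completeness, here is a hypothetical sketch of the same propagation at inference time, where the online tracker consumes a stream of frames without gradients (init_tracks, update_tracks, and read_out_boxes are illustrative helpers, not the repo's API):

```python
import torch

def track_video(model, video_stream):
    """Hypothetical online-inference sketch: one frame per time step."""
    tracks = init_tracks()                        # illustrative: empty track set
    results = []
    with torch.no_grad():                         # no training, no gradients
        for frame in video_stream:                # frames arrive one at a time
            res = model(frame=frame, tracks=tracks)
            tracks = update_tracks(res, tracks)   # illustrative propagation
            results.append(read_out_boxes(tracks))  # illustrative: boxes + IDs
    return results
```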