vision4robotics/TCTrack

Changing the length of sequences

notnitsuj opened this issue · 3 comments

Hi,

Congrats on achieving this amazing result. I was so impressed by the speed of this tracker. However, I'm facing a serious accuracy loss when the object becomes occluded (fully, not partially). As can be seen in this video, when the bus goes under the bridge, TCTrack can no longer track it correctly. I'm getting the same result with some other aerial videos, while Stark can still handle all of them.

I'm thinking of improving this by trading a bit of speed for accuracy and extending the number of frames that the model saves. Doing so could help retain more temporal information and potentially recover the correct object. Is it possible to extend the number L? If so, can you provide some hints on where I should modify the code?

Thank you.

Thanks for your attention.
Because STARK merely updates its information independently and adopts a deeper CNN to chase tracking performance, it can avoid the interference caused by the environment. But in this paper, our key opinion is that the occluded template is also valuable for tracking tasks: imagine a human tracking a target; they won't close their eyes when the object is occluded. Based on this intuitive observation, we proposed a new temporal method. As mentioned in our paper, our tracker tries to keep a balance between performance and speed rather than only pursuing performance.

To improve the performance by trading a bit of speed, you can:

1. Use more representative features, i.e., adopt a deeper CNN or transformer backbone.
2. Increase the L in TAdaConv and the length of the training videos: the L in TAdaConv affects the temporal contexts, while the length of the training videos affects the long-term tracking performance (see the sketch after this list).
3. Introduce trajectory information into the tracking.
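For point 2, a minimal PyTorch sketch may help show where L enters. This is not the actual TAdaConv code in this repository; the class name, the pooling-based calibration head, and all shapes are assumptions made purely for illustration. The point it demonstrates is that the temporal-calibration weights are tied to the number of frames L.

```python
import torch
import torch.nn as nn


class ToyTemporalAdaptiveConv(nn.Module):
    """Illustrative sketch (not the TCTrack implementation): a 2D conv whose
    output is calibrated by features aggregated over the last L frames."""

    def __init__(self, channels, num_frames, kernel_size=3):
        super().__init__()
        self.num_frames = num_frames  # this is the "L" being discussed
        self.base_conv = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, bias=False)
        # The calibration head mixes the per-frame descriptors of the last
        # L frames, so its weight shape depends on L. Changing L therefore
        # changes the parameter shapes.
        self.calibration = nn.Sequential(
            nn.Linear(channels * num_frames, channels),
            nn.Sigmoid(),
        )

    def forward(self, frames):
        # frames: (B, L, C, H, W) -- the L most recent frames
        b, l, c, h, w = frames.shape
        assert l == self.num_frames
        # Global-average-pool each frame to a C-dim descriptor, concat over time.
        descriptors = frames.mean(dim=(3, 4)).reshape(b, l * c)
        scale = self.calibration(descriptors).view(b, c, 1, 1)
        # Calibrate the current frame's features with the temporal scale.
        current = frames[:, -1]
        return self.base_conv(current) * scale


if __name__ == "__main__":
    L = 3  # temporal length; increasing it enlarges the calibration layer
    layer = ToyTemporalAdaptiveConv(channels=16, num_frames=L)
    out = layer(torch.randn(2, L, 16, 32, 32))
    print(out.shape)  # torch.Size([2, 16, 32, 32])
```

In this toy layer the calibration input size is `channels * L`, so enlarging L directly enlarges the parameter tensors, which is why retraining comes into play (see the discussion below).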

Thanks a lot for your response.

  2. Increase the L in TAdaConv and the length of the training videos: the L in TAdaConv affects the temporal contexts, while the length of the training videos affects the long-term tracking performance.

Yes, I was thinking about increasing L in both parts. Is there a way to do so without retraining? I can see that this number is directly related to the structure of these modules, so I think this would be hard to achieve while still utilizing your trained model.

In our conference version, the pretrained model is tied to L: because the parameters of the temporal convolutions depend on L, the previous model cannot be used in the new structure. Besides, in our future work (the journal version), we try to handle this problem by proposing a new TAdaConv.
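To make this concrete, here is a self-contained toy sketch (the `make_calibration` helper is hypothetical, not code from this repository) of why a checkpoint trained with one L cannot simply be loaded once L is changed: the calibration weight shapes no longer match.

```python
import torch
import torch.nn as nn

# Toy layer whose calibration weights depend on L, mimicking why a checkpoint
# trained with one L cannot be loaded after changing L (shape mismatch).
# This is an illustrative assumption, not the repository's actual code.
def make_calibration(channels, num_frames):
    return nn.Linear(channels * num_frames, channels)


old = make_calibration(channels=16, num_frames=3)   # "pretrained" with L = 3
new = make_calibration(channels=16, num_frames=4)   # rebuilt with L = 4

try:
    # Strict loading fails: the checkpoint weight is (16, 48) but the new
    # module expects (16, 64), so the parameters cannot be reused as-is.
    new.load_state_dict(old.state_dict())
except RuntimeError as e:
    print("Shape mismatch, retraining (or a reshaping scheme) is needed:")
    print(e)
```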