Long time consumed during build_score_decoder
MrNeoBlue opened this issue · 4 comments
Hello, first of all, thanks for sharing this wonderful work~
My problem is that when I tried to test the mixformer2_vit_online model on GOT-10k and LaSOT, initialization took tens of minutes on an A100. I monitored the process and found that it was stuck in the function `build_score_decoder` inside `build_mixformer_vit_online`. It seems that only about 10 MB was being loaded to the GPU every 5 or 10 seconds. Is there any solution to this issue?
Also, I'm a little confused about the model versions. Correct me if I'm wrong:

- `mixformer2_vit_online` stands for MixFormerV2-B-256?
- `mixformer2_stu` stands for MixFormerV2-S?
Can you check your test script? It should call the function `build_mixformer2_vit_online` instead of `build_mixformer_vit_online`.
`mixformer2_vit_online` and `mixformer2_stu` do not correspond to MixFormerV2-B-256 and MixFormerV2-S. `mixformer2_stu` is for distillation training, and `mixformer2_vit_online` is for score-head training. The differences between MixFormerV2-B and -S lie in hyperparameters such as image size and model depth, which are set in the configuration files in the `experiments` directory.
`build_mixformer2_vit_online` is the imported function name. I'm sure the called function `build_mixformer_vit_online` is from `lib/models/mixformer2_vit_online.py`. The startup still spends a lot of time in the sub-function `build_score_decoder`.
So you mean that during distillation the score head is frozen, and distillation and pruning are applied only to the MixCvT backbone?
UPDATE:
To be more accurate, the code got stuck at line 27 of `lib/models/mixformer2_vit/head.py`:

```python
with torch.no_grad():
    self.indice = torch.arange(0, feat_sz).unsqueeze(0).cuda() * stride  # (1, feat_sz)
```

Upgrading torch to 1.7.1 with CUDA 11.0 solved my problem.
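For anyone hitting the same hang, a minimal sketch of the same index-precomputation pattern, rewritten to select the device explicitly via `torch.arange`'s `device` argument instead of a separate `.cuda()` call. The `feat_sz` and `stride` values here are illustrative, not taken from the repository's config files:

```python
import torch

# Illustrative values; the real ones come from the experiment config.
feat_sz, stride = 16, 16

# Pick the target device up front; falls back to CPU when CUDA is unavailable,
# so the same code runs everywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with torch.no_grad():
    # (1, feat_sz) tensor mapping each feature cell to its pixel coordinate.
    indice = torch.arange(0, feat_sz, device=device).unsqueeze(0) * stride

print(indice.shape)  # torch.Size([1, 16])
```

Note this only restructures the tensor creation; the hang reported above was caused by a torch/CUDA version mismatch, and the actual fix was the upgrade to torch 1.7.1 with CUDA 11.0.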