NVlabs/SegFormer

Pre-training info

nargenziano opened this issue · 5 comments

Hello and thanks for the work.

I was wondering if you could share more info regarding the pre-training of MiT architectures.
I've read in other issues that the configs are the same as for pvt_v2, but what pre-training code did you actually use? Was it the PVT classification training code?
I tried editing the PyramidVisionTransformer model to make it identical to MiT B3 and ran PVT's ImageNet classification training from scratch. However, the classification performance was worse than expected for PVT-v2 B3 (around 77.3% Acc@1 instead of 83.1%). What is the expected pre-training performance of MiT?

I have the same question.

The paper states "We pre-train the encoder on the Imagenet-1K dataset".

Does this mean the encoder is first trained on a classification task? If so, is there code for this that you could share? I cannot find it in the repo.

Primarily, I want to be able to reproduce the "mit_*pth" files, either conceptually or with your code.

gauenk commented

following up...

Same question here. It seems that the classification head (commented out) in the MiT backbone won't work, because the output of stage 4 is B*49*512 and can't be followed directly by an nn.Linear to output B*1000.
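
For what it's worth, one common way to bridge that shape gap is to average over the 49 tokens before the linear layer, which I believe is what PVT v2-style classifiers do. Below is a minimal sketch under that assumption; it is not the authors' confirmed pre-training head. `PooledClassifierHead` is a hypothetical name, and the random tensor only stands in for the stage-4 output.

```python
import torch
import torch.nn as nn


class PooledClassifierHead(nn.Module):
    """Hypothetical head: LayerNorm -> mean over tokens -> Linear.

    Assumption: mirrors a PVT v2-style classifier; not confirmed to be
    what was used to produce the released MiT checkpoints.
    """

    def __init__(self, embed_dim: int = 512, num_classes: int = 1000):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C), e.g. (B, 49, 512) from stage 4
        tokens = self.norm(tokens)
        pooled = tokens.mean(dim=1)  # average over the N tokens -> (B, C)
        return self.fc(pooled)       # (B, num_classes)


# Stand-in for the stage-4 output of a batch of two 224x224 images.
tokens = torch.randn(2, 49, 512)
head = PooledClassifierHead()
logits = head(tokens)
print(logits.shape)  # torch.Size([2, 1000])
```

Applying nn.Linear directly to the B*49*512 tensor would give B*49*1000 (one prediction per token), so some pooling step like the mean above is needed to get a single B*1000 output.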

following up..

following up...