trzy/FasterRCNN

Please support to create new backbone based on ViT

Opened this issue · 0 comments

Dear @trzy,

Thank for great repo. I am trying to try the new backbone ViT from your source code. I using the similar template from file: vgg16_torch.py which modify the line 67:
vgg16 = torchvision.models.vgg16(weights = torchvision.models.VGG16_Weights.IMAGENET1K_V1, dropout = dropout_probability)
to
ViT = torchvision.models.vit_b_16(weights=ViT_B_16_Weights.DEFAULT)

Based on ViT concept, the feature should be like
# Expand the class token to the full batch
batch_class_token = vit.class_token.expand(img.shape[0], -1, -1)
feats = torch.cat([batch_class_token, feats], dim=1)
feats = vit.encoder(feats)
We're only interested in the representation of the classifier token that we appended at position 0.
feats = feats[:, 0]

I am still get lost to fix FeatureExtractor function to fix the concept. Please assist if possible.
Many thanks!