VegB/VLN-Transformer

image features


Thanks for sharing your code and pretrained models.
Unfortunately, without the exact code used to extract the image features from the StreetLearn/Touchdown panoramas, it is not possible to use the pretrained models. Could you please also share those scripts? From the paper(s) it is not clear how the features are extracted. For example, what field of view is used: 45 or 60? Another example: the paper says the features are 128×100×464 (then averaged over dim 0), but the code expects 1×464×100.
This would be very helpful for replicating the work.

VegB commented

Hi! Thanks for your question regarding image feature generation.

The horizontal rotation step is 45 degrees.
Here's part of the code I use to process the StreetLearn images:

import argparse
import os

import numpy as np
import torch
import torch.nn as nn
from torchvision import models

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


class ResNet:
    def __init__(self):
        resnet = models.resnet18(pretrained=True)
        # Keep everything up to layer2: the output has 128 channels
        # at 1/8 of the input resolution.
        modules = list(resnet.children())[:-4]
        resnet = nn.Sequential(*modules)
        resnet.eval()  # inference mode so BatchNorm uses its running stats
        for p in resnet.parameters():
            p.requires_grad = False
        self.resnet = resnet

    def __call__(self, x):
        return self.resnet(x)

    def to(self, device):
        self.resnet = self.resnet.to(device)
        return self


def encode_features(model, args):
    raw_features_dir = os.path.join(args.src_dir, args.dataset)
    mean_features_dir = os.path.join(args.dst_dir, args.dataset)
    os.makedirs(mean_features_dir, exist_ok=True)
    feature_filenames = os.listdir(raw_features_dir)
    for filename in feature_filenames:
        print('processing %s' % filename)
        try:
            features = np.load(os.path.join(raw_features_dir, filename))  # [8, 460, 800, 3]
        except Exception:
            print('%s failed.' % filename)
            continue
        features = torch.from_numpy(features).permute(0, 3, 2, 1).float().to(device)  # [8, 3, 800, 460]
        encoded_features = model(features)  # [8, 128, 100, 58]
        # Stitch the 8 views side by side along the width axis.
        encoded_features = torch.cat(torch.split(encoded_features, 1, dim=0), dim=3).squeeze(dim=0)  # [128, 100, 464]
        # Average over the 128 channels to get the per-pano mean feature map.
        features_mean = torch.mean(encoded_features, dim=0).unsqueeze(dim=2)  # [100, 464, 1]
        np.save(os.path.join(mean_features_dir, filename), features_mean.cpu().detach().numpy())
    print('%d pano features encoded!' % len(feature_filenames))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--src_dir', required=True)
    parser.add_argument('--dst_dir', required=True)
    parser.add_argument('--dataset', default='streetlearn')
    args = parser.parse_args()

    resnet18 = ResNet().to(device)
    encode_features(resnet18, args)
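
If you want to sanity-check the output, loading one of the saved files should give the per-pano mean feature map described in the comments above (the path here is just a placeholder):

import numpy as np

feat = np.load('/path/to/mean_features/streetlearn/some_pano.npy')  # placeholder path
print(feat.shape)  # (100, 464, 1): channel-mean of the stitched [128, 100, 464] map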

Hope you find this helpful :)

Hey, thanks for that answer. Unfortunately, there is still some uncertainty about how the image slices are generated. From what I understand (from your reply and the paper), the panorama (w=3000, h=1500) is sliced into 8 images (each w=460, h=800), with the horizontal center of each image rotated 45 degrees clockwise from the previous one. When cutting those slices out of the equirectangular projection, there is also a field-of-view (fov) parameter. Is it set to 45, 60, or something else? It would be very helpful to also have the script that cuts the panoramas, i.e. how exactly those [8, 460, 800, 3] numpy arrays are created.
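
To make the question more concrete, here is a minimal sketch of how I imagine the slicing could work, assuming a standard pinhole (perspective) projection from the equirectangular panorama. The function names and the nearest-neighbor sampling are my own illustration, and fov_deg is left as a parameter since that is exactly the open question:

import numpy as np


def cut_slice(pano, heading_deg, fov_deg, out_w=460, out_h=800):
    """Sample a perspective view of size (out_h, out_w) from an
    equirectangular panorama `pano` of shape [H, W, 3]."""
    H, W, _ = pano.shape
    # Focal length in pixels for the given *horizontal* field of view.
    f = (out_w / 2) / np.tan(np.radians(fov_deg) / 2)
    # Pixel coordinates relative to the image center (y grows downward).
    x = np.arange(out_w) - (out_w - 1) / 2
    y = np.arange(out_h) - (out_h - 1) / 2
    xv, yv = np.meshgrid(x, y)
    # Ray direction of each pixel, converted to longitude/latitude.
    lon = np.radians(heading_deg) + np.arctan2(xv, f)
    lat = np.arctan2(-yv, np.sqrt(xv ** 2 + f ** 2))
    # Map back to panorama pixel coordinates (nearest-neighbor sampling).
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((0.5 - lat / np.pi) * H).astype(int), 0, H - 1)
    return pano[v, u]


def cut_all_slices(pano, fov_deg):
    """8 views whose horizontal centers are 45 degrees apart.
    Returns [8, 800, 460, 3]; the arrays discussed above are
    [8, 460, 800, 3], so a transpose(0, 2, 1, 3) may be needed."""
    return np.stack([cut_slice(pano, h, fov_deg) for h in range(0, 360, 45)])

Note why the fov value matters: with fov_deg=45 the 8 slices would tile the full 360 degrees exactly, while with 60 each slice would overlap its neighbors by 15 degrees.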