lucidrains/vit-pytorch

structural 3D ViT

aperiamegh opened this issue · 4 comments

Just wanted to check whether the 3D model, made for videos, can be repurposed as a 3D structural transformer with frame = z?

Is there a fundamental difference between x,y,z if I use it like so, or is it symmetrical?

@aperiamegh could you link me to some literature? do you mean for point clouds?

A 3D image, e.g. a numpy array with shape (batch_size, channels, 64(z), 64(x), 64(y)).
You could call it a voxel image, maybe.
So I am asking whether I can set frames = 64 (z dim), and frame_patch_size to the same value as patch_size.

What I am unsure about is whether this z (frame) dimension is symmetrical with x, y. That is, if I rearrange my numpy array into (batch_size, channels, 64(x), 64(y), 64(z)), will I get the same calculations / model performance? Or is the frame dim handled differently, as a temporal dimension?
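
To make the question concrete, here is a minimal NumPy sketch of cubic patching. It is not the repository's code: `patchify_3d` and the variable names are hypothetical, and it only illustrates the tokenization step, assuming non-overlapping cubic patches with equal patch size on all three axes. Under that assumption, moving z from the first spatial axis to the last yields the same patches, only with the token order and the element order inside each token permuted.

```python
import numpy as np

def patchify_3d(vol, p):
    """Split a (C, D, H, W) volume into non-overlapping p*p*p cubes.

    Returns an array of shape (D//p, H//p, W//p, C, p, p, p).
    Hypothetical helper for illustration only.
    """
    c, d, h, w = vol.shape
    assert d % p == 0 and h % p == 0 and w % p == 0
    x = vol.reshape(c, d // p, p, h // p, p, w // p, p)
    # (c, gd, pd, gh, ph, gw, pw) -> (gd, gh, gw, c, pd, ph, pw)
    return x.transpose(1, 3, 5, 0, 2, 4, 6)

rng = np.random.default_rng(0)
vol_zxy = rng.standard_normal((1, 8, 8, 8))  # (channels, z, x, y)
vol_xyz = vol_zxy.transpose(0, 2, 3, 1)      # (channels, x, y, z)

a = patchify_3d(vol_zxy, 4)  # grid (gz, gx, gy), patch axes (c, pz, px, py)
b = patchify_3d(vol_xyz, 4)  # grid (gx, gy, gz), patch axes (c, px, py, pz)

# Flatten each cube into one token vector, as a patch embedding would see it.
tokens_a = a.reshape(-1, 1 * 4 * 4 * 4)
tokens_b = b.reshape(-1, 1 * 4 * 4 * 4)

# Same patches; only grid order and within-patch axis order differ.
same_up_to_permutation = np.array_equal(b, a.transpose(1, 2, 0, 3, 5, 6, 4))
```

Whether a trained model ends up with the same performance in both layouts also depends on how the network uses token order (for example, its positional embedding), which this sketch does not cover.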

@aperiamegh yup, that would work! so the networks in this repository will not account for rotational symmetry

you may need to look at some of the fancier equivariant transformers for that (or simply augment your dataset by flipping along an axis in your dataset class)
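
The flip augmentation suggested above could look roughly like this. It is a sketch under assumptions: `random_axis_flip` is a hypothetical helper, the volume layout is taken to be (channels, z, x, y), and it would be called on each sample inside a dataset class's item-loading method.

```python
import numpy as np

def random_axis_flip(vol, rng, p=0.5):
    """Flip each spatial axis of a (C, Z, X, Y) volume independently with probability p.

    Hypothetical helper for illustration only.
    """
    axes = tuple(ax for ax in (1, 2, 3) if rng.random() < p)
    if axes:
        vol = np.flip(vol, axis=axes)
    # np.flip returns a view with negative strides; make a contiguous copy
    # so the result can be handed to code that requires positive strides.
    return np.ascontiguousarray(vol)

rng = np.random.default_rng(0)
vol = rng.standard_normal((1, 64, 64, 64)).astype(np.float32)
aug = random_axis_flip(vol, rng)
```

Flips only cover reflections. Covering 90-degree rotations as well would need something like `np.rot90` over pairs of spatial axes.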

Thanks for clarifying. Rotational symmetry and augmentation are already accounted for. I wanted to check whether the transformer handles the dimensions similarly.