Implementation of CvT: Introducing Convolutions to Vision Transformers.
This repository provides the vision attention and embedding layers.
Reference Papers
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- CvT: Introducing Convolutions to Vision Transformers
└ model_layer
  └ tokenizer.py
    - class ImageTokenizer
    - class ImageStacker
    - ...
  └ transforemr.py
    - class LineTokenConvTransformer
    - class ConvTokenConvTransformer
    - class SelfConvTransfomer
    - class CrossConvTransformer
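The tokenizer classes are listed above without usage examples. As a rough, self-contained illustration of patch tokenization in the ViT/CvT style (the function name and behavior here are assumptions for illustration, not this repository's `ImageTokenizer` API):

```python
import torch

def image_to_patch_tokens(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split (B, C, H, W) images into a (B, N, C*patch*patch) token sequence.

    Illustrative sketch only; the repo's ImageTokenizer may differ.
    """
    # unfold extracts non-overlapping patch_size x patch_size patches
    patches = torch.nn.functional.unfold(
        images, kernel_size=patch_size, stride=patch_size
    )
    # patches: (B, C*patch*patch, N) -> (B, N, C*patch*patch)
    return patches.transpose(1, 2)

tokens = image_to_patch_tokens(torch.ones(8, 3, 16, 16), patch_size=4)
# tokens.shape == torch.Size([8, 16, 48]): 16 patches, each 3*4*4 = 48 values
```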
```python
import torch

from model_layer.transforemr import (
    LineTokenConvTransformer,
    ConvTokenConvTransformer,
    SelfConvTransfomer,
    CrossConvTransformer,
)

# LineTokenConvTransformer
tensor = torch.ones([8, 3, 16, 16])  # torch.Size([8, 3, 16, 16])
layer = LineTokenConvTransformer((16, 16), (4, 4), 3, 3)
outputs = layer(tensor)              # torch.Size([8, 3, 4, 4])

# ConvTokenConvTransformer
tensor = torch.ones([8, 3, 16, 16])  # torch.Size([8, 3, 16, 16])
layer = ConvTokenConvTransformer((16, 16), (4, 4), 3, 3)
outputs = layer(tensor)              # torch.Size([8, 3, 4, 4])

# SelfConvTransfomer
tensor = torch.ones([8, 3, 16, 16])  # torch.Size([8, 3, 16, 16])
layer = SelfConvTransfomer((16, 16), 3, 2)
outputs = layer(tensor)              # torch.Size([8, 3, 16, 16])

# CrossConvTransformer
tensor1 = torch.ones([8, 3, 16, 16])  # torch.Size([8, 3, 16, 16])
tensor2 = torch.ones([8, 3, 16, 16])  # torch.Size([8, 3, 16, 16])
layer = CrossConvTransformer((16, 16), 3, 1)
outputs = layer(tensor1, tensor2)     # torch.Size([8, 3, 16, 16])
```
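For context, CvT's central idea is to replace ViT's linear Q/K/V projections with convolutional projections, which keep local spatial structure and allow downsampling via stride. The sketch below shows such a projection in isolation; it is illustrative only, and the class name and defaults are assumptions rather than this repository's implementation:

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Depthwise + pointwise convolution, in the style of CvT's Q/K/V projection.

    Illustrative sketch; not this repo's actual layer.
    """
    def __init__(self, channels: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        padding = kernel_size // 2
        # depthwise conv mixes spatial context within each channel
        self.depthwise = nn.Conv2d(
            channels, channels, kernel_size,
            stride=stride, padding=padding, groups=channels, bias=False,
        )
        # pointwise conv mixes information across channels
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pointwise(self.depthwise(x))  # (B, C, H', W')
        # flatten spatial dims into a token sequence: (B, H'*W', C)
        return x.flatten(2).transpose(1, 2)

proj = ConvProjection(3, stride=2)
tokens = proj(torch.ones(8, 3, 16, 16))
# tokens.shape == torch.Size([8, 64, 3]): stride 2 halves 16x16 to 8x8 = 64 tokens
```

A stride of 1 is typically used for queries, while keys and values can use stride 2 to shrink the attention cost.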
Base CvT code is borrowed from @rishikksh20:
https://github.com/rishikksh20/convolution-vision-transformers
Base embedding code is borrowed from @FrancescoSaverioZuppichini:
https://github.com/FrancescoSaverioZuppichini/ViT