ViT?

Question

Opened this issue 2 months ago · 0 comments

After feature extraction, can a vision transformer (ViT) be used to process the extracted features? Specifically, is it possible to apply position embeddings to those features?
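
To make the question concrete, here is a minimal sketch of what I have in mind. It assumes the extracted features form a `(C, H, W)` map, flattens them into `H*W` tokens, and adds a position embedding before the tokens would go into a transformer encoder. The function names and the feature-map shape are hypothetical, and I use a fixed sinusoidal embedding only to keep the sketch framework-free (the original ViT uses learned position embeddings).

```python
import numpy as np


def sincos_position_embedding(num_tokens: int, dim: int) -> np.ndarray:
    """Fixed 1D sinusoidal position embedding of shape (num_tokens, dim)."""
    assert dim % 2 == 0, "embedding dimension must be even"
    positions = np.arange(num_tokens)[:, None]                      # (N, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = positions * freqs[None, :]                             # (N, dim/2)
    emb = np.zeros((num_tokens, dim))
    emb[:, 0::2] = np.sin(angles)  # even channels
    emb[:, 1::2] = np.cos(angles)  # odd channels
    return emb


def features_to_tokens(feature_map: np.ndarray) -> np.ndarray:
    """Flatten a (C, H, W) feature map into (H*W, C) tokens and add positions."""
    c, h, w = feature_map.shape
    tokens = feature_map.reshape(c, h * w).T  # one token per spatial location
    return tokens + sincos_position_embedding(h * w, c)


# Hypothetical feature map: 64 channels on a 7x7 grid (random stand-in data).
features = np.random.default_rng(0).normal(size=(64, 7, 7))
tokens = features_to_tokens(features)
print(tokens.shape)  # (49, 64)
```

The resulting `tokens` array is what I would pass to the transformer encoder layers. Is this a reasonable way to do it, or is there a reason position embeddings should not be applied to already-extracted features?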