Add MaxViT here for the high-res stage; code implementations exist in timm and in lucidrains' repo.
The basic idea is to split self-attention into block attention (local, within non-overlapping windows) and grid attention (global, over a sparse dilated grid), alternating the two.
Reason: full global attention is too expensive at high resolution, while purely local attention underfits the data.
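A minimal NumPy sketch of the two partitioning patterns, assuming square inputs divisible by the window/grid size. A parameter-free scaled dot-product attention stands in for the learned multi-head layers; the `window`/`grid` argument names are mine, not timm's API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Parameter-free scaled dot-product self-attention over (batch, seq, dim);
    a stand-in for the learned multi-head layer in the real model."""
    d = x.shape[-1]
    return softmax(x @ x.transpose(0, 2, 1) / np.sqrt(d)) @ x

def block_attention(x, window):
    """Local attention: split the (H, W) map into non-overlapping
    window x window blocks and attend within each block."""
    B, H, W, C = x.shape
    w = window
    t = x.reshape(B, H // w, w, W // w, w, C)
    t = t.transpose(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
    t = self_attention(t)
    t = t.reshape(B, H // w, W // w, w, w, C).transpose(0, 1, 3, 2, 4, 5)
    return t.reshape(B, H, W, C)

def grid_attention(x, grid):
    """Global (sparse) attention: overlay a grid x grid lattice and attend
    across tokens at the same relative position in each cell, giving a
    dilated, image-wide receptive field at the same sequence length."""
    B, H, W, C = x.shape
    g = grid
    t = x.reshape(B, g, H // g, g, W // g, C)
    t = t.transpose(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)
    t = self_attention(t)
    t = t.reshape(B, H // g, W // g, g, g, C).transpose(0, 3, 1, 4, 2, 5)
    return t.reshape(B, H, W, C)

# Both patterns attend over k*k-token sequences, O(H*W * k^2) total,
# instead of O((H*W)^2) for full self-attention.
x = np.random.randn(2, 8, 8, 16)
print(block_attention(x, 4).shape, grid_attention(x, 4).shape)
```

A MaxViT block then runs MBConv, block attention, and grid attention in sequence; the sketch above only covers the two attention partitions.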