Issues about attention type.

Thank you for your work. It is great!

I have a question from running your code: what is the difference between the 'space_only' and 'time_space_joint' attention types? They look identical to each other in the code.

Thanks for your interest! 'space_only' performs only spatial attention, while 'time_space_joint' performs both spatial and temporal attention, as you can see in the following:

Endo-FM/models/timesformer.py

Lines 233 to 235 in 1b33496

    if self.attention_type != 'space_only':
        self.time_embed = nn.Parameter(torch.zeros(1, num_frames, embed_dim))
        self.time_drop = nn.Dropout(p=drop_rate)

Endo-FM/models/timesformer.py

Lines 312 to 327 in 1b33496

    # Time Embeddings
    if self.attention_type != 'space_only':
        cls_tokens = x[:B, 0, :].unsqueeze(1)
        x = x[:, 1:]
        x = rearrange(x, '(b t) n m -> (b n) t m', b=B, t=T)
        # Resizing time embeddings in case they don't match
        if T != self.time_embed.size(1):
            time_embed = self.time_embed.transpose(1, 2)
            new_time_embed = F.interpolate(time_embed, size=(T), mode='nearest')
            new_time_embed = new_time_embed.transpose(1, 2)
            x = x + new_time_embed
        else:
            x = x + self.time_embed
        x = self.time_drop(x)
        x = rearrange(x, '(b n) t m -> b (n t) m', b=B, t=T)
        x = torch.cat((cls_tokens, x), dim=1)
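
For the resizing branch, it may help to see which learned time embedding each input frame ends up with when `T` differs from `num_frames`. The following is a rough pure-Python sketch, not code from the repository: `nearest_source_indices` is a hypothetical helper, and it assumes the usual index rule for `mode='nearest'` (source index = floor of the destination index times input length over output length).

```python
def nearest_source_indices(num_learned, num_input):
    """Return, for each input frame, the index of the learned time embedding it receives.

    Hypothetical helper that mimics the index rule assumed for
    F.interpolate(..., mode='nearest'):
        src = floor(dst * num_learned / num_input), clamped to the last index.
    """
    return [
        min(i * num_learned // num_input, num_learned - 1)
        for i in range(num_input)
    ]


# Model created with num_frames=8, clip has T=4: every other embedding is used.
print(nearest_source_indices(8, 4))  # [0, 2, 4, 6]

# Model created with num_frames=4, clip has T=8: each embedding is repeated twice.
print(nearest_source_indices(4, 8))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With 'space_only' none of this runs, since `time_embed` is never created and the tokens carry no time embedding at all.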
