DeLightCMU/PSA

It seems that the implementations of Channel-only self-attention and Spatial-only self-attention are swapped with each other.


Thanks for sharing your work!

def spatial_pool(self, x):
    input_x = self.conv_v_right(x)
    batch, channel, height, width = input_x.size()
    # [N, IC, H*W]
    input_x = input_x.view(batch, channel, height * width)
    # [N, 1, H, W]
    context_mask = self.conv_q_right(x)
    # [N, 1, H*W]
    context_mask = context_mask.view(batch, 1, height * width)
    # [N, 1, H*W]
    context_mask = self.softmax_right(context_mask)
    # [N, IC, 1]
    # context = torch.einsum('ndw,new->nde', input_x, context_mask)
    context = torch.matmul(input_x, context_mask.transpose(1, 2))
    # [N, IC, 1, 1]
    context = context.unsqueeze(-1)
    # [N, OC, 1, 1]
    context = self.conv_up(context)
    # [N, OC, 1, 1]
    mask_ch = self.sigmoid(context)
    out = x * mask_ch
    return out


It seems that the spatial_pool function is actually the Channel-only Self-Attention module.
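For illustration, here is a self-contained sketch of the same computation with toy layers (the layer names and channel sizes below are illustrative assumptions, not the repo's actual module definition). The resulting mask has shape [N, C, 1, 1], i.e. one weight per channel:

import torch
import torch.nn as nn

# Standalone sketch of the computation in spatial_pool above (toy sizes).
N, C, H, W = 2, 64, 16, 16
x = torch.randn(N, C, H, W)

conv_v_right = nn.Conv2d(C, C // 2, kernel_size=1)  # value projection
conv_q_right = nn.Conv2d(C, 1, kernel_size=1)       # query projection
conv_up = nn.Conv2d(C // 2, C, kernel_size=1)       # channel up-projection
softmax_right = nn.Softmax(dim=2)
sigmoid = nn.Sigmoid()

input_x = conv_v_right(x).view(N, C // 2, H * W)                 # [N, C/2, H*W]
context_mask = softmax_right(conv_q_right(x).view(N, 1, H * W))  # [N, 1, H*W]
context = torch.matmul(input_x, context_mask.transpose(1, 2))    # [N, C/2, 1]
mask_ch = sigmoid(conv_up(context.unsqueeze(-1)))                # [N, C, 1, 1]

print(mask_ch.shape)  # torch.Size([2, 64, 1, 1]): one weight per channel,
                      # i.e. a channel-attention mask, not a spatial one
out = x * mask_ch     # broadcasts over H and W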

Can you explain what you mean in more detail?

@khoshsirat

Does the spatial_pool function implement Channel-only Self-Attention?

Does the channel_pool function implement Spatial-only Self-Attention?

OK, I see it now:
The spatial_pool function should be renamed to channel_pool and the channel_pool function should be renamed to spatial_pool.

I have found another discrepancy too:
In the channel_pool function (which should be renamed to spatial_pool), softmax is applied after the matmul. But in the paper, in the Spatial-only Self-Attention block, softmax is applied before the matmul.
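To make the difference concrete, here is a toy sketch of the two orderings (the tensors below are stand-ins for the pooled query descriptor and the flattened value map, not the repo's actual variables):

import torch
import torch.nn.functional as F

N, Cq, HW = 2, 32, 64
q = torch.randn(N, 1, Cq)   # pooled query descriptor, [N, 1, C/2]
v = torch.randn(N, Cq, HW)  # value map flattened over space, [N, C/2, H*W]

# Paper's description: softmax over the query's channel dimension, then matmul.
paper_order = torch.matmul(F.softmax(q, dim=-1), v)  # [N, 1, H*W]

# This repo's channel_pool: matmul first, then softmax over the H*W positions.
code_order = F.softmax(torch.matmul(q, v), dim=-1)   # [N, 1, H*W]

# The two orderings are not equivalent in general.
print(torch.allclose(paper_order, code_order))  # typically False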

@khoshsirat
You are right.
The location of the softmax operation in the channel_pool function is different from what the paper describes.
What's going on?
Which one is correct?

Hi guys, I have created a gist to compare this implementation against External-Attention-pytorch's. Through a simple test case, I found that the outputs are different with Kaiming init.
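For reference, a minimal sketch of that kind of comparison harness (block_a and block_b stand for the two PSA implementations, both assumed to take an [N, C, H, W] tensor; the constructor calls are omitted):

import torch
import torch.nn as nn

def kaiming_init(m):
    # Re-initialize conv layers with Kaiming-normal weights and zero bias.
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def outputs_match(block_a: nn.Module, block_b: nn.Module, seed: int = 0) -> bool:
    # Apply the same init (and seed) to both blocks, feed the same input,
    # and check whether the outputs agree.
    torch.manual_seed(seed)
    block_a.apply(kaiming_init)
    torch.manual_seed(seed)
    block_b.apply(kaiming_init)
    x = torch.randn(1, 64, 32, 32)
    with torch.no_grad():
        return torch.allclose(block_a(x), block_b(x))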

Any idea why?