wilile26811249/Fastformer-PyTorch

Questions about your implementation


p = repeat_global_query * key

First of all, thank you for your smart & fast implementation. I'm currently studying this code ^^

Comparing the code with the architecture figure in the paper:

In line 29, you declare p as repeat_global_query * key.
After that, when calculating beta_weight, I think this computed p value should be used instead of key.
So,
beta_weight = torch.softmax(key * self.scale_factor, dim = -1)
->
beta_weight = torch.softmax(p * self.scale_factor, dim = -1)

like this, and after that, similarly:

global_key = key * beta_weight
->
global_key = p * beta_weight

What do you think?

Yes, you are right. I had found this problem too, but I missed fixing the global key. Thanks.
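For reference, here is a minimal sketch of the corrected global-key step under the fix discussed above. The variable names (repeat_global_query, key, p, beta_weight, global_key, scale_factor) follow the issue; the shapes, the stand-in global_query tensor, and the final pooling over the sequence dimension are assumptions for illustration, not the author's exact code.

```python
import torch

# Assumed shapes: key is (batch, seq_len, dim); global_query is (batch, dim),
# i.e. the query already pooled by additive attention (stand-in here).
batch, seq_len, dim = 2, 8, 16
key = torch.randn(batch, seq_len, dim)
global_query = torch.randn(batch, dim)
scale_factor = dim ** -0.5

# Broadcast the global query over the sequence and mix it with the key,
# matching the quoted line 29: p = repeat_global_query * key
repeat_global_query = global_query.unsqueeze(1).expand(-1, seq_len, -1)
p = repeat_global_query * key

# Corrected step: the beta weights are computed from p (the query/key
# interaction), not from the raw key.
beta_weight = torch.softmax(p * scale_factor, dim=-1)

# Corrected step: the global key is built from p as well. How the result is
# pooled to a single vector is assumed here (sum over the sequence dimension).
global_key = (p * beta_weight).sum(dim=1)   # (batch, dim)

print(global_key.shape)  # torch.Size([2, 16])
```

This mirrors the paper's formulation, where the global key is an attention-weighted combination of the p vectors (global query times key), rather than of the keys themselves.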