about self.gate2
Closed this issue · 1 comment
dunknsabsw commented
In llama/model's Attention, there is:
self.gate2 = torch.nn.Parameter(torch.ones(1, self.n_local_heads, 1, 1) * -args.bias)
and in Attention's forward, when computing the attention score map, there is:
vt_scores[:, :, video_start + self.max_feats:, video_start:video_start + self.max_feats] = \
    vt_scores[:, :, video_start + self.max_feats:, video_start:video_start + self.max_feats] + \
    self.gate2.half()
I don't understand this. What is gate2 used for?
dunknsabsw commented
I guess this is because the video projection is trained from scratch. Since gate2 is initialized to -args.bias (a large negative value), adding it to the scores before the softmax lets the tokens after the video ignore the video tokens when computing attention at the early stage of training; as training proceeds, gate2 is a learnable parameter, so the video tokens can gradually regain influence.
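To make the guess above concrete, here is a minimal numpy sketch (not code from the repo; the sizes, the `bias` value, and the random scores are illustrative assumptions) showing that adding a large negative constant to the score block where post-video tokens attend to video tokens drives those attention weights toward zero after the softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes for illustration only (not taken from the repo).
n_tokens = 6      # total sequence length
video_start = 1   # index of the first video token
max_feats = 2     # number of video tokens
bias = 10.0       # stands in for args.bias; gate2 starts at -bias

rng = np.random.default_rng(0)
scores = rng.normal(size=(n_tokens, n_tokens))  # raw q·k scores for one head

# Rows: text tokens that come after the video; columns: the video tokens.
# This mirrors the slice pattern in Attention.forward.
gated = scores.copy()
gated[video_start + max_feats:, video_start:video_start + max_feats] += -bias

attn_plain = softmax(scores, axis=-1)
attn_gated = softmax(gated, axis=-1)

# With gate2 at its large negative initialization, post-video tokens put
# almost no attention mass on the video tokens.
video_mass = attn_gated[video_start + max_feats:,
                        video_start:video_start + max_feats].sum(axis=-1)
print(video_mass)  # each entry is close to zero
```

Because gate2 is a Parameter rather than a fixed mask, gradient updates can move it away from -bias, so this suppression is only an initialization-time effect.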