about self.gate2
Closed this issue · 1 comment
dunknsabsw commented
In llama/model's Attention, there is:
self.gate2 = torch.nn.Parameter(torch.ones(1, self.n_local_heads, 1, 1) * -args.bias)
and in Attention's forward, when computing the attention score map, there is:
vt_scores[:, :, video_start + self.max_feats:, video_start:video_start + self.max_feats] = \
    vt_scores[:, :, video_start + self.max_feats:, video_start:video_start + self.max_feats] + \
    self.gate2.half()
I don't understand this. What is gate2 used for?
dunknsabsw commented
I guess this is because the video projection is trained from scratch. Since gate2 is initialized to -args.bias (a large negative value), adding it to the scores before the softmax lets the tokens after the video ignore the video tokens when computing attention at the early stage of training; as training proceeds, gate2 is a learnable parameter, so the video tokens can gradually regain influence.
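To make the guess above concrete, here is a minimal numpy sketch (not code from the repo; the sizes, the `bias` value, and the random scores are illustrative assumptions) showing that adding a large negative constant to the score block where post-video tokens attend to video tokens drives those attention weights toward zero after the softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes for illustration only (not taken from the repo).
n_tokens = 6      # total sequence length
video_start = 1   # index of the first video token
max_feats = 2     # number of video tokens
bias = 10.0       # stands in for args.bias; gate2 starts at -bias

rng = np.random.default_rng(0)
scores = rng.normal(size=(n_tokens, n_tokens))  # raw q·k scores for one head

# Rows: text tokens that come after the video; columns: the video tokens.
# This mirrors the slice pattern in Attention.forward.
gated = scores.copy()
gated[video_start + max_feats:, video_start:video_start + max_feats] += -bias

attn_plain = softmax(scores, axis=-1)
attn_gated = softmax(gated, axis=-1)

# With gate2 at its large negative initialization, post-video tokens put
# almost no attention mass on the video tokens.
video_mass = attn_gated[video_start + max_feats:,
                        video_start:video_start + max_feats].sum(axis=-1)
print(video_mass)  # each entry is close to zero
```

Because gate2 is a Parameter rather than a fixed mask, gradient updates can move it away from -bias, so this suppression is only an initialization-time effect.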