Some questions about the paper
laohuijiadezhu opened this issue · 1 comment
laohuijiadezhu commented
jasongief commented
Hi, the sentence means that some methods [5, 12] use the noise as supervision, while AVEL [28] aims to identify the (audio-visual) paired video samples. For AVE localization, an audio-visual pair that depicts the same event (i.e., a paired sample) can be used in a self-supervised manner.