model architecture
riyaj8888 opened this issue · 1 comments
riyaj8888 commented
we apply a small number of 3D convolution and pooling operations to the video stream, reducing its temporal sampling rate by a factor of 4. We also apply a series of strided 1D convolutions to the input waveform, until its sampling rate matches that of the video network. We fuse the two subnetworks by concatenating their activations channel-wise, after spatially tiling the audio activations.
I don't understand this part of the paper. Could you please explain it?
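One way to read the quoted passage is as shape bookkeeping: the video branch shrinks the time axis by 4, the audio branch keeps striding until its time axis has the same length, and the audio features are then repeated over every spatial position so the two tensors can be stacked along the channel axis. The sketch below only tracks tensor shapes. All sizes, strides, channel counts, and function names are assumptions picked for illustration, not values from the paper or its code.

```python
def strided_len(n, stride):
    """Output length of a 'same'-padded strided conv or pool: ceil(n / stride)."""
    return -(-n // stride)


def video_subnet_shape(shape, temporal_strides=(2, 2), spatial_stride=4, out_channels=64):
    """Shape after a few 3D conv/pool layers (hypothetical strides and channels).

    Two temporal stride-2 stages reduce the frame rate by a factor of 4.
    """
    t, h, w, _ = shape
    for s in temporal_strides:
        t = strided_len(t, s)
    return (t, strided_len(h, spatial_stride), strided_len(w, spatial_stride), out_channels)


def audio_subnet_shape(shape, target_len, stride=4, out_channels=128):
    """Apply strided 1D convs until the time axis matches the video branch."""
    n, _ = shape
    while n > target_len:
        n = strided_len(n, stride)
    if n != target_len:
        raise ValueError("audio length cannot be matched with this stride")
    return (n, out_channels)


def tile_audio(audio_shape, video_shape):
    """Repeat the (time, channels) audio features at every spatial position."""
    t, c = audio_shape
    _, h, w, _ = video_shape
    return (t, h, w, c)


def fuse(video_shape, tiled_audio_shape):
    """Channel-wise concatenation: all axes except channels must agree."""
    if video_shape[:3] != tiled_audio_shape[:3]:
        raise ValueError("time/height/width must match before concatenation")
    return video_shape[:3] + (video_shape[3] + tiled_audio_shape[3],)


# Hypothetical inputs: 64 frames of 224x224 RGB, and 4096 mono audio samples.
video = video_subnet_shape((64, 224, 224, 3))   # time axis: 64 -> 16
audio = audio_subnet_shape((4096, 1), video[0])  # 4096 -> 1024 -> 256 -> 64 -> 16
tiled = tile_audio(audio, video)
fused = fuse(video, tiled)
print(video, audio, tiled, fused)
```

With these made-up sizes, both branches end up with 16 time steps, the audio vector at each time step is copied across the 56x56 grid, and the fused tensor has the video channels followed by the audio channels.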