model architecture

Question

riyaj8888 opened this issue 3 years ago · 1 comments

Can anyone briefly explain how the audio and video features are fused together?
Please use the image above, which is from the original paper, as a reference.

Answer 1 · 2021-10-13T06:52:03.000Z

From the paper: "We apply a small number of 3D convolution and pooling operations to the video stream, reducing its temporal sampling rate by a factor of 4. We also apply a series of strided 1D convolutions to the input waveform, until its sampling rate matches that of the video network. We fuse the two subnetworks by concatenating their activations channel-wise, after spatially tiling the audio activations."

I don't understand this part of the paper. Could someone please explain it?
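
For what it's worth, here is a minimal NumPy sketch of how the fusion step in the quoted passage could be read. It is not the authors' code. The function name `fuse_audio_video`, the channel-last layout, and all shapes below are assumptions made up for illustration. The idea: once both streams have the same number of time steps, the audio features have no spatial dimensions, so each audio vector is copied ("tiled") to every spatial position of the video feature map, and the two are then concatenated along the channel axis.

```python
import numpy as np


def fuse_audio_video(video_feats, audio_feats):
    """Tile audio features over space, then concatenate channel-wise.

    video_feats: array of shape (T, H, W, C_v)
    audio_feats: array of shape (T, C_a), same T as the video stream
    returns:     array of shape (T, H, W, C_v + C_a)
    """
    t, h, w, _ = video_feats.shape
    if audio_feats.shape[0] != t:
        raise ValueError("audio and video must have the same number of time steps")

    c_a = audio_feats.shape[1]
    # (T, C_a) -> (T, 1, 1, C_a) -> (T, H, W, C_a): the same audio vector
    # is repeated at every spatial location of the corresponding time step.
    tiled_audio = np.broadcast_to(audio_feats[:, None, None, :], (t, h, w, c_a))

    # Channel-wise concatenation: video channels first, audio channels after.
    return np.concatenate([video_feats, tiled_audio], axis=-1)


# Hypothetical shapes, chosen only for illustration (not taken from the paper).
video_feats = np.random.rand(16, 28, 28, 64)   # T=16, H=W=28, C_v=64
audio_feats = np.random.rand(16, 128)          # T=16, C_a=128

fused = fuse_audio_video(video_feats, audio_feats)
print(fused.shape)  # (16, 28, 28, 192)
```

In this reading, the earlier steps in the quote (3D convolution and pooling on the video, strided 1D convolutions on the waveform) exist to make `T` match between the two streams, which is what allows the concatenation at the end.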