YapengTian/AVE-ECCV18

ValueError: could not broadcast input array from shape (10,6,4,512) into shape (10,128)

anandhupvr opened this issue · 4 comments

Hi, thanks for your great work.
While generating audio embeddings, the feature array in the code is allocated as np.zeros([len_data, 10, 128]), but the result from the VGGish network (i.e. the shape of embedding_tensor) is (10, 6, 4, 512).

For the input audio I converted the mp4 file into .wav, and the input_batch shape is (10, 96, 64).

Could you help me run the script correctly to generate results for my own video?
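For reference, here is a minimal NumPy reproduction of the error above (assumption: len_data = 1, i.e. a single video). The script pre-allocates one (10, 128) slot per video, but the unflattened conv-stack output has shape (10, 6, 4, 512), so the assignment fails:

```python
import numpy as np

# Pre-allocated audio feature buffer, as in the extraction script
# (assumption: len_data = 1 for a single video).
audio_features = np.zeros([1, 10, 128])

# Without a flatten before the fully-connected layers, the network
# emits the raw conv-stack output instead of a 128-d embedding.
embedding_tensor = np.zeros([10, 6, 4, 512])

try:
    audio_features[0] = embedding_tensor
except ValueError as err:
    print(err)  # could not broadcast input array from shape (10,6,4,512) into shape (10,128)
```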

Hi!

Please change the network in vggish_slim.py to:

# The VGG stack of alternating convolutions and max-pools.
net = slim.conv2d(net, 64, scope='conv1')
net = slim.max_pool2d(net, scope='pool1')
net = slim.conv2d(net, 128, scope='conv2')
net = slim.max_pool2d(net, scope='pool2')
net = slim.repeat(net, 2, slim.conv2d, 256, scope='conv3')
net = slim.max_pool2d(net, scope='pool3')
net = slim.repeat(net, 2, slim.conv2d, 512, scope='conv4')
net = slim.max_pool2d(net, scope='pool4')
# Flatten before entering fully-connected layers
net = slim.flatten(net)
net = slim.repeat(net, 2, slim.fully_connected, 4096, scope='fc1')
# The embedding layer.
net = slim.fully_connected(net, params.EMBEDDING_SIZE, scope='fc2')
return tf.identity(net, name='embedding')
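The slim.flatten call is what resolves the shape mismatch: each of the four max-pools halves the 96x64 log-mel input, leaving a (10, 6, 4, 512) tensor that must be flattened to (10, 12288) before the fully-connected layers can produce the (10, 128) embedding. A quick sanity check of the arithmetic:

```python
# Sanity check of the shapes: four stride-2 max-pools shrink the
# (96, 64) log-mel patch by a factor of 16 in each dimension.
h, w, channels = 96, 64, 512
for _ in range(4):          # pool1 .. pool4
    h, w = h // 2, w // 2
print((h, w, channels))     # (6, 4, 512) -> the conv output reported above
print(h * w * channels)     # 12288 units feeding the first fc layer
```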

thanks

No problem. I do not have any other scripts for testing, but it should be easy to modify my code to test other videos.