a-nagrani/VGGVox

number of filters in conv3 layer?

paulfitz opened this issue · 6 comments

I was looking at reimplementing the model from the VoxCeleb paper and cross-checking it against the setup in this repo. In the paper, conv3 has 256 filters, whereas in http://www.robots.ox.ac.uk/~vgg/data/voxceleb/models/vggvox_ident_net.mat it appears to have 384. Did you find that bigger was better for this layer and dataset?

Thanks - big fan of the VoxCeleb paper, great work :-)

Hi, sorry for the late response. There was a typo in the original version of the paper; it has now been fixed: https://arxiv.org/pdf/1706.08612.pdf. Thank you!

The number of filters in the pretrained model (384 for conv3) is correct.
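For anyone comparing the two variants, here is a rough sketch of how much the conv3 width changes the weight count. It assumes plain 3x3 convolutions with a bias term and 256 channels coming out of conv2, as in the thread; `conv_params` is a hypothetical helper, not something from this repo.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus biases of a standard 2D convolution layer."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

# conv3 takes the 256 channels produced by conv2.
conv3_paper = conv_params(3, 3, 256, 256)       # 256 filters (original paper text)
conv3_pretrained = conv_params(3, 3, 256, 384)  # 384 filters (pretrained model)

# conv4 always has 256 filters, but its input depth follows conv3.
conv4_paper = conv_params(3, 3, 256, 256)
conv4_pretrained = conv_params(3, 3, 384, 256)

extra = (conv3_pretrained + conv4_pretrained) - (conv3_paper + conv4_paper)
print("conv3:", conv3_paper, "vs", conv3_pretrained)
print("conv4:", conv4_paper, "vs", conv4_pretrained)
print("extra parameters with 384 filters:", extra)
```

Batch-norm parameters are left out, so treat the totals as approximate.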

Hi @paulfitz, have you implemented the paper?

I did implement the model, yes (in Keras). I didn't use it with the original dataset though.

Thank you for your prompt response. Is your code available in a repository or something?

No, that code isn't available. From the paper and pretrained model, I'd reconstruct the model in Keras as something along the lines of:

from keras import backend as K
from keras.layers import (Activation, BatchNormalization, Conv2D, Dense,
                          Dropout, GlobalAveragePooling2D, Input, Lambda,
                          MaxPooling2D)
from keras.models import Model

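def drop(f):
    # ReLU followed by 50% dropout (used after fc6 and fc7)
    f = Activation('relu')(f)
    f = Dropout(0.5)(f)
    return f

def bn(f):
    # Batch normalization followed by ReLU (used after each conv layer)
    f = BatchNormalization()(f)
    f = Activation('relu')(f)
    return f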

def make_vox_model(freqs, times):
    # Input is a (freqs, times) spectrogram; add a trailing channel axis
    f = inputs = Input(shape=(freqs, times))
    f = Lambda(K.expand_dims)(f)
    f = Conv2D(96, (7, 7), strides=2, name="conv1")(f)
    f = bn(f)
    f = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), name="mpool1")(f)
    f = Conv2D(256, (5, 5), strides=2, name="conv2")(f)
    f = bn(f)
    f = MaxPooling2D(pool_size=(3, 3), strides=(2, 2), name="mpool2")(f)
    # 384 in the pretrained model; the 256 in the original paper was a typo (since fixed)
    f = Conv2D(384, (3, 3), padding='same', name="conv3")(f)
    f = bn(f)
    f = Conv2D(256, (3, 3), padding='same', name="conv4")(f)
    f = bn(f)
    f = Conv2D(256, (3, 3), padding='same', name="conv5")(f)
    f = bn(f)
    f = MaxPooling2D(pool_size=(5, 3), strides=(3, 2), name="mpool5")(f)
    f = Conv2D(4096, (9, 1), name="fc6")(f)
    f = drop(f)
    f = GlobalAveragePooling2D(data_format='channels_last')(f)
    f = Dense(1024, name="fc7")(f)
    f = drop(f)
    # 1251 in paper, 1300 in published model?
    f = Dense(1251, name="fc8")(f)
    f = Activation('softmax')(f)
    return Model(inputs=inputs, outputs=f)

if __name__ == '__main__':
    make_vox_model(512, 300).summary()

Of course I could easily be wrong in some nuance.
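One nuance that can be checked without installing Keras is whether the (9, 1) kernel in fc6 actually fits the feature map for a 512 x 300 input. The sketch below traces the spatial sizes by hand, assuming Keras defaults ('valid' padding unless 'same' is given, floor division on strides). The `LAYERS` table and helper names are only for illustration and mirror the layer settings in the code above.

```python
def out_size(size, kernel, stride=1, same=False):
    """Output length along one axis for a conv or pooling layer."""
    if same:
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1

# (name, (kernel_freq, kernel_time), (stride_freq, stride_time), same_padding)
LAYERS = [
    ("conv1", (7, 7), (2, 2), False),
    ("mpool1", (3, 3), (2, 2), False),
    ("conv2", (5, 5), (2, 2), False),
    ("mpool2", (3, 3), (2, 2), False),
    ("conv3", (3, 3), (1, 1), True),
    ("conv4", (3, 3), (1, 1), True),
    ("conv5", (3, 3), (1, 1), True),
    ("mpool5", (5, 3), (3, 2), False),
    ("fc6", (9, 1), (1, 1), False),
]

def trace_shapes(freqs, times):
    """Return [(layer_name, freq_size, time_size), ...] for each layer."""
    shapes = []
    f, t = freqs, times
    for name, (kf, kt), (sf, st), same in LAYERS:
        f = out_size(f, kf, sf, same)
        t = out_size(t, kt, st, same)
        shapes.append((name, f, t))
    return shapes

if __name__ == '__main__':
    for name, f, t in trace_shapes(512, 300):
        print(f"{name:7s} {f:4d} x {t:d}")
```

With these settings the frequency axis is 9 going into fc6, so the (9, 1) kernel collapses it to 1 and the global average pooling then averages over the remaining time steps. The original MatConvNet model may pad conv1 and conv2 differently, so intermediate sizes there could differ from this trace.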

Thank you so much