sooftware/conformer

Conformer parameter count doesn't match the paper

maxwellzh opened this issue · 4 comments

In the original Conformer paper, the reported parameter counts are:
[Screenshot of Table 1 from the paper: Conformer (S) 10.3 M, Conformer (M) 30.7 M, Conformer (L) 118.8 M parameters]

However, with the implementation in this repo, the parameter counts come out slightly different:

Conformer  small: 10.16 M
Conformer medium: 31.86 M
Conformer  large: 120.11 M

I computed the sizes with the following script:

from conformer import Conformer


def count_parameters(model) -> int:
    # Count only trainable (requires_grad) parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Hyper-parameters follow Table 1 of the paper; the only deviation is
# conv_kernel_size=31 instead of 32 (see the note after the script).
models = {
    'small': Conformer(
        num_classes=1000,
        input_dim=80,
        encoder_dim=144,
        decoder_dim=320,
        num_encoder_layers=16,
        num_decoder_layers=1,
        num_attention_heads=4,
        conv_kernel_size=31
    ),
    'medium': Conformer(
        num_classes=1000,
        input_dim=80,
        encoder_dim=256,
        decoder_dim=640,
        num_encoder_layers=16,
        num_decoder_layers=1,
        num_attention_heads=4,
        conv_kernel_size=31
    ),
    'large': Conformer(
        num_classes=1000,
        input_dim=80,
        encoder_dim=512,
        decoder_dim=640,
        num_encoder_layers=17,
        num_decoder_layers=1,
        num_attention_heads=8,
        conv_kernel_size=31
    )
}

for size, m in models.items():
    print("Conformer {:>6}: {:.2f} M".format(size, count_parameters(m)/1e6))

Since the convolution kernel size couldn't be set to 32 in this implementation, I set it to 31. But that alone shouldn't make such a difference in the parameter count.
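
To sanity-check that, here is a rough estimate (assuming the conv module is a plain depthwise nn.Conv1d, which may not exactly match this repo's internals): going from kernel size 32 to 31 removes only one weight per channel per layer.

import torch.nn as nn

dim = 512  # encoder_dim of Conformer (L)
k31 = nn.Conv1d(dim, dim, kernel_size=31, groups=dim, bias=False)
k32 = nn.Conv1d(dim, dim, kernel_size=32, groups=dim, bias=False)

# A depthwise conv weight has shape (dim, 1, kernel_size), so the gap is exactly dim.
gap = sum(p.numel() for p in k32.parameters()) - sum(p.numel() for p in k31.parameters())
print(gap)             # 512 per layer
print(gap * 17 / 1e6)  # ~0.009 M across 17 layers -- negligible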

This is not an official implementation, so there is a slight difference in the number of parameters.
Of course, I tried to implement it as closely as possible to what is described in the paper. :)

Also, num_classes affects the count.
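
For a rough sense of scale (assuming the classifier head is a plain nn.Linear over decoder_dim; num_classes may also enter a decoder embedding): at decoder_dim=640, every 1,000 classes costs about 0.64 M parameters.

import torch.nn as nn

proj = nn.Linear(640, 1000)  # decoder_dim=640, num_classes=1000
# weight (1000 x 640) plus bias (1000) = 641,000 parameters
print(sum(p.numel() for p in proj.parameters()) / 1e6)  # 0.641 M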

This is kind of weird. I have tested several open-source Conformer implementations (and implemented one myself), but none of them strictly matches the reported parameter counts. Do you have any idea where the difference might come from?
By the way, num_classes is set to 1k following the paper.
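
One way to localize the gap could be to break the count down per top-level submodule and compare across implementations. A minimal sketch, reusing the constructor arguments from the script above:

from collections import defaultdict

from conformer import Conformer

model = Conformer(
    num_classes=1000,
    input_dim=80,
    encoder_dim=144,
    decoder_dim=320,
    num_encoder_layers=16,
    num_decoder_layers=1,
    num_attention_heads=4,
    conv_kernel_size=31
)

# Aggregate parameter counts by the first component of each parameter name
# (e.g. 'encoder' vs. 'decoder') to see which part drifts from the paper.
counts = defaultdict(int)
for name, p in model.named_parameters():
    counts[name.split('.')[0]] += p.numel()

for module, n in sorted(counts.items()):
    print(f"{module}: {n / 1e6:.2f} M")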

I'm curious, too. I can only speculate that there are details not mentioned in the paper.