kkoutini/PaSST

From ViT models to audio

Opened this issue · 7 comments

Hi Khaled,

In your code, there is the possibility to create a ViT architecture and load the corresponding pretrained weights (like "vit_tiny_patch16_224").

Do we agree that such architectures only work with similarly sized inputs (224×224, for example)? If so, how did you fine-tune a model on AudioSet that was initially trained on ImageNet (going from 224×224 to 128×998, for example)? Is this procedure in some code in your repo?

I read the AST paper, which I guess you took inspiration from, and they discuss this in some detail.
I was just wondering how I would do the whole process (ImageNet -> AudioSet -> ESC50) on my end.

Thanks a lot.

Antoine

Hi Antoine,

Yes, the code should support more architectures.

If the number of input channels is different, the input channels are averaged here and here.

If the input size is different (for example, 224×224 to 128×998), the only thing that changes is the positional embeddings; this is done here.
In short, the positional embeddings are interpolated to match the new size (similar to AST). After that, they are averaged over time/freq to produce freq/time positional embeddings.
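To make the idea concrete, here is a minimal sketch of that interpolate-then-average step, assuming the class/distillation tokens have already been stripped; `adapt_pos_embed` is a hypothetical helper, not the actual PaSST code:

```python
import torch
import torch.nn.functional as F

def adapt_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate ViT positional embeddings to a new patch grid
    (similar in spirit to AST/PaSST), then decompose them into
    separate frequency and time embeddings by averaging.

    pos_embed: (1, old_h * old_w, dim) patch positional embeddings.
    old_grid:  (old_h, old_w) patch grid of the pretrained model.
    new_grid:  (new_h, new_w) patch grid for the new input size.
    """
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    dim = pos_embed.shape[-1]
    # Reshape the flat token sequence back into the 2-D patch grid.
    grid = pos_embed.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    # Bilinearly interpolate to the new grid size.
    grid = F.interpolate(grid, size=(new_h, new_w),
                         mode="bilinear", align_corners=False)
    # Average over time to get frequency embeddings, and over
    # frequency to get time embeddings.
    freq_embed = grid.mean(dim=3, keepdim=True)  # (1, dim, new_h, 1)
    time_embed = grid.mean(dim=2, keepdim=True)  # (1, dim, 1, new_w)
    return freq_embed, time_embed

# Example: ImageNet ViT with 16x16 patches on 224x224 -> 14x14 grid;
# a 128x998 mel spectrogram might give, e.g., a 12x99 patch grid.
pos = torch.randn(1, 14 * 14, 768)
f_emb, t_emb = adapt_pos_embed(pos, (14, 14), (12, 99))
```

The grid sizes here are illustrative; the actual grid depends on the patch size and stride used by the model.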

Great, I'll have a look at all that!
Thanks a lot.

Regarding that question, I see in the code that the lists of available architectures differ between the get_model function, the default_cfg dictionary, and the architecture functions.

default_cfg seems to be the most exhaustive, but not every architecture in this dict is covered by get_model or has a dedicated function that calls _create_vision_transformer.
Is it just that you didn't test them all or didn't have time to implement everything, or is there another specific reason?

See below:
[screenshots of get_model, default_cfg, and the architecture functions]

Thanks a lot.

Hi,
I got the base code from the timm library, which has download links for different models. I then added the models that I trained one by one in the same fashion, with a link to download the weights. The missing ones are simply the ones I didn't use. However, I believe it should work if you add more ViT variants in the same way.

Hi Khaled,

Thanks for the answer.

Regarding your first reply in this thread, concerning the adaptation/averaging of input channels: why does the code use a sum over dim=1 instead of a mean? In adapt_input_conv here

I think you are right, a mean should work better.

Great, thanks for the confirmation!