kkoutini/PaSST

From ViT models to audio

Opened this issue · 7 comments

Hi Khaled,

In your code, there is the possibility to create a ViT architecture and load the corresponding pretrained weights (like "vit_tiny_patch16_224").

Do we agree that such architectures only work with similarly sized inputs (224×224, for example)? If so, how did you fine-tune a model on AudioSet that was initially trained on ImageNet (going from 224×224 to 128×998, for example)? Is this procedure in some code in your repo?

I read the AST paper, which I guess you took inspiration from, and they discuss this in some detail.
I was just wondering how I would do the whole process (ImageNet -> AudioSet -> ESC50) on my end.

Thanks a lot.

Antoine

Hi Antoine,

Yes, the code should support more architectures.

If the number of input channels is different, the input channels are averaged here and here.

If the input size is different (for example, 224×224 to 128×998), the only thing that changes is the positional embeddings; this is done here.
In short, the positional embeddings are interpolated to match the new size (similar to AST). After that, they are averaged over time/freq to produce freq/time positional embeddings.
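To make the idea concrete, here is a minimal sketch of that interpolate-then-average step, assuming the class/distillation tokens have already been stripped; `adapt_pos_embed` is a hypothetical helper, not the actual PaSST code:

```python
import torch
import torch.nn.functional as F

def adapt_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate ViT positional embeddings to a new patch grid
    (similar in spirit to AST/PaSST), then decompose them into
    separate frequency and time embeddings by averaging.

    pos_embed: (1, old_h * old_w, dim) patch positional embeddings.
    old_grid:  (old_h, old_w) patch grid of the pretrained model.
    new_grid:  (new_h, new_w) patch grid for the new input size.
    """
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    dim = pos_embed.shape[-1]
    # Reshape the flat token sequence back into the 2-D patch grid.
    grid = pos_embed.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    # Bilinearly interpolate to the new grid size.
    grid = F.interpolate(grid, size=(new_h, new_w),
                         mode="bilinear", align_corners=False)
    # Average over time to get frequency embeddings, and over
    # frequency to get time embeddings.
    freq_embed = grid.mean(dim=3, keepdim=True)  # (1, dim, new_h, 1)
    time_embed = grid.mean(dim=2, keepdim=True)  # (1, dim, 1, new_w)
    return freq_embed, time_embed

# Example: ImageNet ViT with 16x16 patches on 224x224 -> 14x14 grid;
# a 128x998 mel spectrogram might give, e.g., a 12x99 patch grid.
pos = torch.randn(1, 14 * 14, 768)
f_emb, t_emb = adapt_pos_embed(pos, (14, 14), (12, 99))
```

The grid sizes here are illustrative; the actual grid depends on the patch size and stride used by the model.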

Great, I'll have a look at all that!
Thanks a lot.

Regarding that question, I see in the code that the lists of available architectures differ between the get_model function, the default_cfg dictionary, and the architecture functions.

default_cfg seems to be the most exhaustive, but not every architecture in this dict is covered by get_model or has a dedicated function that calls _create_vision_transformer.
Is it just that you didn't test them all or didn't have time to implement everything, or is there another specific reason?

See below:
[screenshots of get_model, default_cfg, and the architecture functions]

Thanks a lot.

Hi,
I got the base code from the timm library, which has download links for different models. I then added the models that I trained one by one in the same fashion, with a link to download the weights. The missing ones are simply the ones I didn't use. However, I believe it should work if you add more ViT variants in the same way.

Hi Khaled,

Thanks for the answer.

Regarding your first reply in this thread, concerning the adaptation/averaging of input channels: why does the code use a sum over dim=1 instead of a mean? In adapt_input_conv here

I think you are right, a mean should work better.

Great, thanks for the confirmation!