From ViT models to audio
Opened this issue · 7 comments
Hi Khaled,
In your code, there is the possibility to create a ViT architecture and load the corresponding pretrained weights (like "vit_tiny_patch16_224").
Do we agree that such architectures only work with similarly sized inputs (224×224 for example)? If so, how did you finetune a model on AudioSet that was initially trained on ImageNet (going from 224×224 to 128×998 for example)? Is this procedure in some code in your repo?
I read the AST paper, which I guess you took inspiration from, and they discuss this in some detail.
I was just wondering how I would do the whole process (ImageNet -> AudioSet -> ESC50) on my end.
Thanks a lot.
Antoine
Hi Antoine,
Yes, the code should support more architectures.
If the input channels are different, the input channels are averaged here and here.
If the input size is different (for example, 224×224 to 128×998), the only thing that is changed is the positional embeddings; this is done here.
In short, the positional embeddings are interpolated to match the new size (similar to AST). After that, they are averaged over time/freq to produce freq/time positional embeddings.
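For reference, the interpolation step described above can be sketched as follows. This is a minimal illustration of the idea, not the repo's actual code: the function name and the 14×14 → 8×62 grid sizes (a 224×224 image vs. a 128×998 spectrogram with patch size 16) are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate ViT positional embeddings to a new patch grid.

    pos_embed: (1, 1 + old_h*old_w, dim), class token at index 0.
    """
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # reshape tokens back to a 2D grid, interpolate, then flatten again
    patch_pos = patch_pos.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_h, new_w),
                              mode="bilinear", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

# e.g. vit_tiny_patch16_224 grid (14x14) -> 128x998 mel spectrogram (8x62)
pos = torch.randn(1, 1 + 14 * 14, 192)
new_pos = resize_pos_embed(pos, (14, 14), (8, 62))
```

The separate freq/time positional embeddings mentioned above would then come from averaging this grid over the time and frequency axes, respectively.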
Great, I'll have a look at all that!
Thanks a lot.
Regarding that question, I see in the code that there are different lists of architectures available between the get_model function, the default_cfg dictionary, and the architecture functions. default_cfg seems to be the most exhaustive, but not every architecture in this dict is covered in get_model or has a dedicated function that calls _create_vision_transformer.
Is it just because you didn't test them all or didn't have time to implement everything or is there another specific reason?
Thanks a lot.
Hi,
I took the base code from the timm library, where it has links for different models. Then I added the models that I trained one by one in the same fashion, with a link to download the weights. The missing ones are the ones that I didn't use. However, I believe it should work if you add more ViT models in the same way.
Hi Khaled,
Thanks for the answer.
Regarding your first reply in this thread, concerning the adaptation/averaging of input channels: why is there a sum over dim=1 in the code instead of a mean? In adapt_input_conv here
I think you are right, mean should work better.
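To illustrate the point being discussed: collapsing a pretrained RGB patch-embedding kernel to one input channel with a mean keeps the activation scale roughly the same as in pretraining, whereas a sum scales it by ~3. This is a hypothetical sketch (the function name is made up), not the actual timm helper.

```python
import torch

def adapt_input_conv_mean(conv_weight, in_chans=1):
    """Collapse a pretrained RGB conv kernel (out_ch, 3, kh, kw) to
    `in_chans` input channels, preserving the output scale via a mean."""
    w = conv_weight.mean(dim=1, keepdim=True)      # (out_ch, 1, kh, kw)
    if in_chans > 1:
        # replicate across channels; divide so the total response is unchanged
        w = w.repeat(1, in_chans, 1, 1) / in_chans
    return w

# e.g. a vit_tiny_patch16 patch-embed kernel adapted for mono spectrograms
w_rgb = torch.randn(192, 3, 16, 16)
w_mono = adapt_input_conv_mean(w_rgb, in_chans=1)
```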
Great, thanks for the confirmation!