Labbeti/conette-audio-captioning

Model is not learning on Clotho datasets

Closed this issue · 8 comments

Hello,

Thank you for the great work and implementation. I followed the instructions in the README exactly to train on the Clotho dataset with:
conette-train expt=[clotho_cnext_bl] pl=baseline
The only difference is that I specified cnext_bl_path=$HOME/.cache/torch/hub/checkpoints/convnext_tiny_465mAP_BL_AC_70kit.pth, which is the checkpoint that is downloaded when running prepare.

However, the metrics are much worse than reported. The FENSE score is 0.230, compared with 0.516 in the paper, and the SPIDEr score is 0.085, compared with 0.301. Do you know what might be the issue?

Thank you!

Hi!
Thanks for reporting this.
It seems that something is wrong with the pretrained Convnext model; maybe the downloaded file is wrong.
Did you train the AAC model for 400 epochs with the default hyperparameters?

Hello,
Thank you for your prompt response. Yes, I ran the AAC model with the default hyperparameters for 400 epochs.
I will test with your fix and let you know.

Hello,
I have tried training again on the dev branch after your fix. Unfortunately, the problem is still there. Both the validation and training losses are decreasing, but they remain high, so the metrics are still poor.
I am looking forward to hearing your thoughts about this. Thank you for your help.

Sorry, the commit only fixed an error occurring while loading Convnext, and I forgot you would have this message here. The problem hasn't been fully resolved yet, and I am still looking into it.

No worries. Thank you very much for your help. I am looking forward to the fix.

Hello,
Do you have any updates on this?
Also, do you know if the issue affects only the baseline model, or the CoNeTTE model too (e.g., training CoNeTTE on AC+CL+MA+WC, specialized for CL)?

The problem is linked to the CNext model not loading correctly, and the wrong checkpoint being loaded. This means that the pre-processed audio features in the HDF files are invalid, which has an impact on the CNext-trans and CoNeTTE models. I have fixed the loading, and I am now checking whether the audio features are properly calculated.
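
For anyone debugging a similar silent mis-load, one generic sanity check is to compare the parameter names the model expects with the names stored in the checkpoint. This is only a minimal sketch, not the fix applied in this repository: `diff_state_dict_keys` and the key names below are hypothetical and purely illustrative. In PyTorch, calling `load_state_dict(..., strict=False)` returns the same kind of information as `missing_keys` and `unexpected_keys`.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, both sorted.

    missing:    names the model expects but the checkpoint lacks.
    unexpected: names in the checkpoint that the model does not use.
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical key names, for illustration only (not the real CNext layout).
model_keys = ["stem.weight", "stem.bias", "head.weight"]
checkpoint_keys = ["model.stem.weight", "model.stem.bias", "head.weight"]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
if missing or unexpected:
    print("checkpoint does not match the model:")
    print("  missing:", missing)
    print("  unexpected:", unexpected)
```

If either list is non-empty, some weights keep their random initialization, which can produce a model that still trains but yields poor features.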

Hello,
Thank you very much. Yes! This seems to fix the issue.
I can confirm that training on CL and testing on CL gives a FENSE score of 0.508, compared to 0.516 in the paper, and a SPIDEr score of 0.296, compared to 0.301 in the paper, which is very close.
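
To put "very close" in numbers, here is a small sketch that computes the relative gap to the paper's scores, using only the values quoted in this thread. The helper name `relative_gap` is just for illustration.

```python
def relative_gap(reproduced, reported):
    """Signed difference to the reported score, in percent of the reported score."""
    return 100.0 * (reproduced - reported) / reported


# (reproduced, reported in the paper), as quoted in this thread.
results = {
    "FENSE before fix": (0.230, 0.516),
    "SPIDEr before fix": (0.085, 0.301),
    "FENSE after fix": (0.508, 0.516),
    "SPIDEr after fix": (0.296, 0.301),
}

for name, (reproduced, reported) in results.items():
    print(f"{name}: {relative_gap(reproduced, reported):+.1f}%")
```

After the fix, both metrics are within 2% of the reported values, whereas before the fix they were more than 50% lower.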

Thank you again. I am closing the issue for now.