OFA-Sys/ONE-PEACE

Fine-grained alignment between audio and other modalities

Ming-er opened this issue · 5 comments

Hi, thanks for your work, really nice one!
I notice that the fine-grained alignment between vision and language can be verified by conducting experiments on corresponding tasks (such as referring image segmentation). However, I think the fine-grained alignment between audio and text might not be validated by the downstream tasks you chose, since AQA, audio classification, and audio-text retrieval are not sensitive to temporal order or temporal location. So, are there any results on low-level audio tasks such as sound event detection or audio grounding?

@Ming-er Thank you for your suggestion. We haven't conducted experiments on sound event detection or audio grounding. Could you provide links to the sound event detection and audio grounding datasets? We didn't explore them when running the audio experiments.

For SED (sound event detection), you could refer to https://github.com/DCASE-REPO/DESED_task, while for TAG (text-to-audio grounding), you could refer to https://github.com/wsntxxn/TextToAudioGrounding.
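In case it is useful: evaluating frame-level alignment on DESED mostly comes down to putting a per-frame event classifier on top of the audio encoder's temporal features. Below is a minimal sketch of such a probing head, assuming a hypothetical per-frame feature tensor; the feature dimension and the idea of an `encode_audio_frames`-style call are placeholders, not ONE-PEACE's actual interface.

```python
# Minimal sketch of frame-level probing for SED (not ONE-PEACE's actual API).
# Assumes some frozen audio encoder yields per-frame features of shape (B, T, D).
import torch
import torch.nn as nn

NUM_EVENTS = 10      # DESED defines 10 domestic sound event classes
FEATURE_DIM = 1536   # hypothetical encoder hidden size

class FrameLevelSEDHead(nn.Module):
    """Linear probe mapping per-frame audio features to event probabilities."""
    def __init__(self, feature_dim: int = FEATURE_DIM, num_events: int = NUM_EVENTS):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_events)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, frames, feature_dim)
        # returns per-frame event probabilities: (batch, frames, num_events)
        return torch.sigmoid(self.classifier(frame_features))

# Random features stand in for real encoder output in this sketch:
head = FrameLevelSEDHead()
dummy_frames = torch.randn(2, 250, FEATURE_DIM)  # e.g. 10 s of audio at 25 frames/s
frame_probs = head(dummy_frames)

# Frame-wise binary cross-entropy against strong (frame-level) labels;
# thresholding plus median filtering would then give event onsets/offsets.
strong_labels = torch.randint(0, 2, (2, 250, NUM_EVENTS)).float()
loss = nn.functional.binary_cross_entropy(frame_probs, strong_labels)
```

If the probe scores well here, that would be reasonable evidence that the audio features preserve temporal localization, which is exactly what the retrieval/classification tasks above cannot show.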

Thanks a lot! I will make time to conduct experiments on these two datasets.

Apologies for my delayed response. Over the past few months, I have been occupied with other projects and haven't had sufficient time to test these datasets thoroughly. I would greatly appreciate it if you could help me test them and provide the scores.

We are closing this issue for now. If relevant results become available later, we will update the repository with them.