mimbres/YourMT3

Question about transcribe only singing voice data

Closed this issue · 10 comments

Hello,

I am trying to train a model to transcribe only vocal data. I set the parameters as follows: '-tk' 'singing_v1' '-d' 'all_singing_v1', which are the task and the training data. However, I encountered an error in the model code at './amt/src/model/t5mod.py', line 633, 'b, k, t, d = inputs_embeds.size()': the tensor has only three dimensions, torch.Size([6, 1024, 512]).
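For context, the failure is a tuple-unpack mismatch: that line expects 4-D embeddings (batch, channels, time, dim), but it received a 3-D tensor. A minimal reproduction, outside the repo and for illustration only:

```python
import torch

# 3-D tensor with the shape from the reported error
inputs_embeds = torch.zeros(6, 1024, 512)

try:
    b, k, t, d = inputs_embeds.size()  # the t5mod.py line expects 4 dims
except ValueError as e:
    print(e)  # "not enough values to unpack (expected 4, got 3)"
```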

How should I modify this to train successfully? Should I set any other parameters?
Thanks!

Hi @Joanna1212

  • Can you show me all of your train.py options? That error seems to be related to the encoder/decoder type.

  • The singing_v1 task is an experimental option. It uses a singing prefix token, which is not covered in the paper. all_singing_v1 is also just for quick experimentation, with the sampling probability of the singing dataset increased.

```bash
args=('yourmt3_only_sing_voice_3' '-tk' 'singing_v1' '-d' 'all_singing_v1' '-dec' 'multi-t5' '-nl' '26' '-enc' 'perceiver-tf' '-sqr' '1' '-ff' 'moe' '-wf' '4' '-nmoe' '8' '-kmoe' '2' '-act' 'silu' '-epe' 'rope' '-rp' '1' '-ac' 'spec' '-hop' '300' '-atc' '1' '-pr' '16-mixed' '-bsz' '12' '12' '-st' 'ddp' '-se' '1000000')
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py "${args[@]}"
```
This way 👆!

I only want to transcribe the singing voice track (single-track prediction).

thanks!

I set "num_channels" in config.py from 13 to 1, and it seems to work. Let's try the training.

@Joanna1212
Sorry for the confusion about the task prefix. I looked further into the code, and in the current version, the 'singing_v1' task is no longer supported. We deprecated using prefix tokens for exclusive transcription of specific instruments, as it showed no performance benefit.

  • If you set num_channel=1 with a multi-channel T5 decoder, it will behave the same as a single-channel decoder. As mentioned earlier, it will not use any prefix tokens for singing-only transcription. For single-channel decoding, it is currently recommended to choose 't5' as the decoder type and 'mt3_full_plus' as the task.
  • When using a multi-channel decoder, it is recommended to use 'multi-t5' as the decoder type and 'mc13_full_plus_256' as the task.
  • The recommended approach for now is to transcribe everything and extract only the singing program (100) through post-processing, without modifying the code (see the sketch after this list). I'll provide an alternative in the next update through an "exclusive" task (as prototyped in exc_v1 of config/task.py).
  • About max iterations, I prefer adjusting -it over using -se or epoch-based counting, for better control of the cosine scheduler. See #2 (comment)
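A minimal post-processing sketch of the approach above: keep only the singing program (100) from a transcribed multi-track MIDI. This assumes pretty_midi is installed; the file names are placeholders.

```python
import pretty_midi

SINGING_PROGRAM = 100  # singing voice program number, as noted above

# Load the transcribed multi-track MIDI (placeholder file name).
pm = pretty_midi.PrettyMIDI("transcribed.mid")

# Keep only non-drum tracks whose program is the singing program.
pm.instruments = [
    inst for inst in pm.instruments
    if not inst.is_drum and inst.program == SINGING_PROGRAM
]

pm.write("singing_only.mid")
```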

Thank you for your detailed response. I'll try training your final model.
Extracting the singing track (100) through post-processing is very easy. I have already completed it.

However, I noticed some minor singing-voice errors on some pop music (as you mentioned in your paper). Therefore, I hope to supplement the training with some vocal transcription data to improve the accuracy of vocal transcription.

The dataset I want to add consists of complete songs (vocals mixed with accompaniment, plus a separated vocal track) and the corresponding vocal MIDI, just this one track.
I noticed you only use the vocal tracks of MIR-ST500 and CMedia.
Do you think using plenty of converted_Mixture.wav files could be better than just online augmentation? 🫡

Perhaps I should add vocal datasets to the current datasets in "all_cross_final",
continuously adding separated vocal datasets like mir_st500_voc,
and keep the "mc13_full_plus_256" task with the multi-channel decoder,

or complete the "p_include_singing" part (probability of including singing for cross-augmented examples).

Maybe this would enhance vocal performance based on multi-track transcription?

I noticed that you used temperature-based sampling in your paper to determine the proportions of each dataset.

For my scenario, where I am only interested in vocals, do you think I should raise the sampling proportion of the singing-voice datasets (MIR-ST500, CMedia)?

Additionally, you mentioned, 'We identified the dataset most prone to over-fitting, as shown by its validation loss curve.' Did you train each dataset separately to observe this, or did you observe the validation results of individual datasets during the overall training?
Thanks!

@Joanna1212

Do you think using plenty of converted_Mixture.wav files could be better than just online augmentation?

This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.

"all_cross_final"

Yes, I recommend modifying all_cross_final in data_preset.py. For example:

 "all_cross_final": {
        "presets": [
            ...
           `YOUR_DATASET_NAME`
        ],
       "weights": [..., `YOUR_SAMPLING_WEIGHT`],
       "eval_vocab": [..., SINGING_SOLO_CLASS],
       ...

I noticed that you used temperature-based sampling...

The main point of our paper is that exact temperature-based sampling (as in the original MT3) significantly degrades performance. See more details in Appendix G (not F; 😬 found a typo). However, if the datasets are of similar quality, you can weight them proportionally. For example, if your custom singing data is similar in size to MIR-ST500, assign them similar weights. It’s okay if the total sum of the added weights exceeds 1.
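For illustration only (not code from the repo): temperature-based sampling draws dataset i with probability proportional to n_i^(1/τ), so τ = 1 reduces to proportional-to-size weighting and larger τ flattens the distribution toward uniform. The dataset sizes below are made up.

```python
def sampling_probs(sizes, tau=1.0):
    """Temperature-based sampling: p_i proportional to n_i ** (1 / tau)."""
    scaled = {name: n ** (1.0 / tau) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

# Hypothetical track counts, including a custom singing dataset.
sizes = {"mir_st500": 400, "cmedia": 100, "your_singing_data": 380}

print(sampling_probs(sizes, tau=1.0))  # proportional to dataset size
print(sampling_probs(sizes, tau=5.0))  # much flatter, near-uniform
```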

did you observe the validation results of individual dataset...

Yes. In wandb logger, dataloader_idx is in the same order as the datasets defined in the data_preset.
[Screenshot: per-dataset validation loss curves in wandb, 2024-09-11]
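A toy illustration of that mapping (dataset names here are hypothetical; the real order comes from the presets list of your data preset):

```python
# With multiple validation dataloaders, PyTorch Lightning suffixes each
# logged metric with `/dataloader_idx_N`; N follows the presets order.
presets = ["mir_st500_voc", "cmedia_voc", "your_singing_data"]

for idx, name in enumerate(presets):
    print(f"val_loss/dataloader_idx_{idx} -> {name}")
```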

thanks, I'll try this with more vocal data.
I understand your explanation about the wandb logger. Thank you for your response and advice.

This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.

I tried adding some vocal data. Initially, the metrics showed a slight improvement, but training soon hit a gradient explosion. The metrics were slightly better on cmedia and mir_st500. 👍

BTW, please notify me if there is an update 😄. Thanks!