Question about transcribing only singing voice data
Hello,
I am trying to train a model to transcribe only vocal data. I set the parameters as follows: `-tk singing_v1 -d all_singing_v1` (the task and training data). However, I hit an error in the model code at `./amt/src/model/t5mod.py`, line 633, `b, k, t, d = inputs_embeds.size()`: `inputs_embeds` has only three dimensions, `torch.Size([6, 1024, 512])`.
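For reference, here is a minimal sketch of the failing unpack (the shape is copied from my error message; everything else is illustrative, and presumably `b, k, t, d` stands for batch/channel/time/dim):

```python
import torch

# inputs_embeds is only 3-D here, so the 4-way unpack in t5mod.py fails.
inputs_embeds = torch.randn(6, 1024, 512)   # torch.Size([6, 1024, 512])
b, k, t, d = inputs_embeds.size()           # ValueError: not enough values to unpack
```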
How should I modify this to train successfully? Should I set any other parameters?
Thanks!
Hi @Joanna1212
- Can you show me all of your `train.py` options? That error seems to be related to the encoder/decoder type?
- The `singing_v1` task is an experimental option. It uses a `singing` prefix token, which is not covered in the paper. `all_singing_v1` is also just for quick experimentation, with the sampling probability of the singing dataset increased.
```bash
args=('yourmt3_only_sing_voice_3' '-tk' 'singing_v1' '-d' 'all_singing_v1' '-dec' 'multi-t5' '-nl' '26' '-enc' 'perceiver-tf' '-sqr' '1' '-ff' 'moe' '-wf' '4' '-nmoe' '8' '-kmoe' '2' '-act' 'silu' '-epe' 'rope' '-rp' '1' '-ac' 'spec' '-hop' '300' '-atc' '1' '-pr' '16-mixed' '-bsz' '12' '12' '-st' 'ddp' '-se' '1000000')
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py "${args[@]}"
```
This way 👆!
I only want to transcribe the singing voice track (single-track prediction).
thanks!
I changed `"num_channels"` in `config.py` from 13 to 1, and it seems to work. Let me try training with that.
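Just to sanity-check my understanding of the shapes (generic PyTorch, not the repo's actual code path):

```python
import torch

# Once a channel axis exists, even a single channel satisfies the 4-way unpack.
x = torch.randn(6, 1024, 512)   # (batch, time, dim)
x4 = x.unsqueeze(1)             # torch.Size([6, 1, 1024, 512])
b, k, t, d = x4.size()          # unpacks fine, k == 1
```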
@Joanna1212
Sorry for the confusion about the task prefix. I looked further into the code, and in the current version the 'singing_v1' task is no longer supported. We deprecated prefix tokens for exclusive transcription of specific instruments because they showed no performance benefit.
- If you set `num_channel=1` with a multi-channel T5 decoder, it will behave the same as a single-channel decoder. As mentioned earlier, it will not use any prefix tokens for singing-only transcription. Currently it is recommended to choose 't5' as the decoder type and 'mt3_full_plus' as the task for single-channel decoding.
- When using a multi-channel decoder, it is recommended to use `multi-t5` as the decoder type and `mc13_full_plus_256` as the task.
- The recommended approach for now is to transcribe only singing by extracting the singing program (100) through post-processing, without modifying the code (see the sketch after this list). I'll provide an alternative in the next update through an "exclusive" task (as prototyped in `exc_v1` of `config/task.py`).
- About max iterations, I prefer adjusting `-it` over using `-se` or epoch-based counting, to better manage the cosine scheduler. See #2 (comment)
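For example, a minimal post-processing sketch (not code from this repo; the file name is a placeholder, and it assumes the transcribed MIDI puts singing notes on program-100 instrument tracks):

```python
import pretty_midi

SINGING_PROGRAM = 100  # the singing program mentioned above

# Keep only the singing-voice tracks from a transcribed MIDI file and drop the rest.
pm = pretty_midi.PrettyMIDI("transcription.mid")
pm.instruments = [
    inst for inst in pm.instruments
    if inst.program == SINGING_PROGRAM and not inst.is_drum
]
pm.write("transcription_vocals_only.mid")
```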
Thank you for your detailed response. I'll try training your final model.
Extracting the singing track (100) through post-processing is very easy. I have already completed it.
However, I noticed some minor singing-voice errors on some pop music (as you mentioned in your paper). Therefore, I'd like to add some vocal transcription data to improve the accuracy of vocal transcription.
The dataset I want to add consists of complete songs (the mixture of vocals and accompaniment, plus the separated vocal track) and the corresponding vocal MIDI, just this one track.
I notice you only use the vocal tracks of MIR-ST500 and CMedia.
Do you think using plenty of converted_Mixture.wav files could be better than just online augmentation? 🫡
Perhaps I should add vocal datasets to the current "all_cross_final" preset, continuing to add separated vocal datasets like mir_st500_voc, and keep the "mc13_full_plus_256" task with the multi-channel decoder, or perhaps work on the "p_include_singing" part (the probability of including singing in cross-augmented examples). Maybe this would enhance vocal performance based on multi-track transcription?
I noticed that you used temperature-based sampling in your paper to determine the proportions of each dataset.
For my scenario, where I am only interested in vocals, do you think I should increase the proportion of the singing voice datasets (MIR-ST500, CMedia)?
Additionally, you mentioned, 'We identified the dataset most prone to over-fitting, as shown by its validation loss curve.' Did you train each dataset separately to observe this, or did you observe the validation results of individual datasets during the overall training?
Thanks!
> Do you think using plenty of converted_Mixture.wav files could be better than just online augmentation?
This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.
"all_cross_final"
Yes, I recommend modifying `all_cross_final` in `data_preset.py`. For example:
"all_cross_final": {
"presets": [
...
`YOUR_DATASET_NAME`
],
"weights": [..., `YOUR_SAMPLING_WEIGHT`],
"eval_vocab": [..., SINGING_SOLO_CLASS],
...
> I noticed that you used temperature-based sampling...
The main point of our paper is that exact temperature-based sampling (of the original MT3) significantly degrades performance. See more details in Appendix G
(not F; 😬 found a typo). However, if the datasets are of similar quality, you can weight them proportionally. For example, if your custom singing data is similar in size to MIR-ST500, assign them similar weights. It's okay if the total sum of the added weights exceeds 1.
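To make the weighting point concrete, here is an illustrative snippet (not the paper's implementation) of how temperature-based sampling probabilities relate to plain proportional weights; the dataset sizes are made up, and only the relative values of the weights matter once they are normalized into probabilities:

```python
import numpy as np

def sampling_probs(sizes, tau=1.0):
    """p_i ∝ n_i**(1/tau); tau=1 is proportional, larger tau flattens the distribution."""
    w = np.asarray(sizes, dtype=float) ** (1.0 / tau)
    return w / w.sum()

sizes = [500, 1500, 8000]                  # made-up dataset sizes
print(sampling_probs(sizes, tau=1.0))      # [0.05 0.15 0.8 ]  proportional to size
print(sampling_probs(sizes, tau=3.0))      # flatter: small datasets get sampled much more often
```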
> Did you observe the validation results of individual datasets...
Yes. In the `wandb` logger, `dataloader_idx` is in the same order as the datasets defined in the data preset.
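As a rough sketch of where those per-dataset numbers come from (generic PyTorch Lightning toy code, not this repo's trainer): with a list of validation dataloaders, each metric logged in `validation_step` gets a `/dataloader_idx_N` suffix, and N follows the order of that list.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl  # assumption: Lightning 2.x; older code imports pytorch_lightning

class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        x, y = batch
        # Logged as val_loss/dataloader_idx_0, val_loss/dataloader_idx_1, ...
        # following the order of the val dataloaders (i.e. the data preset order).
        self.log("val_loss", nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

def toy_loader():
    return DataLoader(TensorDataset(torch.randn(32, 8), torch.randn(32, 1)), batch_size=8)

trainer = pl.Trainer(max_epochs=1, enable_checkpointing=False)
trainer.fit(ToyModel(), train_dataloaders=toy_loader(),
            val_dataloaders=[toy_loader(), toy_loader()])
```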
Thanks, I'll try this with more vocal data.
I understand your explanation about the wandb logger. Thank you for your response and advice.
> This pre-release code lacks the unannotated instrument masking (for training) feature, which will be added in an update later this month. I've seen a 1-2% performance improvement, which could be higher with more data.
I tried adding some vocal data. Initially, the metrics showed a slight improvement, but soon there was a gradient explosion. The metrics were slightly better on cmedia and mir_st500. 👍
BTW, please notify me if there is an update 😄. Thanks!