Training of BS-RoFormer

Question

ZFTurbo opened this issue a year ago · 36 comments

I tried to train this neural net without any success. SDR gets stuck at around 2.1 for vocals and never improves. If somebody has better results, please let me know.

Answer 1 · 2023-09-25T18:26:53.000Z

@ZFTurbo did you follow the band splitting hyperparameters as in the paper?

Answer 2 · 2023-09-25T21:41:14.000Z

Unfortunately, no. I don't really understand how the authors do the band split. Also, it must be done inside the NN code, which I left unchanged.

In the standard band-split method it is very easy. A large plane is split by frequency into several planes of the same size, and you treat them like image channels: (4096, 512) -> (8, 512, 512). Each channel represents the frequencies of some range.
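
The even split described here amounts to a reshape along the frequency axis. A minimal NumPy sketch, assuming the spectrogram is laid out as (frequency bins, time frames); the array name and sizes are only for illustration:

```python
import numpy as np

# Hypothetical spectrogram: 4096 frequency bins x 512 time frames.
spec = np.arange(4096 * 512, dtype=np.float32).reshape(4096, 512)

# Split the frequency axis into 8 equal bands of 512 bins each.
# Band k holds frequency bins [512 * k, 512 * (k + 1)).
num_bands = 8
bands = spec.reshape(num_bands, spec.shape[0] // num_bands, spec.shape[1])

print(bands.shape)  # (8, 512, 512)
```

Each of the 8 "channels" is a contiguous slice of the original frequency axis, which is why every band has the same width in this scheme.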

Answer 3 · 2023-09-25T21:46:57.000Z

From paper.

We use the following band-split scheme: 2 bins per band for frequencies under 1000 Hz, 4 bins per band between 1000 Hz and 2000 Hz, 12 bins per band between 2000 Hz and 4000 Hz, 24 bins per band between 4000 Hz and 8000 Hz, 48 bins per band between 8000 Hz and 16000 Hz, and the remaining bins beyond 16000 Hz equally divided into two bands. This results in a total of 62 bands. All bands are non-overlapping.

I don't understand it in terms of tensors.

Answer 4 · 2023-09-25T21:51:41.000Z

I think they use the same method as in BSRNN neural net for BandSplit:

https://github.com/sungwon23/BSRNN/blob/main/module.py#L73

Answer 5 · 2023-09-25T21:54:37.000Z

yea, I'll get around to setting the proper band frequency hyperparameters and you can give it another go

I'm growing interested in this technique, as I think it can be applied for medical segmentation

Answer 6 · 2023-09-25T21:55:51.000Z

yes, you would need to set it according to the band-split scheme quoted from the paper above. I was also planning on adding the overlapping frequency bands they mentioned in the section on things they wanted to try next

Answer 8 · 2023-09-25T21:57:59.000Z

yes, it is all built.

you need to have the correct tuple of 62 integers here https://github.com/lucidrains/BS-RoFormer/blob/main/bs_roformer/bs_roformer.py#L222

Answer 9 · 2023-09-25T22:03:02.000Z

So I need to make a tuple which consists of 62 integers? What do these integers mean? Are they ranges in Hz?

For example, the first 6, if I follow the paper: (500, 1000, 1250, 1500, 1750, 2000, ...)?

UPD: I think it's not in Hz. It's 62 numbers from 0 up to (FFT size / 2).

Answer 10 · 2023-09-25T22:04:46.000Z

yes, I believe it is exactly ranges of freqs in order of low to high frequencies. however I have not sat down and worked out that section of the paper yet

Answer 11 · 2023-09-25T22:17:00.000Z

Also: 2 + 4 + 12 + 24 + 48 + N > 62. Maybe I don't understand what 62 is.

Answer 12 · 2023-09-25T22:37:40.000Z

haha, I honestly haven't sat down and worked it out myself yet. but it is ok, since I think the big idea is just uneven splits across frequencies, with each band projected to a token by its own MLP

you should just try a naive even split of frequencies by 64 and run it again, before going for the precise breakdown

Answer 13 · 2023-09-25T22:38:19.000Z

at the moment, you are doing attention of 2 tokens across the frequency axis, which does basically nothing

Answer 14 · 2023-09-25T22:53:57.000Z

I'm trying this now:

from bs_roformer import BSRoformer

model = BSRoformer(
    stereo=True,
    dim=256,
    depth=12,
    time_transformer_depth=2,
    freq_transformer_depth=2,
    # 31 bands of 32 bins plus one band of 33 bins
    freqs_per_bands=(
        32, 32, 32, 32, 32, 32, 32, 32,
        32, 32, 32, 32, 32, 32, 32, 32,
        32, 32, 32, 32, 32, 32, 32, 32,
        32, 32, 32, 32, 32, 32, 32, 33,
    )
)
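
A quick sanity check for a tuple like this one, sketched under the assumption that the STFT uses n_fft = 2048 and that the band widths have to add up to the number of one-sided STFT bins. The helper name `check_freqs_per_bands` is made up for this example and is not part of the library:

```python
def check_freqs_per_bands(freqs_per_bands, n_fft=2048):
    """Return True if the band widths cover every STFT bin exactly once."""
    num_bins = n_fft // 2 + 1  # one-sided STFT: 1025 bins for n_fft = 2048
    all_positive = all(width > 0 for width in freqs_per_bands)
    return all_positive and sum(freqs_per_bands) == num_bins


# The naive even split: 31 bands of 32 bins plus one band of 33 bins.
even_split = (32,) * 31 + (33,)

print(len(even_split), sum(even_split), check_freqs_per_bands(even_split))
# 32 1025 True
```

The last band is 33 bins wide only because 1025 is not divisible by 32.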

Answer 15 · 2023-09-25T23:03:14.000Z

nice! yeah, that should work better

Answer 16 · 2023-09-27T11:13:04.000Z

Hi @ZFTurbo, do you have the results yet?

According to what is written in the paper,

"According to our observations, the training progress of BS-Transformer is very slow, and it still remains low SDRs after two weeks of training on Musdb18HQ. Instead, BS-RoFormer models with L=6 get converged within a week."

and

"Models with L=6 are trained solely on the Musdb18HQ training set using 16 Nvidia V100-32GB GPUs."

This model likely requires many GPUs to train for several weeks.

Answer 17 · 2023-09-27T14:55:36.000Z

I'm currently training:

model = BSRoformer(
    stereo=True,
    dim=128,
    depth=12,
    time_transformer_depth=1,
    freq_transformer_depth=1,
    freqs_per_bands=(
        32, 32, 32, 32, 32, 32, 32, 32,
        32, 32, 32, 32, 32, 32, 32, 32,
        32, 32, 32, 32, 32, 32, 32, 32,
        32, 32, 32, 32, 32, 32, 32, 33,
    )
)

It's training very slowly. I use batch size 12. One epoch of 1000 batches takes 1 hour 40 minutes to finish. But the loss is steadily decreasing now. From the logs:

Training loss: 3.939826 SDR vocals: 1.7817
Training loss: 2.848435 SDR vocals: 1.4592
Training loss: 2.587866 SDR vocals: 1.5311
Training loss: 2.503295 SDR vocals: 1.3758
Training loss: 2.497599 SDR vocals: 1.9409
Training loss: 2.438752 SDR vocals: 1.8686
Training loss: 2.427704 SDR vocals: 1.4726
Training loss: 2.412679 SDR vocals: 1.4579
Training loss: 2.384547 SDR vocals: 1.7167
Training loss: 2.401358 SDR vocals: 2.0094
Training loss: 2.358802 SDR vocals: 1.3975
Training loss: 2.351327 SDR vocals: 1.6052
Training loss: 2.350673 SDR vocals: 1.9241
Training loss: 2.311954 SDR vocals: 2.3140
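
For readers unfamiliar with the metric in these logs: SDR is presumably the signal-to-distortion ratio in dB. A minimal sketch of the plain energy-ratio form follows; this is an assumption about the general definition, not necessarily the exact metric the training script computes, and the function name and toy signals are made up:

```python
import math


def sdr_db(reference, estimate, eps=1e-8):
    """Signal-to-distortion ratio in dB: 10 * log10(|s|^2 / |s - s_hat|^2)."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against division by zero and log of zero.
    return 10 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy example: the estimate is the reference scaled by 0.9.
ref = [0.0, 1.0, 0.0, -1.0]
est = [0.0, 0.9, 0.0, -0.9]
print(round(sdr_db(ref, est), 2))  # 20.0
```

Higher is better, so values hovering around 2 dB mean the error energy is still more than half of the target's energy.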

Answer 18 · 2023-09-27T14:56:34.000Z

@ZFTurbo hey, i apologize but the author reached out last night, and there was a bug in how i was folding the dimensions

should be fixed in the latest version! (now it is actually axial attention 😓 )
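
For context on why the folding matters: axial attention, in general, attends along one axis at a time by folding the other axis into the batch dimension. The NumPy sketch below illustrates that folding and the resulting number of attention scores; the shapes and variable names are illustrative assumptions, not the repository's code.

```python
import numpy as np

# batch, time frames, frequency bands, feature dim (all made up)
b, t, f, d = 2, 100, 62, 16
x = np.zeros((b, t, f, d))

# Attention over time: each band becomes an independent sequence of length t.
time_seqs = x.transpose(0, 2, 1, 3).reshape(b * f, t, d)

# Attention over frequency: each frame becomes an independent sequence of length f.
freq_seqs = x.reshape(b * t, f, d)

# Pairwise attention scores per example.
full_cost = (t * f) ** 2               # one joint sequence of t * f tokens
axial_cost = f * t ** 2 + t * f ** 2   # time pass plus frequency pass

print(time_seqs.shape, freq_seqs.shape)  # (124, 100, 16) (200, 62, 16)
print(full_cost, axial_cost)             # 38440000 1004400
```

With these toy sizes the axial version computes roughly 38 times fewer attention scores than attending over all time-frequency tokens jointly.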

Answer 19 · 2023-09-27T14:56:48.000Z

should train faster too

Answer 20 · 2023-09-27T17:59:21.000Z

I restarted. It became much faster. 15 minutes now for the same epoch.

Answer 21 · 2023-09-27T18:07:06.000Z

nice yea, that's axial attention at work

Answer 22 · 2023-09-28T07:13:10.000Z

In my opinion: the raw audio is sampled at 44100 Hz and n_fft is 2048, so we get 1025 bins, each about 44100/2048 ≈ 21.5 Hz wide.

freq (Hz)            bins in range   bins per band    bands
f < 1000             46.5            2                24
1000 < f < 2000      46.5            4                12
2000 < f < 4000      93.0            12               8
4000 < f < 8000      186.0           24               8
8000 < f < 16000     372.1           48               8
16000 < f < 22050    281.4           (split in two)   2

So the total number of bands is 24 + 12 + 8 + 8 + 8 + 2 = 62.
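
The same arithmetic as a small script. It assumes a 44100 Hz sample rate and n_fft = 2048 as stated in this comment, and it assumes the number of bands in each range is rounded up, which reproduces the counts above; the paper quote does not say how the rounding is done.

```python
import math

sample_rate, n_fft = 44100, 2048
num_bins = n_fft // 2 + 1      # 1025 one-sided STFT bins
bin_hz = sample_rate / n_fft   # about 21.53 Hz per bin

# (low Hz, high Hz, bins per band), taken from the paper quote in this thread
scheme = [
    (0, 1000, 2),
    (1000, 2000, 4),
    (2000, 4000, 12),
    (4000, 8000, 24),
    (8000, 16000, 48),
]

widths = []
for low_hz, high_hz, bins_per_band in scheme:
    bins_in_range = (high_hz - low_hz) / bin_hz
    # Assumption: round the band count up so the whole range is covered.
    n_bands = math.ceil(bins_in_range / bins_per_band)
    widths += [bins_per_band] * n_bands

# Whatever is left above 16 kHz is divided into two bands.
remaining = num_bins - sum(widths)
widths += [remaining // 2, remaining - remaining // 2]

print(len(widths), sum(widths), widths[-2:])  # 62 1025 [128, 129]
```

Because the lower ranges are rounded up, only 257 bins are left for the last two bands rather than the 281.4 bins that lie strictly above 16000 Hz.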

Answer 23 · 2023-09-28T08:57:54.000Z

Yes. I already fixed it:

band_split_params = (
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
      2, 2, 2, 2,
      4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
      12, 12, 12, 12, 12, 12, 12, 12,
      24, 24, 24, 24, 24, 24, 24, 24,
      48, 48, 48, 48, 48, 48, 48, 48,
      128, 129,
  )
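
To see what each integer in the tuple means, the widths can be turned into bin ranges with a running sum. This is a sketch that rebuilds the same tuple in compact form and assumes the 44100 Hz / n_fft = 2048 setup discussed above; `layout` and `edges` are names made up for the example.

```python
from itertools import accumulate

# (bins per band, number of bands), from low to high frequency
layout = [(2, 24), (4, 12), (12, 8), (24, 8), (48, 8), (128, 1), (129, 1)]
band_split_params = tuple(width for width, count in layout for _ in range(count))

# edges[i] is the first bin of band i; band i covers bins [edges[i], edges[i + 1]).
edges = [0] + list(accumulate(band_split_params))

bin_hz = 44100 / 2048  # assumed bin spacing in Hz
for i in (0, 24, 36, 60, 61):
    start, stop = edges[i], edges[i + 1]
    print(f"band {i:2d}: bins [{start}, {stop}), starts near {start * bin_hz:.0f} Hz")

print(len(band_split_params), edges[-1])  # 62 1025
```

So the entries are widths in STFT bins, not frequencies in Hz, and they must add up to the 1025 bins of the spectrogram.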

Answer 24 · 2023-09-29T17:29:54.000Z

think this should be solved

Answer 25 · 2023-10-02T14:54:28.000Z

Hi @lucidrains + @ZFTurbo

I work with birds and have an audio dataset of the sounds they make that I would like to train on with this code. I have had success with other models that were authored for music. Could I request that a way to train this model be added to the code? Sorry, I cannot accomplish this myself.

Answer 26 · 2023-10-12T15:45:14.000Z

@ZFTurbo hey, have you ever played around with complex neural networks? do you think it is worth doing a complex version of BS-Roformer. or a waste of time?

Answer 27 · 2023-10-15T10:45:56.000Z

No, I never tried.

Answer 28 · 2023-10-15T17:21:43.000Z

@ZFTurbo ah ok, thought you may have tried it before, as it seems you are in the business of winning kaggle competitions

Answer 29 · 2023-10-15T17:22:12.000Z

@ZFTurbo ok, maybe i'll think for a bit more until deciding whether to do a complex version of BS-Roformer

Answer 30 · 2023-10-15T17:36:39.000Z

@lucidrains They recently published a derived work that also uses RoFormer but, instead of BandSplit, uses Mel matrixing: https://arxiv.org/abs/2310.01809. It would be really great to see it reproduced too!

Answer 31 · 2023-10-15T18:28:00.000Z

@jarredou oh wow, yes indeed

in the spirit of open source, PRs are always welcome, but feel free to open an issue and i can get back to this if no one else does

Answer 32 · 2023-10-15T19:20:39.000Z

I'll share that I only open sourced this work because a precocious high schooler reached out asking me to consider it lol

Answer 33 · 2023-10-17T03:36:53.000Z

@jarredou had some time this evening and knocked it out here

think this may be my last open sourcing in the music separation space for a while

Answer 34 · 2023-10-17T18:41:57.000Z

@lucidrains You're amazing ! ❤️

Answer 35 · 2023-10-18T14:42:36.000Z

@jarredou thanks for the sponsor! 🙏 means a lot

Answer 36 · 2023-11-07T08:11:37.000Z

I published my training code here:
https://github.com/ZFTurbo/Music-Source-Separation-Training