AppleHolic/source_separation

Why conv1d not conv2d?

Closed this issue · 9 comments

Hi, thanks for the great code.

In the original phase-aware deep U-Net paper, every layer is a conv2d layer. Is there any reason you chose conv1d instead?

@keunwoochoi
When I read the paper, I could not find where the filter dimensions were specified, so I chose 1D convolutions, which are commonly used in sequential models. As I understand it, each channel of the spectrogram represents one frequency level, so 2D convolution does not help much for sequential models. Also, the large model for 44.1k singing voice separation already has about 90M parameters, so I did not consider 2D convolution when developing the model.

Thanks.

Alright, then I guess you probably missed Appendix B. In https://arxiv.org/pdf/1903.03107v1.pdf, page 15, the kernel sizes and strides are specified (e.g., F: (7x5), S: (2,2)) along with the channel info (e.g., C:32/45), meaning the inputs are treated as 2D data: (batch, channel, freq, time) or (batch, channel, time, freq).
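To make the 2D reading concrete, here is a minimal sketch of how a layer with F: (7x5), S: (2,2) would shrink a spectrogram. The input size (512 x 128), the choice of which axis gets the 7, and the padding values are assumptions for illustration only; the comment above does not state them. The formula is the standard output-size rule for a convolution with dilation 1.

```python
def conv_out_size(n, kernel, stride, padding=0):
    """Output length along one axis for a convolution with dilation 1."""
    return (n + 2 * padding - kernel) // stride + 1


def conv2d_out_shape(height, width, kernel=(7, 5), stride=(2, 2), padding=(0, 0)):
    """Output (height, width) of a 2D convolution applied to a 2D feature map."""
    return (
        conv_out_size(height, kernel[0], stride[0], padding[0]),
        conv_out_size(width, kernel[1], stride[1], padding[1]),
    )


# Hypothetical spectrogram of 512 x 128 (the axis order is an assumption).
no_pad = conv2d_out_shape(512, 128)                      # no padding
same_pad = conv2d_out_shape(512, 128, padding=(3, 2))    # padding of kernel // 2

print(no_pad)    # (253, 62)
print(same_pad)  # (256, 64)
```

With padding of half the kernel size, a stride of (2,2) halves both axes, which is the usual U-Net style downsampling.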

Oh, I see. I did not follow the original architecture closely, which could be why the model is so heavy. I will try to implement the original architecture. Thanks for pointing out that I implemented the paper only roughly!
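A rough parameter count shows why a conv1d stack that treats frequency bins as channels can become heavy. This is only a sketch: the 1025 frequency bins, the kernel size of 3, and the single input channel for the 2D case are hypothetical values, not taken from this repo or the paper. Only the (7x5) kernel and 32 output channels echo the numbers quoted above.

```python
def conv1d_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of weights (plus optional biases) in a 1D convolution layer."""
    return in_channels * out_channels * kernel_size + (out_channels if bias else 0)


def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of weights (plus optional biases) in a 2D convolution layer."""
    kh, kw = kernel_size
    return in_channels * out_channels * kh * kw + (out_channels if bias else 0)


# Hypothetical: 1025 frequency bins used as conv1d channels, kernel size 3.
n_freq = 1025
p1d = conv1d_params(n_freq, n_freq, 3)

# Hypothetical first conv2d layer: 1 input channel, 32 output channels, 7x5 kernel.
p2d = conv2d_params(1, 32, (7, 5))

print(p1d)  # 3152900
print(p2d)  # 1152
```

Parameter count is not the whole cost: a conv2d layer slides over both axes, so it can still be expensive in compute and activation memory even with few weights.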

No worries! But it seems to be trained very well 👍 Is the released model (and its samples) trained with Voicebank only? Or both Voicebank and AudioSet?

After releasing the public repo, I found a bug in the augmentation: that trained model was not actually trained with AudioSet. So I recently retrained the model and got a slightly different result.

So the uploaded samples and model are the Voicebank-only version.

I will also report the results with AudioSet. When AudioSet is used naively for augmentation in source separation, it degrades quality in this case, both in general and on music. I was so disappointed, and I should fix the README.

I will handle this issue tomorrow night. If you tried to get results with AudioSet, I'm sorry for letting you know so late.

I see. I also tried with AudioSet but didn't get a good result so I was curious. Interestingly, voicebank -- which seems not that big and hence not that great -- is much better. I am still very confused..
Thanks for your answers! Please feel free to close :-)

I thought the same. This case made me realize that data quality is more important than data quantity, and that I should check the code and results more thoroughly. Thanks for checking this repo in detail!

Done: I have posted the notice and deprecated that issue.

@keunwoochoi

When I found that bug, training of the first best model had already finished. So I decided to double-check.

When I ran a test to reproduce the issue above, the loss curve looked the same as that of the uploaded best checkpoint. At this point, it seems the AudioSet files have to be included to reproduce that result. (It is still training.)

I checked the downloaded AudioSet files, and they are correct: there are 18055 files, volume-normalized, at 22.05k. I will keep checking the overall process.
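A check like the one described can be sketched with the Python standard library alone. This is not the script used in this repo: it assumes the files are PCM `.wav` files (the only format the `wave` module reads), and the function name and directory path are hypothetical.

```python
import os
import wave
from collections import Counter


def summarize_wavs(root):
    """Count .wav files under root and tally their sample rates."""
    rates = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".wav"):
                with wave.open(os.path.join(dirpath, name), "rb") as wf:
                    rates[wf.getframerate()] += 1
    return sum(rates.values()), rates


# Example usage (hypothetical path):
# count, rates = summarize_wavs("path/to/audioset")
# print(count, dict(rates))
```

If every file was resampled as intended, `rates` should contain a single key and `count` should match the expected number of files.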

If you want to track that issue, see #12 or the merged master branch.

Thanks