Audio-AGI/AudioSep

Error when using music_speech..._89.98.pt: pytorch-lightning_version

tomthecollins opened this issue · 5 comments

From your paper, I wasn't sure of the role/purpose of music_speech_audioset_epoch_15_esc_89.98.pt.

Are these the saved model weights one should use if one wants to focus on separation of musical instruments from one another, say? Or is audiosep_base_4M_steps.ckpt still applicable in such use cases?

When I edited your example inference code from the readme to use music_speech_audioset_epoch_15_esc_89.98.pt on a Linux machine running Ubuntu, I got the following error.

Please clarify the purpose/use of this checkpoint, and if it is meant to be used, whether I need to modify the example inference code further.

Thanks!
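For reference, my edit was roughly the following (the readme example with only checkpoint_path changed; the audio path and query text are placeholders):

```python
import torch
from pipeline import build_audiosep, inference

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Readme example, but with checkpoint_path pointing at the CLAP checkpoint
# instead of audiosep_base_4M_steps.ckpt
model = build_audiosep(
    config_yaml='config/audiosep_base.yaml',
    checkpoint_path='checkpoint/music_speech_audioset_epoch_15_esc_89.98.pt',
    device=device)

inference(model, 'path_to_audio_file', 'textual_description', 'separated_audio.wav', device)
```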

Traceback (most recent call last):
  File "/home/blah/repos/AudioSep/sayd_infer_example.py", line 6, in <module>
    model = build_audiosep(
  File "/home/blah/repos/AudioSep/pipeline.py", line 17, in build_audiosep
    model = load_ss_model(
  File "/home/blah/repos/AudioSep/utils.py", line 387, in load_ss_model
    pl_model = AudioSep.load_from_checkpoint(
  File "/home/blah/anaconda3/envs/AudioSep/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1532, in load_from_checkpoint
    loaded = _load_from_checkpoint(
  File "/home/blah/anaconda3/envs/AudioSep/lib/python3.10/site-packages/lightning/pytorch/core/saving.py", line 65, in _load_from_checkpoint
    checkpoint = _pl_migrate_checkpoint(
  File "/home/blah/anaconda3/envs/AudioSep/lib/python3.10/site-packages/lightning/pytorch/utilities/migration/utils.py", line 113, in _pl_migrate_checkpoint
    old_version = _get_version(checkpoint)
  File "/home/blah/anaconda3/envs/AudioSep/lib/python3.10/site-packages/lightning/pytorch/utilities/migration/utils.py", line 136, in _get_version
    return checkpoint["pytorch-lightning_version"]
KeyError: 'pytorch-lightning_version'

I asked the same here; it seems to be a model focused on music separation, but I wasn't able to load it.

I was able to fix this error by copying the missing keys from the first checkpoint into the second.
But then the model parameters do not match. I guess the model definition for music separation is not provided.
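Roughly what I did for the first part (a sketch; the copied keys are my guess at the Lightning metadata that load_from_checkpoint expects, and the patched filename is just an example):

```python
import torch

# Workaround sketch: graft the Lightning bookkeeping keys that the CLAP
# checkpoint lacks from the AudioSep checkpoint, then save a patched copy.
base = torch.load('checkpoint/audiosep_base_4M_steps.ckpt', map_location='cpu')
clap = torch.load('checkpoint/music_speech_audioset_epoch_15_esc_89.98.pt', map_location='cpu')

for key in ('pytorch-lightning_version', 'epoch', 'global_step'):
    if key in base and key not in clap:
        clap[key] = base[key]

torch.save(clap, 'checkpoint/music_speech_patched.ckpt')
```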

music_speech_audioset_epoch_15_esc_89.98.pt is not used for music source separation. It is actually used to initialise the text encoder (https://github.com/Audio-AGI/AudioSep/blob/main/models/clap_encoder.py#L13) of the AudioSep model.
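So for inference you still load audiosep_base_4M_steps.ckpt; the CLAP checkpoint only needs to sit in the checkpoint/ directory so the text encoder can find it. Roughly, following the readme example:

```python
import torch
from pipeline import build_audiosep

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# The AudioSep (Lightning) checkpoint is what gets passed here; the CLAP
# checkpoint music_speech_audioset_epoch_15_esc_89.98.pt is read internally
# when the text encoder is constructed, so it just has to exist under checkpoint/.
model = build_audiosep(
    config_yaml='config/audiosep_base.yaml',
    checkpoint_path='checkpoint/audiosep_base_4M_steps.ckpt',
    device=device)
```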

It's caused by newer versions of the transformers library.

Run this script on the music_speech_audioset_epoch_15_esc_89.98.pt checkpoint: LAION-AI/CLAP#127 (comment)
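As far as I understand, that script just strips the position_ids buffer that newer transformers releases no longer expect. A sketch of the idea (not the exact script from the linked comment; the output path is just an example):

```python
import torch

src = 'checkpoint/music_speech_audioset_epoch_15_esc_89.98.pt'
dst = 'checkpoint/music_speech_audioset_epoch_15_esc_89.98_fixed.pt'

ckpt = torch.load(src, map_location='cpu')

# The weights may be nested under 'state_dict' and may carry a 'module.'
# prefix, so match on the key suffix rather than the full name.
sd = ckpt.get('state_dict', ckpt)
for key in [k for k in sd if k.endswith('text_branch.embeddings.position_ids')]:
    del sd[key]

torch.save(ckpt, dst)
```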