lucidrains/musiclm-pytorch

Exception when attempting to train

djqualia opened this issue · 26 comments

i'm excited to try this out!

i attempted to train, feeding in a MockTextAudioDataset similar to the example on AudioLM's page (that worked with the semantic trainer there), but encountered the following exception: TypeError: 'int' object is not iterable

Full stack trace, in case it helps:

File "train_mulan.py", line 60, in
trainer.train()
File "<@beartype(musiclm_pytorch.trainer.MuLaNTrainer.train) at 0x7ff0e221f160>", line 30, in train
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 363, in train
logs = self.train_step()
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 330, in train_step
data_kwargs = self.data_tuple_to_kwargs(next(self.dl_iter))
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 57, in cycle
for data in dl:
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/accelerate/data_loader.py", line 375, in iter
current_batch = next(dataloader_iter)
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/qualia/anaconda3/envs/audiolm/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
return self.collate_fn(data)
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 146, in inner
output = fn(datum)
File "/mnt/c/audio-ml-workspace/musiclm/musiclm_pytorch/trainer.py", line 156, in curtail_to_shortest_collate
min_len = min(*[datum.shape[0] for datum in data])
TypeError: 'int' object is not iterable

@djqualia i'm excited for you to try it! 😄

is your dataset at any point not returning a tuple, but a single value? could you possibly show me your script?

import torch
from musiclm_pytorch import MusicLM, MuLaNTrainer
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer, MuLaNEmbedQuantizer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

from torch.utils.data import Dataset

class MockTextAudioDataset(Dataset):
    def __init__(self, length = 100, audio_length = 320 * 32):
        super().__init__()
        self.audio_length = audio_length
        self.len = length

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        from random import randrange
        mock_audio = torch.randn(randrange(self.audio_length // 2, self.audio_length))
        mock_text = torch.randint(0, 12, (256,)).long()
        return mock_text, mock_audio

trainer = MuLaNTrainer(
    mulan = mulan,
    dataset = MockTextAudioDataset(),
    batch_size = 4
)

trainer.train()

This seems to run fine for me

I figured out the key difference in my setup that triggers this bug: setting batch_size = 1. With that you should be able to reproduce it, AFAICT
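
For reference, the crash comes from the star-unpack in curtail_to_shortest_collate shown in the trace above: with batch_size = 1 the list comprehension yields a single int, so min(*[x]) becomes min(x), which tries to iterate that int. A minimal sketch of the failure and the obvious fix, not necessarily the exact patch that was applied:

# with batch_size = 1, the collate function receives a single datum
lengths = [320]

# min(*lengths) expands to min(320), which tries to iterate the int and raises
# TypeError: 'int' object is not iterable

# passing the list itself (no star-unpack) works for any batch size, including 1
min_len = min(lengths)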

Do you think it's expected for the (contrastive?) loss to be both positive and negative?

spectrogram yielded shape of (65, 26667), but had to be cropped to (64, 26656) to be patchified for transformer
0: loss: -3.088079392910004e-05
0: saving model to /mnt/c/models/audiolm/mulantest
1: loss: -0.008366793394088745
2: loss: -0.00033165328204631805
3: loss: 0.0015411581844091415
4: loss: -0.0021463483572006226
5: loss: 0.003450023476034403
6: loss: -0.001429535448551178
7: loss: -0.001284077763557434
8: loss: 0.01317517552524805
9: loss: 0.07162240147590637
10: loss: -0.0007345117628574371
11: loss: 0.00026188790798187256
12: loss: 0.016698703169822693
13: loss: 0.0058104507625103
14: loss: 0.00037843361496925354
15: loss: -0.00018790364265441895
16: loss: -0.0001080445945262909
17: loss: -0.001908978447318077
18: loss: -0.00023999251425266266
19: loss: 0.03030853345990181
20: loss: -0.00021585077047348022
21: loss: 0.0001592119224369526
22: loss: -0.00013920455239713192
23: loss: -0.0021669212728738785
24: loss: 0.00395401194691658
25: loss: -5.50001859664917e-05
26: loss: -0.0026106592267751694
27: loss: -0.0008263345807790756
28: loss: 0.0012336960062384605
29: loss: 5.2521005272865295e-05
30: loss: -0.005257192999124527

@djqualia hey! fixed the issues, it had to do with batch size of 1 as you figured out

so in contrastive learning, one is forcing the network to play a game of matching up pairs across the two modalities (text and audio in this case), so a batch size of 1 would mean there's nothing to play against

for the negative numbers, i believe it is a result of the paper going with decoupled contrastive learning. it is my first time seeing this technique used in the wild, and i believe the loss may be negative

you can try turning it off and it should be all positive values
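
For intuition, here is a tiny sketch of why the decoupled variant can dip below zero: standard InfoNCE keeps the positive pair inside the log-sum-exp denominator, so its loss is bounded below by 0, while decoupled contrastive learning drops the positive from the denominator, so the negatives-only term can fall below the positive logit. This is a generic illustration, not the exact loss used in the repository:

import torch
import torch.nn.functional as F

def infonce(sim, temperature = 0.1):
    # standard InfoNCE: the positive stays in the denominator, so the loss is >= 0
    sim = sim / temperature
    labels = torch.arange(sim.shape[0])
    return F.cross_entropy(sim, labels)

def decoupled(sim, temperature = 0.1):
    # decoupled contrastive learning: mask the positive out of the denominator,
    # so the log-sum-exp over negatives alone can be smaller than the positive logit
    sim = sim / temperature
    pos = sim.diagonal()
    neg = sim.masked_fill(torch.eye(sim.shape[0], dtype = torch.bool), float('-inf'))
    return (-pos + neg.logsumexp(dim = -1)).mean()

# toy similarity matrix where the positives already dominate
sim = torch.full((4, 4), -5.) + torch.eye(4) * 10.

print(infonce(sim))    # small but positive
print(decoupled(sim))  # negative

With a batch size of 1 neither formulation has any negatives at all, which is the degenerate case fixed above.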

@djqualia so realistically, MuLaN won't be trained with the code in this repository

we should rely on making open-clip audio compatible

recently they managed to get CoCa working, and so we can easily reach SOTA for audio CLIP training (think audio CLIP + CoCa + some other features in that repository), and also get a great audio captioner to boot

@lucidrains So how should we train the MuLan model? Is it ok if we use batch_size > 1 ?

@ukemamaster yup, you need very high batch sizes, like 128-256

this is why the job is better suited for a group like open clip

that group, affiliated with Laion, has done many SOTA open sourced CLIP models by now

you can try tinkering with it on a small scale though

@lucidrains OK.
And which dataset did you use to train the MuLaN and the AudioLM transformers?

@ukemamaster they are not trained at all yet, outside of google

if you are interested in having something trained, i would recommend joining Laion and getting in touch with Marianne. She is working on amassing a dataset

@ukemamaster i will put my back into getting MuLaN integrated into open-clip next week

@lucidrains Yes. I am very interested in training and reproducing google's results.
I will be waiting for the MuLaN integration. Thanks

thanks for the education @lucidrains !

fwiw i intend to tinker with this mulan implementation (as large a batch size as possible) to see where i get, until another option is available :-)

to make sure i understand, when you talk about integration with open-clip, are you thinking of using open-clip as an implementation for parts of mulan here, or thinking of open-clip as an organization that has the means to create a publicly available model...?

Hi, just wanted to ask is it possible to integrate the pretrained model from this paper instead of mulan? https://github.com/seungheondoh/music-text-representation/ @lucidrains

@djqualia Which dataset are you using to train MuLaN? And the AudioLM transformers?

@djqualia @lucidrains It seems like MuLan needs a dataset containing audio and text in pairs, like:

sample_1 = [music_audio_1, text_1]
sample_2 = [music_audio_2, text_2]
.
.
.

But SoundStream and the 3 transformers for AudioLM can be trained using audio only.

So my question is:

Is it OK to train MuLaN on a music dataset, and SoundStream plus the 3 transformers on a speech (non-music) dataset, while still conditioning them through the MuLaNEmbedQuantizer with MuLaN trained on music data?

Or do all of them need to be trained on the same music dataset?
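
For reference, the conditioning path being asked about looks roughly like this, following the repository README (treat the dimensions and namespace values as placeholders): MuLaN is trained on (audio, text) pairs, wrapped by MuLaNEmbedQuantizer, and the resulting quantized embeddings condition the three AudioLM transformers, which themselves only see audio.

import torch
from musiclm_pytorch import MuLaNEmbedQuantizer

# mulan here is assumed to be a MuLaN instance trained on (audio, text) pairs,
# as constructed earlier in this thread
quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,
    conditioning_dims = (1024, 1024, 1024), # e.g. the model dims of the semantic, coarse and fine transformers
    namespaces = ('semantic', 'coarse', 'fine')
)

# conditioning embeddings for one of the transformers are obtained by passing
# audio (or text) through the quantizer under the matching namespace
wavs = torch.randn(2, 1024)
conds = quantizer(wavs = wavs, namespace = 'semantic')

The wiring itself does not require MuLaN and the AudioLM stack to share a dataset; whether a music-trained MuLaN gives useful conditioning for speech-trained transformers is an empirical question.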

The original paper said they used pretrained models, which I believe were all trained on different datasets

@epinnock The original paper says they used pretrained weights only for the MuLaN model. The rest (SoundStream, w2v-BERT, and the 3 transformers) were trained using music data.

[screenshot of the relevant excerpt from the MusicLM paper]

However, the datasets for MuLaN (according to their original paper) and for the MusicLM transformers are not public.

Thanks for updating this @ukemamaster. Are there currently pretrained implementations of SoundStream and w2v-BERT? Also, I saw this paper recently that implements music generation using MuLaN and diffusion: https://google-research.github.io/noise2music/noise2music.pdf. This is fairly outside my area of expertise, but instead of MuLaN could you use this? https://github.com/seungheondoh/music-text-representation/ @lucidrains @ukemamaster

thanks for the education @lucidrains !

fwiw i intend to tinker with this mulan implementation (as large a batch size as possible) to see where i get, until another option is available :-)

to make sure i understand, when you talk about integration with open-clip, are you thinking of using open-clip as an implementation for parts of mulan here, or thinking of open-clip as an organization that has the means to create a publicly available model...?

yup both! we can train there, i can build wrappers that have common interfaces that support their pretrained models, as i've done for dalle2

their group, Laion, is a very legit crowd! many many successes by now

actually, i don't think open-clip is under Laion, just that they use the Laion dataset, and have a lot of infra support from Laion + Stability

Hi, just wanted to ask is it possible to integrate the pretrained model from this paper instead of mulan? https://github.com/seungheondoh/music-text-representation/ @lucidrains

i'm not sure, is it any good? honestly, i think there isn't a great open sourced foundation model for text-audio or text-music yet, or i would have heard about it by now

@epinnock The original paper says they used pretrained weights only for the MuLaN model. The rest (SoundStream, w2v-BERT, and the 3 transformers) were trained using music data.

[screenshot of the relevant excerpt from the MusicLM paper]

However, the datasets for MuLaN (according to their original paper) and for the MusicLM transformers are not public.

yeah, you all should join Laion and start thinking about data in a collaborative manner

realistically, no one has had the level of success of Laion. i mean, they even got an outstanding paper award at the last neurips. it would be wise to simply join their group at this point

ok, i'm going to close this issue, as it has been addressed

@djqualia Have you tried training MuLaN with some real data?
If yes, which data? And how do you convert text into tokens to feed into the model?

fwiw i have not yet gotten around to training mulan. i have limited hardware and am still trying to train soundstream/hubert first (while putting together a larger dataset). for music sources, consider FMA (free music archive), jamendo, and AudioSet.
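
On the tokenization part of the question above: MuLaN's TextTransformer consumes integer token ids (the mock dataset in this thread just draws them with torch.randint), so any tokenizer whose vocabulary fits inside the TextTransformer's token embedding works. A minimal sketch using a generic Hugging Face tokenizer, which is an assumption here and not something the repository prescribes:

import torch
from transformers import AutoTokenizer

# assumption: any off-the-shelf tokenizer is fine as long as the TextTransformer's
# num_tokens covers its vocabulary size (~30k for bert-base-uncased)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def texts_to_tokens(texts, max_length = 256):
    # pad / truncate to a fixed length so captions can be batched alongside the audio
    encoded = tokenizer(
        texts,
        padding = 'max_length',
        truncation = True,
        max_length = max_length,
        return_tensors = 'pt'
    )
    return encoded.input_ids.long()

tokens = texts_to_tokens(['a crazy jazz solo over a walking bassline'])
print(tokens.shape) # torch.Size([1, 256])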