openai/jukebox

Fine-tuning models

ElizavetaSedova opened this issue · 24 comments

Hello! Could you please tell me what resources are needed to fine-tune the pre-trained top-level prior on new style(s)? I want to use the 5b model. Is it possible without GPipe?

I run this but with my data:

mpiexec -n {ngpus} python jukebox/train.py --hps=vqvae,prior_1b_lyrics,all_fp16,cpu_ema --name=finetuned \
--sample_length=1048576 --bs=1 --aug_shift --aug_blend --audio_files_dir={audio_files_dir} \
--labels=True --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000

I set a limit of 30 seconds for my data, but my process is being killed due to lack of RAM. That said, I'm using the 1b model for now. What else can help reduce the load on my machine's resources?
When I try to use the 5b model, I get a CUDA out-of-memory error. How much GPU memory is needed?

CCpt5 commented

I'm also curious about this and have been attempting to train the 5b model. Colab's new pricing system has made it easy to get access to an A100 with 40 GB of VRAM. Is this not sufficient? My attempts have also failed with CUDA out-of-memory errors.

btrude commented

I have finetuned the 5b models and can probably leave the last word on all the various requests over the years for how to do so:

You need to remove the DDP wrapper for the 5b and then run training with pytorch 1.7.1 (later versions introduce a memory leak in openai's fp16 optimizer and require porting to the official pytorch fp16 optimizer pattern) using a GPU with at least 48gb of VRAM. Also, unless you delete the initial optimizer state and then reload it from the checkpoint, you will never be able to restore an optimizer checkpoint with only 48gb of VRAM.
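Roughly, those two changes look something like this (the helper names, the use_ddp flag, and the "opt" checkpoint key below are illustrative assumptions, not the repo's exact API):

import torch
from torch.nn.parallel import DistributedDataParallel

def maybe_wrap_ddp(prior, use_ddp, local_rank=0):
    # for single-GPU finetuning of the 5b, return the bare module so no DDP
    # gradient buckets or broadcast buffers get allocated on top of the model
    if not use_ddp:
        return prior
    return DistributedDataParallel(prior, device_ids=[local_rank], output_device=local_rank)

def restore_optimizer(opt, checkpoint_path):
    # load onto CPU, drop any freshly initialised optimizer state, and free the
    # cache before restoring, so two copies of the state never coexist in VRAM
    ckpt = torch.load(checkpoint_path, map_location="cpu")
    opt.state.clear()
    torch.cuda.empty_cache()
    opt.load_state_dict(ckpt["opt"])  # assumes the checkpoint stores optimizer state under "opt"
    return opt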

In order to tune the 5b_lyrics model I needed to use model parallel, and if I remember correctly it uses about 55GB of VRAM and took quite a bit of effort to get working.

@btrude I have a dataset of wav music files (my use case only involves music, not lyrics).
I would like to fine-tune the model with this dataset so that, after fine-tuning, it can generate music samples without any lyrics. Could you please point me to how this can be done?

@btrude how long does fine-tuning the 5b models take approximately per second of training data?

@btrude could you possibly share the changes you made to get 5b_lyrics working, like removing the dependence on ddp and deleting the optimizer state? We're at pytorch 1.10, not sure if the memory leak is still an issue, but I can test that.

btrude commented

@tanner-ducharme It was something like 6 s/iteration when I did this originally on an RTX 6000 with a batch size of 1, and it takes thousands of steps to achieve good results with finetuning, depending on how much data you have. I have not done it recently on newer hardware or with better code.
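(For a rough sense of scale with made-up numbers: at 6 s/iteration, a hypothetical 20,000-step finetune works out to roughly 33 hours of GPU time, before any sampling or evaluation.)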

@mackamann my post was referring to code I wrote in 2021 which, since then, has been changed significantly as I have been refactoring this repo almost completely in my free time to use a bunch of different things like flash attention, different encoders/decoders, key/bpm/color/text/image embeddings etc. Sadly, that code is private now and won't be in a place to be released for a very long time, if ever.

If you want to finetune the 5b/5b_lyrics models these days then I would recommend renting an A100-80GB from Paperspace (it's only $3/hour right now, and one can reasonably finetune those models in at most 4 or 5 days, probably less) and then just using the code here as you would with the smaller models. Another option that works well for me and speeds things up considerably is writing a script that preprocesses your data into vqvae codes before you train; then you can remove the vqvae encode/decode functions from the prior and just load the codes directly, without spending GPU memory/processing time to encode the audio. I would guess that you can finetune the 5b with that change alone using a 48GB GPU, but I haven't tried any of the openai configurations to say for certain.

@btrude thanks for the info, much appreciated!

I haven't had much luck signing up with cloud compute providers, they seem to be very picky about who they'll let in, esp for A100/80G (paperspace included)... I've been waving my CC around, but so far no takers. :)

Your suggestion about what seems to be pruning sounds interesting, not sure how I would go about that though. So, you would be pre-processing the audio into vae codes rather than encoding/decoding?

btrude commented

@mackamann I would recommend telling them that you are doing independent research or similar though I also had professional uses to discuss with them at the time so that was maybe a factor in getting those instances turned on 🤷‍♂️

But the training procedure for any of the transformers in this repo is roughly:

  1. Load audio
  2. Ensure the audio is mono
  3. Pass the audio to the VQVAE to get the appropriate codes for the level being trained
  4. Pass those codes to the transformer
  5. Repeat

The optimized procedure is:

  1. Create a script that extracts the codes from the vqvae and saves them, e.g. for the top-level prior:
import torch
from jukebox.make_models import make_vqvae
from jukebox.utils.audio_utils import audio_preprocess

vqvae = make_vqvae(...)
for x in audio_files:
    x = audio_preprocess(x, hps)                        # blend stereo to mono, shape (N, T, 1)
    z = vqvae.encode(x, start_level=2, end_level=3)[0]  # level-2 codes for the top-level prior
    torch.save(z, random_filename)

Observe that the vqvae is only superficially connected to the prior in this repo (which in my opinion is an anti-pattern):
https://github.com/openai/jukebox/blob/master/jukebox/prior/prior.py#L52...L54

  2. So then you remove that relationship (by no longer passing the vqvae functions to the prior) and just load and pass the tensors from the files you already encoded in step 1 (see the sketch below). This reduces memory (you only have to load the vqvae when you want to decode samples back to raw audio, and you can delete it again before you continue training) and saves time if you are doing >1 epoch, since you never encode your data more than once.
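For illustration, the training-step side of that change might look roughly like this (a sketch only; z_forward and its signature are an assumption about the prior class and may differ between versions):

import torch

def train_step(prior, code_file, y=None, fp16=True):
    # code_file was written by the preprocessing script in step 1 and holds the
    # level-2 codes for one chunk of audio, shape (1, n_tokens)
    z = torch.load(code_file).cuda()
    # feed the codes straight to the transformer instead of calling the prior on raw
    # audio, which would re-encode through the vqvae every step
    loss, metrics = prior.z_forward(z=z, z_conds=[], y=y, fp16=fp16)  # assumed entry point
    return loss, metrics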

Thanks for that! I took a shot at it earlier and am still struggling, so I'll try your approach. This is what I have (unfinished), with copious amounts of help from ChatGPT for the Python; the forked jukebox is mine and has changes to work in newer Colabs. I haven't yet figured out how to load these when training.

!rm -rf jukebox
!git clone https://github.com/mackamann/jukebox-finetune-train jukebox

%cd jukebox
!pip install .
%cd ..
!pip install av==8.1.0

!rm -f vqvae.pth*
!wget https://openaipublic.azureedge.net/jukebox/models/5b/vqvae.pth.tar

import os
import torch
import librosa
import numpy as np
from jukebox.make_models import make_vqvae
from jukebox.data.files_dataset import FilesAudioDataset
from jukebox.hparams import setup_hparams
from jukebox.utils.io import load_audio
import jukebox.utils.dist_utils as jdist
import torch.distributed as dist

if not dist.is_initialized():
  jdist.setup_dist_from_mpi(port=29500)

class PreprocessAudioDataset(FilesAudioDataset):
    def __init__(self, hps):
        super().__init__(hps)
        # make_vqvae needs a concrete sample_length; reuse the one from hps
        self.vqvae = make_vqvae(setup_hparams('vqvae', dict(sample_length=hps.sample_length)), 'cpu')
        checkpoint_path = 'vqvae.pth.tar'
        # jukebox checkpoints keep the model weights under the 'model' key
        self.vqvae.load_state_dict(torch.load(checkpoint_path, map_location='cpu')['model'])
        self.vqvae.eval()  # inference only, no gradients needed

    def encode_file(self, audio_file):
        # load_audio returns (channels, samples); note this only encodes the first
        # sample_length samples of each file
        data, sr = load_audio(audio_file, sr=self.sr, offset=0, duration=self.sample_length)
        data = data.mean(axis=0)  # mix stereo down to mono
        x = torch.as_tensor(data, dtype=torch.float32)[None, :, None]  # (1, T, 1), as vqvae.encode expects

        with torch.no_grad():  # no need to track gradients
            # level-2 codes for the top-level prior, per the suggestion above
            z = self.vqvae.encode(x, start_level=2, end_level=3)[0]

        npy_filename = os.path.splitext(os.path.basename(audio_file))[0] + '.npy'
        np.save(os.path.join('/content/gdrive/MyDrive/encoded_audio', npy_filename), z.cpu().numpy())

    def preprocess_all_files(self):
        for audio_file in self.files:
            self.encode_file(audio_file)

class HPS:
    def __init__(self, hps_dict):
        self.sr = hps_dict['sr']
        self.channels = hps_dict['channels']
        self.audio_files_dir = hps_dict['audio_files_dir']
        self.sample_length = hps_dict['sample_length']
        self.min_duration = hps_dict['min_duration']
        self.max_duration = hps_dict['max_duration']
        self.aug_shift = hps_dict['aug_shift']
        self.labels = hps_dict['labels']

hps = {
    'sr': 44100,
    'channels' : 2,
    'audio_files_dir' : '/content/gdrive/MyDrive/mp3s',
    'sample_length': 1048576,
    'min_duration' : 24.0,
    'max_duration' : 64.0,
    'aug_shift' : False,
    'labels' : False,
}

dataset = PreprocessAudioDataset(HPS(hps))
dataset.preprocess_all_files()
btrude commented

@mackamann At a glance, yes, this looks like the right idea, but if you have a lot of data you should put the vqvae on a GPU, otherwise this could take a very long time. After you have the converted data you can make a pytorch dataset class that just loads and chunks your saved files and is significantly simpler than the audio-loading one from this repo. You will also have to modify the prior class to accept vqvae codes instead of unencoded audio vectors.
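Something along these lines, as a minimal sketch (the .npy layout and the 8192-token top-level context length are assumptions to adjust for your setup):

import glob
import numpy as np
import torch
from torch.utils.data import Dataset

class CodesDataset(Dataset):
    """Serves fixed-length chunks of pre-encoded vqvae codes saved as .npy files."""

    def __init__(self, codes_dir, n_ctx=8192):
        self.n_ctx = n_ctx  # context length of the top-level prior (assumed)
        self.chunks = []
        for path in sorted(glob.glob(f"{codes_dir}/*.npy")):
            z = np.load(path).reshape(-1)  # flatten (1, n_tokens) -> (n_tokens,)
            # keep only full-length windows; a fancier loader could random-crop instead
            for start in range(0, len(z) - n_ctx + 1, n_ctx):
                self.chunks.append(torch.as_tensor(z[start:start + n_ctx], dtype=torch.long))

    def __len__(self):
        return len(self.chunks)

    def __getitem__(self, idx):
        return self.chunks[idx]  # (n_ctx,) LongTensor of codes

Batches from this (through a normal DataLoader) are then (bs, n_ctx) LongTensors that go to the modified prior in place of raw audio.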

Thanks @btrude.

I'm fortunate that a cloud provider finally accepted me... After a lot of setup and script-writing I got the model to start training, and then it stopped at step 1 lol. Somehow, with the default jukebox repo and the small number of changes made by the finetraining notebook, not all of the tensors end up on the GPU. It gets through the initial sample and then just errors out with:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

I've finetuned 1b_lyrics many times with this same code, so I'm a bit surprised. I figured it would "just work".

btrude commented

@mackamann I recall having to fix a few legitimate bugs in order to get the 5b models training, but just calling tensor.to("cuda") should be enough to get past that particular issue.

Hehe, I did that, overdid it a bit, and ended up over 80G. I also tried yanking out DDP thinking that might be the culprit (that was a losing battle). Setting the default device to cuda also didn't seem to help. It did run once, and I noticed the GPU was at 77016MiB, so maybe this was just on the hairy edge and something defaulted to CPU instead? Will pytorch do that? I think I'll have to find some way to get the VRAM footprint down a little to fit reliably in the 80G.

Do you happen to recall which tensors you changed? The prior is explicitly set to .cpu() in a bunch of spots. I'm still not sure how this manages to work with 1b_lyrics using the same code, so odd.

btrude commented

@mackamann Removing ddp should be pretty easy and will definitely give you back some memory if you only have 1 GPU. Also, everything should be on the GPU. I find it a bit difficult to believe that a batch size of 1 for the 5b_lyrics would result in >80gb of VRAM being used if it can be done with 48gb for the 5b, but I really don't remember exactly how much memory was used and I was using model parallel w/ 2 GPUs which would have a different memory profile obviously (which is to say that I could be way off, but I don't think so).

I don't think you are off, it seems way too high...

I will try again removing DDP, I got pretty far, but then out of the blue the training started working. I feel like I just need to do a little optimization to fit consistently in the 80G. Hopefully there's some low hanging fruit somewhere. The VRAM has been consistently at 77016MiB and I'm now at ~9000 steps after running overnight. I agree that the 5b_lyrics shouldn't need that much more VRAM than 5b.

Maybe one of the bugs you fixed lowered the VRAM footprint? It seems like it is almost double what it should be, as if something is not being freed after use.

btrude commented

Nice, that's very cool that you got it working - everything is a lot easier with 80gb of ram 😅

Very true, and thanks for all your help! If you have any other wisdom to share, I'm all ears... you seem to have chewed on the codebase a LOT. :)

Bah, the auto sample at 12k iterations caused memory to spike and it OOM'ed when it tried to resume. :(
Gonna have to disable that.

@btrude when you were finetuning the 5b/5b_lyrics models, did you also finetune the upsamplers and then do annealing?

btrude commented

I never got improved results from finetuning the upsamplers. In 2023 I would instead recommend a seq2seq model into a different codec (soundstream/encodec/etc) if you want to upsample, given how much better those models sound compared to the vqvae. I can't comment on how effective annealing the learning rate is for finetuning (though it is very effective during pretraining), but I would at least scale the original learning rate by something roughly proportional to the ratio of your actual batch size to the original one.
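i.e. something like the rough linear-scaling heuristic below, where both pretraining numbers are placeholders rather than the actual 5b settings:

pretrain_lr, pretrain_bs = 1.5e-4, 512  # placeholders; read the real values from your hparams
finetune_bs = 1
finetune_lr = pretrain_lr * (finetune_bs / pretrain_bs)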

makes sense...

I'm sooooooo close to having this working: I've trained 15k steps, ported over the finetuning sampling notebook, and was able to sample, but the resulting .wavs were all the same, mostly noise. I had this same issue when sampling from a finetuned 1b_lyrics model using the "fast upsample" notebook; that produced this exact same noise. I know the 1b_lyrics model was fine because I could sample OK with the non-"fast upsample" notebook.

@btrude has this ever happened to you? I wonder if it is the hops or something. I'm not sure what knobs would control this behavior.

Also, once I get this all working, I'll post my python scripts here for everyone to use.

CCpt5 commented

Did training on 5b_lyrics model ever work out for anyone?

CCpt5 commented

A user in one of the SD groups mentioned this new PR for pytorch: pytorch/pytorch#106200

Perhaps it would solve the VRAM problem?


CUDA Unified Memory
This PR adds support for CUDA Unified Memory, or UVM, to PyTorch, such that we effectively makes device RAM "swappable" to system RAM. We usually have plenty of system RAM (e.g., several terabytes) compared to device RAM (tens of gigabytes), therefore UVM brings us a much large safe margin before hitting OOM.

The feature can be optionally enabled by setting PYTORCH_CUDA_USE_UVM=1 in environment variables.

OK, so new NVIDIA drivers now have an option in the control panel, "Prefer Sysmem Fallback", which will fall back to system RAM when the required VRAM is unavailable (instead of erroring/crashing with an OOM message).

If anyone is willing to use this feature and attempt to write a python script/method to train on 5b_lyrics that'd be super cool!!

(Locally have a 4090 / 24gb VRAM / 64gb RAM - but Runpod/Vaast.ai are also options)

Setting:

[screenshot: nvidia-system-fallback]