Possible to train on your own music?
christomitov opened this issue · 10 comments
Any instructions on what would be required to tag the data and train this to generate music in your style?
Hi, for tag annotation, you can try an NN-based tagging model, e.g., https://github.com/minzwon/sota-music-tagging-models. Since most tagging models are trained on similar data, there is no need to worry about out-of-domain issues. For model fine-tuning, you can use a smaller lr, e.g., 1e-5, for 2-3 epochs with the train.py file :)
I'm just getting into testing the training, but so far my workflow is as follows:
mp3tag -> export a csv of tags: title, artist_name, release (from album), tag_list (from Genre), path (combined from path and filename).
Then I am using https://github.com/seungheondoh/lp-music-caps to generate the pseudo captions required for the training. The captioning script reads each row from the mp3tag csv, captions the audio stored at path, adds the caption, and writes the result as a new row to a new csv.
I am still running the captioning conversions. I had a couple of issues because the captioning is based on 10 second clips from the audio, so it might entirely miss a vocalist or something similar. The workaround for now is to take 3-5 10 second samples from around the middle of the audio, combine those 3-5 captions, and run the result through a summarizer in T5-basic, which seems to work for now. If it all works well, I can probably put it up, as it should work fine on Windows or Linux, and I can throw together a docker compose for it quickly.
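A minimal sketch of the clip sampling described above, assuming the clips are spread evenly over the middle half of the track. The function name, the window_frac parameter and the even spacing are illustrative assumptions, not taken from the actual script.

```python
def middle_clip_offsets(duration_s, n_clips=3, clip_s=10.0, window_frac=0.5):
    """Return start times (in seconds) for n_clips clips taken from the middle of a track.

    window_frac is a hypothetical knob: the fraction of the track, centred on
    its midpoint, that the clips are drawn from.
    """
    if duration_s <= clip_s:
        # Track is shorter than one clip: caption it from the start.
        return [0.0]
    # Centred window, never shorter than one clip or longer than the track.
    window = min(max(duration_s * window_frac, clip_s), duration_s)
    first_start = (duration_s - window) / 2
    last_start = first_start + window - clip_s
    if n_clips == 1:
        return [round((first_start + last_start) / 2, 2)]
    step = (last_start - first_start) / (n_clips - 1)
    return [round(first_start + i * step, 2) for i in range(n_clips)]


# A 200 second song sampled with three 10 second clips.
offsets = middle_clip_offsets(200, n_clips=3)
```

Each offset would then be passed to whatever loader cuts the 10 second clip for the captioning model.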
This new csv will be converted to a .parquet file, and should then be good to train with the prebuilt scripts in this repo.
I haven't gotten to actually testing the training yet; the captioning and summarizing takes roughly 5-7 seconds per song. I'll update once I do. I will also be testing the repo feizc linked for captioning, as it may well be much better than the one I am using.
EDIT: I also found I got better results when I added the tag_list (Genres) field directly to the end of the caption. The captions all get summarized, which strips a lot of data out of the 3-5 captions, but adding the tags helped greatly with preserving the genre/style of the music in the caption, which is obviously important for the training. An easy example is a metal song: the caption would originally call it aggressive rock or metal, whereas adding the tag_list to the caption before summarization might result in it being more correctly called progressive metal or death metal. This should give more specificity in recreating the style during inference.
OK. So it appears you don't really NEED the parquet file, as it is only used to generate a json to train from. It might still be a good idea for storing or sharing datasets, however. I am currently testing on Windows; once I have more concrete numbers I will move it to a GPU Linux box chosen based on the required VRAM.
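The idea in the EDIT above can be sketched as a small helper that joins the per-clip captions and appends the genre tags before the text goes to the summarizer. The function name and the de-duplication step are assumptions for illustration; the summarizer call itself is left out.

```python
def build_summary_input(clip_captions, tag_list):
    """Join per-clip captions and append the genre tags before summarization.

    clip_captions: captions produced for the 3-5 clips of one song.
    tag_list: comma-separated genres as exported from mp3tag.
    """
    seen = set()
    parts = []
    for cap in clip_captions:
        cap = cap.strip()
        # Skip empty captions and exact repeats (case-insensitive), keeping order.
        if cap and cap.lower() not in seen:
            seen.add(cap.lower())
            parts.append(cap.rstrip(".") + ".")
    tags = ", ".join(t.strip() for t in tag_list.split(",") if t.strip())
    text = " ".join(parts)
    if tags:
        # Tags go last so the summarizer sees the specific genres.
        text = f"{text} {tags}"
    return text.strip()


text = build_summary_input(
    ["aggressive rock with distorted guitars", "fast double kick drums."],
    "Progressive Metal,Death Metal",
)
```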
NOTE: for running training on Windows, you will need to do a couple of things. First, make sure nvcc --version reports 12.1 (minimum), as this is required for compiling flash-attn, which will take some time. Prebuilt binaries are not available for Windows, at least by default. You must also pip install progressbar and librosa==0.9.2. The librosa pin prevents an error during training caused by a change in the required arguments supplied by audiollm2, and is needed on both Windows and Linux. The next change is Windows only. You will need to find this line in train.py:
dist.init_process_group("nccl")
and change it to
#dist.init_process_group("nccl") #currently unsupported on windows, trying gloo
dist.init_process_group("gloo")
The nccl process group is currently unsupported on Windows, while gloo is slower but works alright. This change should not be required on a Linux machine, or when running in a GPU docker container even if the host is Windows. Process groups are really only needed for distributed training, but one is currently initialized even when running on a single node.
I have been testing against Giant, which is the worst case for VRAM usage as it is by far the largest model. Using a smaller model will lower the VRAM required, but I am not sure by how much, as I have only just got it working. It may also be possible to reduce VRAM usage by changing how the models are loaded, such as loading t5_xxl in int8/int4 instead of bf16, though this will always come with some loss of quality. If the quality drop is acceptable you can go with that, but it will require a good bit of testing. If you are familiar with the image generation model Flux, this is how people run much smaller versions and still get very good quality: the smaller model may reach 80-85% of the default model at a quarter of the size and VRAM requirements. It may also be possible to load the different models onto different GPUs to avoid the VRAM <-> system RAM swap. I may experiment with this, as I currently have a system with GPUs that each lack the VRAM to run the whole pipeline but could each run one component. I have no idea how to do this yet, but I will look into how it was done with the base Flux model and T5.
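Rather than editing the line by hand on each machine, the backend could be chosen from the platform. This is only a sketch: pick_backend is a hypothetical helper, not something in the repo's train.py.

```python
import sys


def pick_backend(platform=None):
    """Return "gloo" on Windows (nccl is unsupported there) and "nccl" elsewhere."""
    platform = sys.platform if platform is None else platform
    # sys.platform is "win32" on Windows, including 64-bit builds.
    return "gloo" if platform.startswith("win") else "nccl"


# In train.py the hard-coded call could then become:
#   dist.init_process_group(pick_backend())
backend = pick_backend()
```

On a Linux box or inside a GPU docker container this keeps nccl, so the same script works in both places.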
If you are testing training, the training output is a bit of a mess. The easiest way to check progress is the log, which by default logs every 100 steps. You must also set --ckpt-every unless you want the default of 100,000; this controls how often it writes a usable model and resume point. I had to lower it because I am testing on a machine with insufficient VRAM, so reaching 100,000 steps would take a long time.
Current results:
(Windows - using gloo - pretty slow)
model: giant - accum_iter 8 - VRAM 23.5 GB on GPU / 27 GB Shared / 50.5 GB Total - This is on Windows; on Linux it should fit on a 48GB VRAM GPU, such as an A6000. The OS and programs use a bit of VRAM on Windows, which will not be the case on a headless Linux box. Tests have shown it fails under a 24GB VRAM limit; unloading models before saving might work around this.
model: large - accum_iter 8 - VRAM 23.6 GB on GPU / 11.3 GB Shared / 34.9 GB Total
@eftSharptooth hi thank you for your comments! How many songs did you train with and what could you control? Did the resulting finetune have the timbre of the singer and what kind of control did you have over the output when using your finetuned version?
I have not had any success completing a training run yet, as it crashes when trying to save the model. This is most likely due to running it on Windows (I had to change nccl to gloo), and VRAM spikes while compiling for the save. I am hoping to use a 48GB card to test the training on Linux once that card frees up. If anyone has completed any training so far, could you please note down what OS, model size and card (VRAM) you used? It would likely help people going forward with experiments.
@tensimixt Also, to better answer your question, the custom dataset was about 12000 full length songs after trimming out anything where the captioning didn't reach a minimal level of descriptiveness. Those could be captioned manually, but I just wanted something I could run a test with. The next move is likely to chop those songs up into segments, then recaption those with beginning of song, middle of song, end of song (or intro, song, outro), add that to the captions, and see if I can complete a training run that way. Having much shorter segments to train on will significantly reduce the VRAM requirements and let me train a model (very slowly) on a 24GB card. I just want to do some tests to see how it all works before having to rent an 8xA100 or 8xA6000 system, as I expect that would be required to train a large or giant model in a reasonable timeframe.
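The segmenting plan could look roughly like the sketch below, which cuts a track into fixed-length segments and tags each with its position so the tag can be added to the caption. The 30 second segment length and the function name are assumptions, not values used in the thread.

```python
def segment_song(duration_s, segment_s=30.0):
    """Split a track into consecutive segments labelled by position in the song.

    Returns a list of (start_s, end_s, position) tuples. The last segment may
    be shorter than segment_s.
    """
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        bounds.append((start, end))
        start = end

    labelled = []
    for i, (seg_start, seg_end) in enumerate(bounds):
        if i == 0:
            position = "beginning of song"
        elif i == len(bounds) - 1:
            position = "end of song"
        else:
            position = "middle of song"
        labelled.append((seg_start, seg_end, position))
    return labelled


segments = segment_song(100, segment_s=30.0)
```

Each tuple gives the cut points for one training clip plus the phrase to add to its caption.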
@eftSharptooth I would like to try training this on a large dataset like the one you mentioned. Where did you find this?
@eftSharptooth Which specific dataset are you using to train?
Sorry for the slow response! The database I have been testing with is just my local music. I used both full songs and chopped up segments, and captioned them all with the same music captioning model mentioned in the repo code. I then used a free program called mp3tag to push all the mp3 metadata to a csv, merged the corresponding captions in, and used the repo to generate a custom_dataset.json file which you can train with. I had no luck with the full songs, as only a 24GB card was available for testing and it wasn't enough VRAM. I will be trying again soon with the chopped up dataset.
NOTE: I think the csv from mp3tag is only required if you are trying to add more info to the captions. I added artist, genre etc. to the caption so that it could learn the styles as well, to give better control over the output. Like I said though, no real success with the full length songs yet on a 24GB card.
I also downloaded the following datasets for testing. As they are referenced in the repo, they let me make sure I was conforming to the required formats when creating my own dataset, though that isn't really necessary anymore, as I posted the relevant details somewhere here in this GitHub:
fma
enrich-fma-large
audioset-full
enrich-audioset-music
The json file for the custom dataset is pretty simple; the most complicated part (for custom datasets) is definitely putting better info into the captions. An example entry from the json for a custom dataset is as follows:
{
"wav": "Z:\\Music\\Lady Gaga\\Chromatica (2020)\\Lady Gaga - Chromatica - 08 - 911.mp3",
"label": "the song is medium tempo with a groovy drumming rhythm, percussive bass line, keyboard accompaniment and various percussion hits. it features a passionate female vocal singing over punchy kick and snare hits, shimmering cymbals, wide tinny wooden percussions, groovy synth bass and repetitive synth lead melody. music performed by Lady Gaga in 2020. Dance-Pop,Electronic,House,Pop,Synth-Pop"
},
Pretty much all the stuff at the end is added from the mp3tag data. The captions themselves were tougher, as the captioning model is really only good for (I think) 20 or 30 second clips, so I had it sample the beginning, middle and end, then stuck the captions together and summarized them with a very basic LLM. Then I tacked the mp3tag info on the end and called it a day. Captioning 12000 songs (and putting it all together into the custom dataset) took a couple of days, but I have a Python script somewhere that helped automate it. I'll find it and link it here when I do.
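Until that script turns up, here is a rough sketch of the last step: merging captions into the mp3tag csv rows and producing entries in the "wav"/"label" shape shown above. It assumes the csv columns named earlier in the thread (artist_name, tag_list, path), a hypothetical optional year column, and that the json file is a list of such entries. It is not the author's script.

```python
import csv
import io
import json


def build_entry(row, caption):
    """Build one {"wav", "label"} entry from a csv row and its summarized caption."""
    label = caption.strip().rstrip(".") + "."
    artist = (row.get("artist_name") or "").strip()
    year = (row.get("year") or "").strip()  # hypothetical column, may be absent
    if artist:
        label += f" music performed by {artist}"
        if year:
            label += f" in {year}"
        label += "."
    tags = (row.get("tag_list") or "").strip()
    if tags:
        # Genre tags are tacked on the end, as in the example entry.
        label += f" {tags}"
    return {"wav": row["path"], "label": label}


def build_dataset(csv_text, captions_by_path):
    """Return entries for every csv row that has a caption, keyed by its path."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        build_entry(row, captions_by_path[row["path"]])
        for row in reader
        if row["path"] in captions_by_path
    ]


sample_csv = (
    "title,artist_name,release,tag_list,path,year\n"
    '911,Lady Gaga,Chromatica,"Dance-Pop,Pop",song.mp3,2020\n'
)
entries = build_dataset(sample_csv, {"song.mp3": "groovy synth pop"})
dataset_json = json.dumps(entries, indent=2)
```

Rows without a caption are skipped, which matches trimming out songs where the captioning did not produce anything usable.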