app crashes during training
Closed this issue · 23 comments
app crashes with app server connection error. Terminal seems to indicate training has failed.
I had a similar app connection issue when extracting frames, for which the solution was to reduce the number of videos from 24 to 12. However, I assume the issue with training is related to GPU utilisation instead, similar to what is described here?
- I am using the local installation through Windows Subsystem for Linux (WSL).
my GPU setup:
- NVIDIA GeForce RTX 4080
- CUDA V11.6.55
- Nvidia studio driver 552.22
closing and re-launching the app after completing labelling and before starting training seems to improve the progress but still results in the same crash eventually.
potential temporary resolution: run training directly within my terminal to see if the issue is related to just the app?
e.g. run:
python scripts/train_hydra.py --config-path=/mnt/c/Users/X48823DG/lightning-pose/Pose-app/data/lp_mk1/ --config-name=model_config_lp_mk1.yaml
It seems, based on the output and the pytorch issue you linked above, that there is some indexing error happening. You should be able to label a subset of the extracted frames and train models without issue. Were you trying to label more frames during training? (This also should be fine, but just trying to get a sense of what you were doing in the app leading up to the crash)
It's interesting that it crashes so far into training, I haven't seen that before. The model should have been run on every frame in the dataset by that point.
I labelled all the extracted frames (240).
I just tried running training again and it made it to epoch 220 this time (the example I showed above made it to 109). Based on the last 3 attempted training runs, each run seems to make it a little further before crashing. Unless it is utilising data from the previously failed training models?
Are all of those attempts from training within the app?
Did training from the command line help, or did you get crashes there as well?
And these crashes are happening with the supervised baseline model, correct? (easier to start debugging with the simplest model)
Are all of those attempts from training within the app?
Did training from the command line help, or did you get crashes there as well?
they are all from within the app. I have not tried from the command line. Shall I run that now?
And these crashes are happening with the supervised baseline model, correct? (easier to start debugging with the simplest model)
Yes this is the supervised model.
Yes trying from the command line is a good next step, if it crashes there then we know it's not the app, and it focuses the following steps a bit more.
In the meantime, if you look at your CollectedData.csv file, does it include any rows that are all empty? (i.e. are there any frames with no labels?)
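One way to check for unlabeled frames without opening the file by hand is sketched below. It assumes CollectedData.csv has three header rows followed by one row per frame, with the image path in the first column; the sample data and the function name find_unlabeled_rows are made up for illustration.

```python
import csv
import io


def find_unlabeled_rows(csv_text, n_header_rows=3):
    """Return image names (first column) of rows whose label cells are all blank."""
    reader = csv.reader(io.StringIO(csv_text))
    unlabeled = []
    for i, row in enumerate(reader):
        if i < n_header_rows or not row:
            continue  # skip the header block and completely blank lines
        values = [cell.strip() for cell in row[1:]]
        # A frame counts as unlabeled when every coordinate is empty or "nan".
        if all(v == "" or v.lower() == "nan" for v in values):
            unlabeled.append(row[0])
    return unlabeled


# Hypothetical two-frame file: the second frame has no labels.
sample = """scorer,rater,rater
bodyparts,nose,nose
coords,x,y
labeled-data/vid0/img001.png,10.5,20.1
labeled-data/vid0/img002.png,,
"""
print(find_unlabeled_rows(sample))
```

For a real file, pass the file contents, e.g. find_unlabeled_rows(open(path).read()), and adjust n_header_rows if your header layout differs.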
Yes trying from the command line is a good next step, if it crashes there then we know it's not the app, and it focuses the following steps a bit more.
I seem to be getting this error, despite the path being accurate. the hydra config file loads ok then this error is thrown:
In the meantime, if you look at your CollectedData.csv file, does it include any rows that are all empty? (i.e. are there any frames with no labels?)
there are no empty rows.
looks like the error is because data/lp_mk1 is repeated twice in the path. the app does some manipulation of paths in the background. I would make sure the data path in the config file is an absolute rather than relative path for running from the command line, but you'll have to make sure and switch it back to the relative path when you want to train from the app again (sorry for the complication there)
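A minimal sketch of why a relative data path can end up duplicated, and why an absolute one does not. This is not the app's actual path handling; the helper names and the /home/user/... paths are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_data_dir(base_dir, data_dir):
    """Join data_dir onto base_dir unless data_dir is already absolute."""
    data_path = PurePosixPath(data_dir)
    if data_path.is_absolute():
        return data_path
    return PurePosixPath(base_dir) / data_path


def has_repeated_suffix(path, suffix):
    """True if `suffix` (e.g. 'data/lp_mk1') appears back to back in `path`."""
    doubled = PurePosixPath(suffix) / suffix
    return str(doubled) in str(PurePosixPath(path))


# Hypothetical base directory that already ends in the project folder.
base = "/home/user/Pose-app/data/lp_mk1"

relative = resolve_data_dir(base, "data/lp_mk1")
absolute = resolve_data_dir(base, "/home/user/Pose-app/data/lp_mk1")

print(relative, has_repeated_suffix(relative, "data/lp_mk1"))
print(absolute, has_repeated_suffix(absolute, "data/lp_mk1"))
```

The relative value is appended to a base that already contains it, so the segment appears twice; the absolute value is used as-is.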
ok cool, good to hear. maybe try running from the app again? It's very strange that it progressed further and further each time. There is not really a difference between the training script you ran directly from the command line and the training function called by the app (but clearly something is going on).
Re: epochs vs iterations, see the config docs here, specifically the min/max epochs parameter - let me know if that answers your question!
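As a rough illustration of how the two units usually relate for a plain supervised model: one epoch is one pass over the training frames, and one iteration is one batch. The numbers below (192 training frames, batch size 16, 300 epochs) are assumptions for the example, not values read from any config, and semi-supervised models may count steps differently.

```python
import math


def steps_per_epoch(n_train_frames, batch_size):
    """Number of batches needed to see every training frame once."""
    if n_train_frames < 0 or batch_size <= 0:
        raise ValueError("need n_train_frames >= 0 and batch_size > 0")
    return math.ceil(n_train_frames / batch_size)


# Assumed example values, not taken from a real config.
per_epoch = steps_per_epoch(192, 16)
total_steps = per_epoch * 300
print(per_epoch, total_steps)
```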
ok cool, good to hear. maybe try running from the app again? It's very strange that it progressed further and further each time. There is not really a difference between the training script you ran directly from the command line and the training function called by the app (but clearly something is going on).
So I ran it again in the app and this time it completed the supervised model! But now it seems to be stuck with starting the semisupervised model. The app remains 'running' but nothing is happening in the terminal
Here is an error I spotted in my terminal:
/mnt/c/Users/X48823DG/lightning-pose/LP/lib/python3.10/site-packages/fiftyone/db/bin/mongod: error while loading shared libraries: libcrypto.so.1.1: cannot open shared object file: No such file or directory
...
fiftyone.core.service.ServiceListenTimeout: fiftyone.core.service.DatabaseService failed to bind to port
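A quick way to check whether the library named in that message is visible to the dynamic loader, sketched with the standard ctypes module. The helper name is made up, and the result is only a hint: mongod may search different locations than the Python process does.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open `name`, else False."""
    try:
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


print("libcrypto.so.1.1 loadable:", can_load_shared_library("libcrypto.so.1.1"))
```

If this prints False inside the same environment the app runs in, the missing library is likely a system-level (WSL distribution) issue rather than something specific to the project data.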
as a temporary solution, I tried training from the terminal using a fresh config (as per your config guides). The supervised training runs successfully but my output directory is still missing the bold files below. I assume the semisupervised training also ran due to the existence of the predictions_pca, but I am not sure about the context model?
/path/to/models/YYYY-MM-DD/HH-MM-SS/
├── tb_logs/
├── video_preds/
│   └── labeled_videos/
├── config.yaml
├── predictions.csv
├── predictions_pca_singleview_error.csv
└── predictions_pixel_error.csv
Re: fiftyone error: if you go to the FIFTYONE tab does it render the page? (Even if it says "No dataset selected" that's fine, I'm curious if you just get a blank page or a page that looks like it's trying to load something but never does)
Re: model training: I'm wondering if this has something to do with WSL...I haven't had this problem with native Linux before.
The output directory you shared is just for the supervised model. The predictions_pca_singleview_error.csv file is computed for every model, as long as you have indicated in your config which keypoints should be included in the pca loss. This is helpful, for example, when trying to identify outliers (the pca metric complements likelihood as a measure of how good the prediction is).
If the semi-supervised model ran it would be in a completely separate YYYY-MM-DD/HH-MM-SS/ directory, so it seems it didn't run.
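To see how many runs actually exist, something like the following could list every YYYY-MM-DD/HH-MM-SS directory under a models folder and report which expected files are missing from each. The function names and the default list of expected files are illustrative choices, not part of the package.

```python
import re
from pathlib import Path

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
TIME_RE = re.compile(r"^\d{2}-\d{2}-\d{2}$")


def list_model_runs(models_root):
    """Return sorted run directories shaped like <root>/YYYY-MM-DD/HH-MM-SS."""
    runs = []
    for date_dir in Path(models_root).iterdir():
        if not (date_dir.is_dir() and DATE_RE.match(date_dir.name)):
            continue
        for time_dir in date_dir.iterdir():
            if time_dir.is_dir() and TIME_RE.match(time_dir.name):
                runs.append(time_dir)
    return sorted(runs)


def missing_outputs(run_dir, expected=("config.yaml", "predictions.csv")):
    """Return the names in `expected` that are absent from run_dir."""
    return [name for name in expected if not (Path(run_dir) / name).exists()]
```

Calling list_model_runs("/path/to/models") and then missing_outputs(run) for each result gives a quick overview; one directory per trained model is what you would expect to see.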
Regarding the video predictions, these are additional flags you need to set in the config file; see the relevant parameters in the eval section of the config file, highlighted at the bottom of this docs section.
The flags will automatically run inference on a set of videos after the model completes training. If, as in your case, a model has already been trained and you'd like to run inference on a set of videos, see the inference documentation.
Also, I'm not sure what version of the app you're currently using, but if you run (from inside the Pose-app directory) git pull --recurse-submodules
you might get some updates. We have another batch of updates coming out by next week at the latest that refactors some of the way model training is launched, maybe that could help?
Re: fiftyone error: if you go to the FIFTYONE tab does it render the page? (Even if it says "No dataset selected" that's fine, I'm curious if you just get a blank page or a page that looks like it's trying to load something but never does)
I get a page that is trying to load something but never does.
If the semi-supervised model ran it would be in a completely separate YYYY-MM-DD/HH-MM-SS/ directory, so it seems it didn't run.
Can each model be run separately in the terminal? If so, how can I manually run the semi-supervised and context models?
git pull --recurse-submodules
It is already up to date. Perhaps next week's updates will make a difference as you suggest.
Ok for fiftyone I'm guessing there's an issue with WSL, but I'll look into the error a bit more.
Yes you can run each model separately in the terminal. Check out the docs for the config file and let me know if you have any questions that aren't answered there.
You might also try training models one at a time from the app - if it worked for the supervised model, you could try just selecting the context model and see if it trains to completion.
Please keep me updated.
within the app, all the models will train when selected individually, one by one. After each training session, I shut down and relaunch the app before training a different model.
Ok not ideal but at least it's (partially) working!
@DavidGill159 just wanted to check in on this - are you still only able to train one model at a time, or are you able to train multiple now?
Hi @themattinthehatt the app didn't end up being reliable enough to train any networks beyond supervised. For now I am just training my models directly through the terminal. On another note, I am currently training a semisupervised ctx model on 600 labelled frames (the quality of my previous network, with 240 labelled frames, was poor) and based on the time spent producing 1 epoch, it seems like it will take 3 days to run through all 300 epochs. Is this normal? My GPU is being utilised.
ugh I'm really sorry to hear about the app not working. We haven't had that issue before with the linux installations, so maybe it is a windows/WSL issue? We'll try to look into that.
3 days for training is definitely not normal (although the semisupervised context models are the most computationally demanding). A few questions:
- how big are your images?
- what kind of GPU are you running on?
- what is the GPU utilization you're seeing? you can run watch -n 0.5 nvidia-smi in another terminal to get usage updates twice a second
- how long did it take to train a basic supervised model? the semisupervised context model should take ~7x longer (though that depends on several different factors)
- would you mind sharing your config file here for me to look at?
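If you would rather log utilization than watch it, a small sketch along these lines queries nvidia-smi once and parses the result. The query flags are standard nvidia-smi options; the function names and the sample line fed to the parser are made up, and the parser assumes every field is numeric (it would fail on values such as "[N/A]").

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(output):
    """Parse nvidia-smi CSV output into one dict per GPU."""
    stats = []
    for line in output.strip().splitlines():
        util, used, total = (float(x) for x in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Made-up sample line in the format produced by the query above.
print(parse_gpu_stats("97, 14250, 16376\n"))
```

Calling read_gpu_stats() in a loop with time.sleep(0.5) would approximate the watch command while training runs in another terminal.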
Sure, here is all you requested:
how big are your images?
488 KB, 1280x1024, 96 dpi, 24 bit
In my config I resize the images to 384x384.
What kind of GPU are you running on?
NVIDIA GeForce RTX 4080 with Cuda 11.6
GPU utilisation
how long does it take to train a basic supervised model?
about 10-15 minutes on 250 labelled frames.
config
config attached
lp_mk2_config.txt
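Putting the numbers from this thread side by side gives a sense of how far off the projected time is. This is back-of-the-envelope arithmetic only: it applies the "~7x" rule of thumb to the 10-15 minute supervised run and ignores the difference in labelled frames (250 vs 600) and any difference in epoch count.

```python
MINUTES_PER_DAY = 24 * 60

supervised_minutes = (10, 15)            # reported range for the supervised model
expected_factor = 7                      # "~7x longer" rule of thumb
observed_minutes = 3 * MINUTES_PER_DAY   # projected 3 days
n_epochs = 300

# Expected duration of the semisupervised context model, in minutes.
expected_minutes = tuple(m * expected_factor for m in supervised_minutes)

# Implied time per epoch for the projected 3-day run.
minutes_per_epoch = observed_minutes / n_epochs

# How many times slower than expected (best case first, worst case second).
slowdown = tuple(observed_minutes / m for m in reversed(expected_minutes))

print(expected_minutes, minutes_per_epoch, slowdown)
```

Under these assumptions the expected run is roughly one to two hours, while the projection works out to about 14 minutes per epoch, dozens of times slower than the rule of thumb suggests.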
another note... My computer shut down during training on my latest network (supervised + ctx). I tried resuming training from the checkpoint according to these instructions. However, this didn't work as the total number of epochs (first training session up until checkpoint + epochs since resuming) sums to more than my set epoch limit (750). The absolute path of the ckpt file is correct.
@DavidGill159 sorry for the slow reply, I've been traveling for the past couple weeks.
Regarding the slow training, I don't see any obvious reason why this should be happening given your config file. I am a bit surprised that you're not getting out of memory errors with the semi-supervised context model with resize dims of 384x384 and batch sizes of 16 for both the labeled and unlabeled frames. You could try reducing dali.context.train.batch_size to 8 instead of 16 and see how that speeds things up?
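For anyone scripting config changes, a dotted key such as dali.context.train.batch_size maps onto nested dictionaries once the YAML is loaded. The helper below is a generic sketch (with_override is a made-up name, and the dictionary is a stripped-down stand-in for the real config), not part of the package.

```python
import copy


def with_override(cfg, dotted_key, value):
    """Return a deep copy of cfg with the nested `dotted_key` set to `value`."""
    new_cfg = copy.deepcopy(cfg)
    node = new_cfg
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node[key]  # raises KeyError if the path does not exist
    if leaf not in node:
        raise KeyError(dotted_key)  # refuse to create new keys by accident
    node[leaf] = value
    return new_cfg


# Stand-in for the relevant part of a loaded config file.
cfg = {"dali": {"context": {"train": {"batch_size": 16}}}}
smaller = with_override(cfg, "dali.context.train.batch_size", 8)
print(smaller)
```

The original dictionary is left untouched, which makes it easy to compare runs with the old and new batch size.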
Regarding resuming training: apologies for the lack of clarity in the documentation - the description of model.checkpoint should not state "to continue training from an existing checkpoint" but rather "to initialize weights from an existing checkpoint". In principle it should be possible with pytorch lightning to actually resume training rather than starting over, but we haven't built that in yet. If that's a feature you think would be useful please open a new issue!
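Since the checkpoint only initializes the weights and the epoch counter starts again from zero, one workaround is to lower the max epochs setting by the number of epochs already completed. The helper below is just that subtraction; the 400 completed epochs in the example is a made-up figure, and a restarted run is presumably not identical to a true resume because only the weights carry over.

```python
def remaining_epochs(target_total, completed):
    """Epochs still needed to reach target_total after `completed` epochs."""
    if completed < 0 or target_total < 0:
        raise ValueError("epoch counts must be non-negative")
    return max(0, target_total - completed)


# Example: a 750-epoch budget with a hypothetical 400 epochs already done.
print(remaining_epochs(750, 400))
```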