when adding a new dataset, I can't run it with another dataset.
yoelshimi opened this issue · 10 comments
Hi,
I recently created a new dataset of my own, a "Homography" dataset, using your framework.
When I try to train a model using it together with another standard dataset, I get the following series of errors.
I am training the neuflow model (I tried switching to other models, which didn't help).
datasets: sintel + homography
single GPU
The run command I use is: ptlflow/train.py neuflow --train_dataset sintel+homography --val_dataset sintel+homography --train_transform_cuda --train_num_workers 2 --train_batch_size 8
When I start the run, I get this error:
python3.10/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Running with just sintel works OK, and so does running with just the homography dataset.
If I add the following call at the top of train.py, before Lightning is imported:
torch.multiprocessing.set_start_method("spawn")
then later, when the dataloader transform tries to move my data to the GPU, I get:
File "/home/.../work/OIS/ptlflow/ptlflow/data/flow_transforms.py", line 135, in call
inputs[k] = torch.from_numpy(v).to(device=self.device, dtype=self.dtype)
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
When I look at the GPU usage, I see that the GPU isn't being used.
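For reference, the change I tried at the top of train.py looks roughly like this (only a sketch from memory; force=True is just there to avoid the "context has already been set" error and may not be needed):

# Sketch of the workaround: force the "spawn" start method before Lightning
# (or anything else that initializes CUDA) gets imported.
import torch.multiprocessing

torch.multiprocessing.set_start_method("spawn", force=True)

# ... the regular train.py imports (lightning, ptlflow, etc.) follow here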
I'd appreciate your help.
Thanks!
I tried this with Python 3.8, then 3.10.9 (currently running).
pytorch-cuda 11.6 h867d48c_1 pytorch
pytorch-lightning 2.4.0 pypi_0 pypi
pytorch-msssim 1.0.0 pypi_0 pypi
pytorch-mutex 1.0 cuda pytorch
torch 2.1.0 pypi_0 pypi
torchmetrics 1.4.3 pypi_0 pypi
torchsummaryx 1.3.0 pypi_0 pypi
torchvision 0.16.0 pypi_0 pypi
conda list | grep lightning
lightning 1.9.0 pypi_0 pypi
lightning-cloud 0.5.70 pypi_0 pypi
lightning-utilities 0.11.7 pypi_0 pypi
pytorch-lightning 2.4.0 pypi_0 pypi
The spawn idea comes from: https://stackoverflow.com/questions/72779926/gunicorn-cuda-cannot-re-initialize-cuda-in-forked-subprocess
Hi, thanks for reporting.
I think this error is caused when using --train_transform_cuda with multiple GPUs. You can try to remove this flag, or use a single GPU.
Unfortunately, I am also not sure what causes this error, and it requires further debugging. In my personal tests this happens with some combinations of datasets. Sometimes the behavior also changes depending on the machine.
I hope that helps.
Best.
Thanks for your response. I removed the --train_transform_cuda flag, re-ran on a single GPU, and now I get:
File "/home/.../miniconda3/envs/./lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 24803) exited unexpectedly
This happens after it finishes loading the dataset into memory, on the first training epoch/step. Note that validation on multiple datasets does seem to work.
I also changed the DataLoader definition in base_model.py so that the timeout parameter is very large, so I don't hit a timeout.
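Roughly, that change looks like this (only the timeout argument is the point; the dataset and the other arguments are illustrative stand-ins, not the exact code in base_model.py):

# Sketch of the DataLoader change: a very large timeout so _try_get_data
# never raises a timeout error. Everything except `timeout` is a placeholder.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.zeros(8, 3, 64, 64))  # stand-in for the real dataset
loader = DataLoader(
    train_dataset,
    batch_size=8,
    num_workers=2,
    timeout=36000,  # very large, so the worker queue never times out
)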
Hmm, I have never had this error before. Does this only happen when using your new dataset or does it also happen when training with sintel only?
Does the stack trace tell where this error starts in the ptlflow code?
Hi, here is the full trace:
Oops! <class 'RuntimeError'> occurred.
(<class 'RuntimeError'>, RuntimeError('DataLoader worker (pid(s) 16143) exited unexpectedly'), <traceback object at 0x7f0aad647080>)
Traceback (most recent call last):
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/multiprocessing/queues.py", line 114, in get
raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/scripts/module_wrapper.py", line 128, in main
mod.main(arguments.split())
File "/home/yoels/work/OIS/ptlflow/train_OIS.py", line 236, in main
train(args)
File "/home/yoels/work/OIS/ptlflow/train_OIS.py", line 189, in train
trainer.fit(model)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 608, in fit
call._call_and_handle_interrupt(
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 88, in launch
return function(*args, **kwargs)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 650, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1103, in _run
results = self._run_stage()
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1182, in _run_stage
self._run_train()
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1205, in _run_train
self.fit_loop.run()
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/loops/epoch/training_epoch_loop.py", line 187, in advance
batch = next(data_fetcher)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/utilities/fetching.py", line 184, in __next__
return self.fetching_function()
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/utilities/fetching.py", line 265, in fetching_function
self._fetch_next_batch(self.dataloader_iter)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/utilities/fetching.py", line 280, in _fetch_next_batch
batch = next(iterator)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/supporters.py", line 569, in __next__
return self.request_next_batch(self.loader_iters)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning/pytorch/trainer/supporters.py", line 581, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/lightning_utilities/core/apply_func.py", line 64, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
idx, data = self._get_data()
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1277, in _get_data
success, data = self._try_get_data(self._timeout)
File "/home/yoels/miniconda3/envs/ai-isp/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1145, in _try_get_data
raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 16143) exited unexpectedly
I get these errors only when I train with both datasets (the validation step works OK). I wrote a collate function myself, but this still occurs even with num_workers=0 and batch_size=1. I haven't checked many other datasets recently, so I'll retry that.
OK, thank you. There's not much to see in the stack trace.
Unfortunately, I also don't know the cause of this problem. Here are a few suggestions that may help:
- Try to train with the homography dataset alone; does the error disappear?
- Train with two other standard datasets (e.g., sintel+chairs) to see if the error is caused by using multiple datasets.
- Try to print the file path during loading to see if a particular file is causing the problem (see the sketch after this list).
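For the last point, something like this wrapper might help (just a sketch; LoggingDataset is not part of ptlflow, so adapt it to your Homography dataset class):

# Sketch: wrap a dataset so each __getitem__ call prints which sample is being
# loaded; when a worker dies, the last printed line points to the sample.
from torch.utils.data import Dataset

class LoggingDataset(Dataset):
    def __init__(self, base_dataset):
        self.base = base_dataset

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        # flush=True so the message is not lost in a killed worker's stdout buffer
        print(f"Loading sample {index} from {type(self.base).__name__}", flush=True)
        return self.base[index]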
Hi,
- When I train with the homography dataset only but validate with homography+sintel, it runs (it has been training for a day or so now).
- When I train with other datasets, I still seem to get a similar error, but not in the same place:
- For the run command: train.py with -A neuflow --train_dataset autoflow+homography --val_dataset sintel+chairs --train_num_workers 96 --train_batch_size 8
- DataLoader worker (pid 31543) is killed by signal: Killed.
File "/.../ptlflow/ptlflow/models/neuflow/backbone.py", line 35, in forward
return self.norm(x1 + x2)
File "/.../ptlflow/ptlflow/models/neuflow/backbone.py", line 128, in forward
x2 = self.block2(img)
File "/.../ptlflow/ptlflow/models/neuflow/neuflow.py", line 145, in forward
feature0_s8, feature0_s16 = self.backbone(img0)
File "/.../ptlflow/ptlflow/models/base_model/base_model.py", line 411, in training_step
preds = self(batch)
RuntimeError: DataLoader worker (pid 31543) is killed by signal: Killed.
I don't think it's a memory issue, but it could be; I'm running on a single RTX A6000 + 16 CPUs with 64 GB of RAM overall.
- The error appears even when I switch datasets, and it happens even when the entire batch comes from the same dataset.
It can switch location a bit, but the error is always that a worker was killed.
Do you have any further ideas?
Thanks
Hi, so after multiple attempts, it seems to be as follows:
If I use the same datasets for both training and validation (sintel+homography for both), then train.py runs and I don't get the signal-kill error (at least not yet, after 1 hour).
I'm not sure of the cause; I will keep you updated if I find something out.
Thanks
Hi, for the sake of closure for anyone finding this error: it is just a lack of memory being reported, so if I run out of CPU (host) memory, the OS kills the worker process.
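If anyone wants to confirm the same thing, this is roughly how it can be checked (psutil is an extra dependency, not something ptlflow uses):

# Sketch: log host RAM usage during training; if used memory climbs toward the
# machine's total right before a worker dies, the OOM killer is the culprit.
import psutil

def log_host_memory(tag=""):
    mem = psutil.virtual_memory()
    print(f"[{tag}] host RAM: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB", flush=True)

You can also check the kernel log (dmesg) for "Out of memory: Killed process" messages to confirm that the worker was killed by the OOM killer.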
yoel