victorchall/EveryDream-trainer

ckpt file not saving when training has finished.

Meathelix1 opened this issue · 2 comments

Once training has finished and it goes to save the ckpt file, it tends to use all system RAM and the file is never saved.

Windows 10 | 16-core AMD | 32 GB RAM | 3090

```
Training halted. Summoning checkpoint as last.ckpt
Training complete. max_steps or max_epochs reached, or we blew up.

Traceback (most recent call last):
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 604, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 380, in save
return
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 259, in exit
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\caffe2\serialize\inline_container.cc:319] . unexpected pos 9926808960 vs 9926808856

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 754, in
trainer.fit(model, data)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1236, in _run
results = self._run_stage()
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1323, in _run_stage
return self._run_train()
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1353, in _run_train
self.fit_loop.run()
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\loops\base.py", line 205, in run
self.on_advance_end()
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 294, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1636, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 308, in on_train_epoch_end
self._save_topk_checkpoint(trainer, monitor_candidates)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 379, in _save_topk_checkpoint
self._save_monitor_checkpoint(trainer, monitor_candidates)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 651, in _save_monitor_checkpoint
self._update_best_and_save(current, trainer, monitor_candidates)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 702, in _update_best_and_save
self._save_checkpoint(trainer, filepath)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py", line 384, in _save_checkpoint
trainer.save_checkpoint(filepath, self.save_weights_only)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 2467, in save_checkpoint
self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\connectors\checkpoint_connector.py", line 445, in save_checkpoint
self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 418, in save_checkpoint
self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\plugins\io\torch_plugin.py", line 54, in save_checkpoint
atomic_save(checkpoint, path)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\utilities\cloud_io.py", line 67, in atomic_save
torch.save(checkpoint, bytesbuffer)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 381, in save
_legacy_save(obj, opened_file, pickle_module, pickle_protocol)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 225, in exit
self.file_like.flush()
ValueError: I/O operation on closed file.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 604, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 380, in save
return
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 259, in exit
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\caffe2\serialize\inline_container.cc:319] . unexpected pos 4731266176 vs 4731266076

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "main.py", line 756, in
melk()
File "main.py", line 733, in melk
trainer.save_checkpoint(ckpt_path)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 2467, in save_checkpoint
self._checkpoint_connector.save_checkpoint(filepath, weights_only=weights_only, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\trainer\connectors\checkpoint_connector.py", line 445, in save_checkpoint
self.trainer.strategy.save_checkpoint(_checkpoint, filepath, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 418, in save_checkpoint
self.checkpoint_io.save_checkpoint(checkpoint, filepath, storage_options=storage_options)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\plugins\io\torch_plugin.py", line 54, in save_checkpoint
atomic_save(checkpoint, path)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\pytorch_lightning\utilities\cloud_io.py", line 67, in atomic_save
torch.save(checkpoint, bytesbuffer)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 381, in save
_legacy_save(obj, opened_file, pickle_module, pickle_protocol)
File "C:\Users\User01\anaconda3\envs\everydream\lib\site-packages\torch\serialization.py", line 225, in exit
self.file_like.flush()
ValueError: I/O operation on closed file.
```

I'm getting the same thing; it doesn't make it past the first epoch, but a last.ckpt did get saved for me...

Similar specs aside from an Intel CPU and a 3090 Ti, and it was training successfully until a couple of days ago...

You'll need to increase your swap (page) file size. Saving the checkpoint requires a very large system RAM allocation, and a bigger swap file gives Windows the commit space to absorb it. Either set a 64 GB+ swap file or set it to "system managed" on a drive with plenty of free space.
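
For reference, here is a minimal sketch (not part of EveryDream-trainer) of how you could sanity-check before a long run whether RAM plus the page file is roughly big enough; it uses `psutil`, and the 64 GB figure is just the number suggested above:

```python
# Hypothetical pre-flight check, not part of EveryDream-trainer.
# Compares physical RAM + swap/page file against the rough amount of
# commit space the checkpoint save can need.
import psutil

REQUIRED_GB = 64  # rough figure from the advice above; adjust as needed

ram_gb = psutil.virtual_memory().total / 1024**3
swap_gb = psutil.swap_memory().total / 1024**3
total_gb = ram_gb + swap_gb

print(f"RAM: {ram_gb:.1f} GB, swap: {swap_gb:.1f} GB, total: {total_gb:.1f} GB")
if total_gb < REQUIRED_GB:
    print(
        f"Warning: only {total_gb:.1f} GB of commit space available; "
        f"the checkpoint save may hit a MemoryError. Increase the page file "
        f"or set it to 'system managed' on a drive with free space."
    )
```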