arsedler9/lfads-torch

Error while running 2_run_pbt.py in multisession tutorial.

Closed this issue · 4 comments

Hi Andrew,

I am currently running the multisession tutorial in a Jupyter notebook on a 64-bit Windows 11 PC with lfads-torch installed.
Everything worked fine through 1_data_prep.ipynb, but running 2_run_pbt.py produced the following error.
I would appreciate any advice on how to deal with it.

2024-01-02 19:01:01,090 ERROR serialization.py:371 -- Failed to unpickle serialized exception
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 46, in from_ray_exception
return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required positional argument: 'missing_cfg_file'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\serialization.py", line 369, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\serialization.py", line 275, in _deserialize_object
return RayError.from_bytes(obj)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 40, in from_bytes
return RayError.from_ray_exception(ray_exception)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 49, in from_ray_exception
raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
2024-01-02 19:01:01,092 ERROR trial_runner.py:993 -- Trial run_model_c9f12_00001: Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\execution\ray_trial_executor.py", line 1050, in get_next_executor_event
future_result = ray.get(ready_future)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\worker.py", line 2291, in get
raise value
ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 46, in from_ray_exception
return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required positional argument: 'missing_cfg_file'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\serialization.py", line 369, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\serialization.py", line 275, in _deserialize_object
return RayError.from_bytes(obj)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 40, in from_bytes
return RayError.from_ray_exception(ray_exception)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 49, in from_ray_exception
raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
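For context on the mechanics of this failure: Ray ships exceptions between worker and driver processes by pickling them, and an exception class whose `__init__` requires an argument that is not forwarded to `Exception.__init__` cannot be unpickled, which matches the `missing_cfg_file` signature in the traceback. A toy stand-in (not the actual library class) reproduces the behavior:

```python
import pickle

class ToyMissingConfigError(Exception):
    """Toy stand-in for an exception whose __init__ requires an extra
    argument that is not forwarded to Exception.__init__."""
    def __init__(self, message, missing_cfg_file):
        super().__init__(message)          # so self.args == (message,)
        self.missing_cfg_file = missing_cfg_file

# Pickling stores only self.args; unpickling calls cls(*args), which
# comes up one required argument short.
payload = pickle.dumps(ToyMissingConfigError("config not found", "pbt.yaml"))
try:
    pickle.loads(payload)
except TypeError as err:
    print(err)  # mentions the missing 'missing_cfg_file' argument
```

When the exception class pickles cleanly, the real remote error reaches the driver instead of this secondary "Failed to unpickle serialized exception" failure.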

Thanks.

I updated the Hydra version and the above error was resolved.
However, I now get another error. It appears that the validation dataset is empty. Which parameter should I adjust to fix this?

2024-01-08 01:12:19,959 ERROR trial_runner.py:993 -- Trial run_model_7ac64_00001: Error processing event.
ray.exceptions.RayTaskError(ZeroDivisionError): ray::ImplicitFunc.train() (pid=3740, ip=127.0.0.1, repr=run_model)
File "python\ray_raylet.pyx", line 859, in ray._raylet.execute_task
File "python\ray_raylet.pyx", line 863, in ray._raylet.execute_task
File "python\ray_raylet.pyx", line 810, in ray._raylet.execute_task.function_executor
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\function_manager.py", line 674, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\trainable.py", line 355, in train
raise skipped from exception_cause(skipped)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\function_trainable.py", line 325, in entrypoint
return self._trainable_func(
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\function_trainable.py", line 651, in _trainable_func
output = fn()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\util.py", line 365, in inner
trainable(config, **fn_kwargs)
File "c:\windows\system32\lfads-torch\lfads_torch\run_model.py", line 78, in run_model
trainer.fit(
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 724, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1237, in _run
results = self._run_stage()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1324, in _run_stage
return self._run_train()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1346, in _run_train
self._run_sanity_check()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1407, in _run_sanity_check
val_loop._reload_evaluation_dataloaders()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 239, in _reload_evaluation_dataloaders
self.trainer.reset_val_dataloader()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1959, in reset_val_dataloader
self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py", line 372, in _reset_eval_dataloader
dataloaders = self._request_dataloader(mode, model=model)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py", line 451, in _request_dataloader
dataloader = source.dataloader()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py", line 527, in dataloader
return method()
File "c:\windows\system32\lfads-torch\lfads_torch\datamodules.py", line 187, in val_dataloader
batch_size = int(self.hparams.batch_size / len(self.valid_ds))
ZeroDivisionError: division by zero
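The failing line in datamodules.py divides the configured batch size by the number of validation sessions, so zero matched sessions produces exactly this error. A minimal illustration with hypothetical values:

```python
# Hypothetical values: the datamodule computes a per-session batch size
# by dividing by the number of validation sessions, so an empty
# validation dataset triggers the ZeroDivisionError seen above.
batch_size = 256  # hypothetical configured batch size
valid_ds = []     # no validation sessions were found

try:
    per_session_batch = int(batch_size / len(valid_ds))
except ZeroDivisionError:
    print("valid_ds is empty -- check where the valid split was written")
```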

Thanks.

Hello! Glad you were able to resolve the first issue. Just to be clear, did you end up using a different environment from what was specified in the install instructions / requirements.txt?

For the second error, len(self.valid_ds) should be the number of sessions found, so the datamodule may not be finding your files. Could you check the paths to your data files and confirm that datamodule.datafile_pattern matches them?
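One quick way to check is to glob for the files the same way the datamodule would. A hedged sketch — the directory and pattern below are placeholders, so substitute the values from your own config:

```python
from pathlib import Path

# Placeholder values -- substitute your actual data directory and the
# datafile_pattern from your datamodule config.
data_dir = Path("datasets/multisession")
datafile_pattern = "*.h5"

# List every file the pattern matches; zero matches means the datamodule
# would see an empty dataset.
matches = sorted(data_dir.glob(datafile_pattern))
print(f"{len(matches)} session file(s) matched {datafile_pattern!r} in {data_dir}")
for path in matches:
    print(" ", path)
```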

The second error is now resolved as well.
The cause was that the directories of the generated train and valid datasets were specified incorrectly.

As for what I ultimately did:

  1. Updated Hydra to the latest version.
  2. Created a wandb account and added a line to log in.
  3. Set the wandb mode to offline (because online logging timed out).
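For reference, the wandb login and offline-mode steps above can also be done with environment variables before the run starts. A minimal sketch — the API-key value is a placeholder; running `wandb login` once on the command line stores the key instead, if you prefer:

```python
import os

# Set these before wandb is imported by the training script.
os.environ["WANDB_API_KEY"] = "<your-api-key>"  # placeholder value
os.environ["WANDB_MODE"] = "offline"  # log locally; upload later with `wandb sync`
```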

The rouse_multisession run for the tutorial is now in progress.
If I run into another error, I would appreciate your comments.

Thanks a lot.

Sure thing, feel free to reopen if anything else comes up!