arsedler9/lfads-torch

Error while running 2_run_pbt.py in multisession tutorial.

Closed this issue · 4 comments

Hi Andrew,

I am currently running the multisession tutorial in a Jupyter notebook on a 64-bit Windows 11 PC with lfads-torch installed.
Everything worked fine through 1_data_prep.ipynb, but running 2_run_pbt.py produced the following error.
I would appreciate any advice on how to deal with it.

2024-01-02 19:01:01,090 ERROR serialization.py:371 -- Failed to unpickle serialized exception
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 46, in from_ray_exception
return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required positional argument: 'missing_cfg_file'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\serialization.py", line 369, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\serialization.py", line 275, in _deserialize_object
return RayError.from_bytes(obj)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 40, in from_bytes
return RayError.from_ray_exception(ray_exception)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 49, in from_ray_exception
raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
2024-01-02 19:01:01,092 ERROR trial_runner.py:993 -- Trial run_model_c9f12_00001: Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\execution\ray_trial_executor.py", line 1050, in get_next_executor_event
future_result = ray.get(ready_future)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\worker.py", line 2291, in get
raise value
ray.exceptions.RaySystemError: System error: Failed to unpickle serialized exception
traceback: Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 46, in from_ray_exception
return pickle.loads(ray_exception.serialized_exception)
TypeError: __init__() missing 1 required positional argument: 'missing_cfg_file'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\serialization.py", line 369, in deserialize_objects
obj = self._deserialize_object(data, metadata, object_ref)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\serialization.py", line 275, in _deserialize_object
return RayError.from_bytes(obj)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 40, in from_bytes
return RayError.from_ray_exception(ray_exception)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\exceptions.py", line 49, in from_ray_exception
raise RuntimeError(msg) from e
RuntimeError: Failed to unpickle serialized exception
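For context on the mechanics of this failure: Ray ships exceptions between worker and driver processes by pickling them, and an exception class whose `__init__` requires an argument that is not forwarded to `Exception.__init__` cannot be unpickled, which matches the `missing_cfg_file` signature in the traceback. A toy stand-in (not the actual library class) reproduces the behavior:

```python
import pickle

class ToyMissingConfigError(Exception):
    """Toy stand-in for an exception whose __init__ requires an extra
    argument that is not forwarded to Exception.__init__."""
    def __init__(self, message, missing_cfg_file):
        super().__init__(message)          # so self.args == (message,)
        self.missing_cfg_file = missing_cfg_file

# Pickling stores only self.args; unpickling calls cls(*args), which
# comes up one required argument short.
payload = pickle.dumps(ToyMissingConfigError("config not found", "pbt.yaml"))
try:
    pickle.loads(payload)
except TypeError as err:
    print(err)  # mentions the missing 'missing_cfg_file' argument
```

When the exception class pickles cleanly, the real remote error reaches the driver instead of this secondary "Failed to unpickle serialized exception" failure.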

Thanks.

I updated the Hydra version and the above error was resolved.
However, I now get another error. It appears that the validation dataset is empty. Which parameter should I adjust to fix this?

2024-01-08 01:12:19,959 ERROR trial_runner.py:993 -- Trial run_model_7ac64_00001: Error processing event.
ray.exceptions.RayTaskError(ZeroDivisionError): ray::ImplicitFunc.train() (pid=3740, ip=127.0.0.1, repr=run_model)
File "python\ray_raylet.pyx", line 859, in ray._raylet.execute_task
File "python\ray_raylet.pyx", line 863, in ray._raylet.execute_task
File "python\ray_raylet.pyx", line 810, in ray._raylet.execute_task.function_executor
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray_private\function_manager.py", line 674, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\trainable.py", line 355, in train
raise skipped from exception_cause(skipped)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\function_trainable.py", line 325, in entrypoint
return self._trainable_func(
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\function_trainable.py", line 651, in _trainable_func
output = fn()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\ray\tune\trainable\util.py", line 365, in inner
trainable(config, **fn_kwargs)
File "c:\windows\system32\lfads-torch\lfads_torch\run_model.py", line 78, in run_model
trainer.fit(
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 724, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1237, in _run
results = self._run_stage()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1324, in _run_stage
return self._run_train()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1346, in _run_train
self._run_sanity_check()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1407, in _run_sanity_check
val_loop._reload_evaluation_dataloaders()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 239, in _reload_evaluation_dataloaders
self.trainer.reset_val_dataloader()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1959, in reset_val_dataloader
self.num_val_batches, self.val_dataloaders = self._data_connector._reset_eval_dataloader(
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py", line 372, in _reset_eval_dataloader
dataloaders = self._request_dataloader(mode, model=model)
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py", line 451, in _request_dataloader
dataloader = source.dataloader()
File "C:\ProgramData\Anaconda3\envs\lfads-torch\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py", line 527, in dataloader
return method()
File "c:\windows\system32\lfads-torch\lfads_torch\datamodules.py", line 187, in val_dataloader
batch_size = int(self.hparams.batch_size / len(self.valid_ds))
ZeroDivisionError: division by zero
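The failing line in datamodules.py divides the configured batch size by the number of validation sessions, so zero matched sessions produces exactly this error. A minimal illustration with hypothetical values:

```python
# Hypothetical values: the datamodule computes a per-session batch size
# by dividing by the number of validation sessions, so an empty
# validation dataset triggers the ZeroDivisionError seen above.
batch_size = 256  # hypothetical configured batch size
valid_ds = []     # no validation sessions were found

try:
    per_session_batch = int(batch_size / len(valid_ds))
except ZeroDivisionError:
    print("valid_ds is empty -- check where the valid split was written")
```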

Thanks.

Hello! Glad you were able to resolve the first issue. Just to be clear, did you end up using a different environment from what was specified in the install instructions / requirements.txt?

For the second error, len(self.valid_ds) should be the number of sessions found, so the datamodule may not be finding your files. Could you check the paths to your data files and confirm that datamodule.datafile_pattern matches them?
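One quick way to check is to glob for the files the same way the datamodule would. A hedged sketch — the directory and pattern below are placeholders, so substitute the values from your own config:

```python
from pathlib import Path

# Placeholder values -- substitute your actual data directory and the
# datafile_pattern from your datamodule config.
data_dir = Path("datasets/multisession")
datafile_pattern = "*.h5"

# List every file the pattern matches; zero matches means the datamodule
# would see an empty dataset.
matches = sorted(data_dir.glob(datafile_pattern))
print(f"{len(matches)} session file(s) matched {datafile_pattern!r} in {data_dir}")
for path in matches:
    print(" ", path)
```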

The second error is now resolved as well.
The cause was that the directories of the generated train and valid datasets were specified incorrectly.

As for what I ultimately did:

  1. Updated Hydra to the latest version.
  2. Created a wandb account and added a line to log in.
  3. Set the wandb mode to offline (because online logging timed out).
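For reference, the wandb login and offline-mode steps above can also be done with environment variables before the run starts. A minimal sketch — the API-key value is a placeholder; running `wandb login` once on the command line stores the key instead, if you prefer:

```python
import os

# Set these before wandb is imported by the training script.
os.environ["WANDB_API_KEY"] = "<your-api-key>"  # placeholder value
os.environ["WANDB_MODE"] = "offline"  # log locally; upload later with `wandb sync`
```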

The rouse_multisession run for the tutorial is now in progress.
If I run into another error, I would appreciate your comments.

Thanks a lot.

Sure thing, feel free to reopen if anything else comes up!