rail-berkeley/softlearning

`dm_control` `cheetah` `run` training stops suddenly

letusfly85 opened this issue · 3 comments

Hi, I'm trying to run the dm_control tasks walker walk, walker run, and cheetah run.

The two walker tasks (walker walk and walker run) work fine; however, cheetah run fails during training, as shown below.

Failure message

Number of errored trials: 1
+--------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name               |   # failures | error file                                                                                                                                                                                                                                              |
|--------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| id=43fbb_00000-seed=8373 |            4 | /home/acc12468eh/ray_results/dm_control/cheetah/run/2020-09-14T19-51-36-sl-sac/id=43fbb_00000-seed=8373_0_hidden_layer_sizes=(256, 256),preprocessors=({'pixels': {'class_name': 'convnet_preprocessor', 'config'_2020-09-14_19-51-38hsvhe5yt/error.txt |
+--------------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "/home/acc12468eh/miniconda3/envs/softlearning/bin/softlearning", line 11, in <module>
    load_entry_point('softlearning', 'console_scripts', 'softlearning')()
  File "/home/acc12468eh/softlearning/softlearning/scripts/console_scripts.py", line 207, in main
    return cli()
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/acc12468eh/softlearning/softlearning/scripts/console_scripts.py", line 73, in run_example_local_cmd
    return run_example_local(example_module_name, example_argv)
  File "/home/acc12468eh/softlearning/examples/instrument.py", line 244, in run_example_local
    reuse_actors=True)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/tune/tune.py", line 356, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [id=43fbb_00000-seed=8373])

I then ran cat on the error.txt file and got the following.

(base) [acc12468eh@es2 ~]$ cat /home/acc12468eh/ray_results/dm_control/cheetah/run/2020-09-14T19-51-36-sl-sac/id=43fbb_00000-seed=8373_0_hidden_layer_sizes=(256, 256),preprocessors=({'pixels': {'class_name': 'convnet_preprocessor', 'config'_2020-09-14_19-51-38hsvhe5yt/error.txt

Content of error.txt

-bash: unexpected token `('

Thank you.

I think what you're actually seeing is not the contents of error.txt but rather an error from bash. Can you wrap the cat argument in quotes? I.e.:

cat "/home/acc12468eh/ray_results/dm_control/cheetah/run/2020-09-14T19-51-36-sl-sac/id=43fbb_00000-seed=8373_0_hidden_layer_sizes=(256, 256),preprocessors=({'pixels': {'class_name': 'convnet_preprocessor', 'config'_2020-09-14_19-51-38hsvhe5yt/error.txt"
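For reference, here is a minimal sketch of how the quoting could be generated rather than typed by hand, assuming Python is available on the machine. It uses the standard-library `shlex.quote`; the helper name `cat_command` is made up for illustration and is not part of softlearning or Ray. The path is the one copied from the failure table above.

```python
import shlex

# Path copied from the Tune failure table; it contains (, ), {, ', commas and spaces,
# all of which bash interprets unless the argument is quoted or escaped.
error_path = (
    "/home/acc12468eh/ray_results/dm_control/cheetah/run/"
    "2020-09-14T19-51-36-sl-sac/"
    "id=43fbb_00000-seed=8373_0_hidden_layer_sizes=(256, 256),"
    "preprocessors=({'pixels': {'class_name': 'convnet_preprocessor', "
    "'config'_2020-09-14_19-51-38hsvhe5yt/error.txt"
)


def cat_command(path: str) -> str:
    """Build a `cat` command line whose argument is safe to paste into a POSIX shell.

    Hypothetical helper for illustration only.
    """
    return "cat " + shlex.quote(path)


command = cat_command(error_path)
print(command)

# Round-trip check: splitting the command the way a POSIX shell would
# gives back exactly one argument, the original path.
assert shlex.split(command) == ["cat", error_path]
```

Since the path itself contains single quotes, `shlex.quote` has to splice them in with extra quoting, so the printed command looks noisier than the double-quoted version above, but both should reach `cat` as the same single argument.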

@hartikainen

Oh, sorry. Here is the actual content of error.txt.

(base) [acc12468eh@es2 ~]$ cat /home/acc12468eh/ray_results/dm_control/cheetah/run/2020-09-14T19-51-36-sl-sac/id\=43fbb_00000-seed\=8373_0_hidden_layer_sizes\=\(256\,\ 256\)\,preprocessors\=\(\{\'pixels\'\:\ \{\'class_name\'\:\ \'convnet_preprocessor\'\,\ \'config\'_2020-09-14_19-51-38hsvhe5yt/error.txt
Failure # 1 (occurred at 2020-09-14_19-51-58)
Traceback (most recent call last):
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/worker.py", line 1540, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Failure # 2 (occurred at 2020-09-14_19-52-06)
Traceback (most recent call last):
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/worker.py", line 1540, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Failure # 3 (occurred at 2020-09-14_19-52-14)
Traceback (most recent call last):
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/worker.py", line 1540, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Failure # 4 (occurred at 2020-09-14_19-52-23)
Traceback (most recent call last):
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/worker.py", line 1540, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
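All four failures in the log above look identical, which can be confirmed with a small script. The following is only a sketch, assuming error.txt keeps the `Failure # N (occurred at ...)` layout shown here; `summarize_failures` is a hypothetical helper, not something provided by Ray or softlearning, and the sample text is an abridged excerpt of the log above.

```python
import re
from collections import Counter

# Matches section headers such as "Failure # 1 (occurred at 2020-09-14_19-51-58)".
FAILURE_HEADER = re.compile(r"^Failure # (\d+) \(occurred at ([^)]+)\)$")


def summarize_failures(text):
    """Return one dict per failure section: its index, timestamp, and final exception line."""
    failures = []
    current = None
    for line in text.splitlines():
        match = FAILURE_HEADER.match(line.strip())
        if match:
            current = {"index": int(match.group(1)), "time": match.group(2), "error": None}
            failures.append(current)
        elif (
            current is not None
            and line
            and not line[0].isspace()
            and not line.startswith("Traceback")
        ):
            # Unindented, non-header line inside a section: the exception summary.
            current["error"] = line.strip()
    return failures


# Abridged excerpt of the error.txt content (first two failures only).
SAMPLE = """\
Failure # 1 (occurred at 2020-09-14_19-51-58)
Traceback (most recent call last):
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/worker.py", line 1540, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

Failure # 2 (occurred at 2020-09-14_19-52-06)
Traceback (most recent call last):
  File "/home/acc12468eh/miniconda3/envs/softlearning/lib/python3.7/site-packages/ray/worker.py", line 1540, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
"""

failures = summarize_failures(SAMPLE)
# Count exception types by taking the text before the first colon.
counts = Counter(f["error"].split(":")[0] for f in failures if f["error"])

for failure in failures:
    print(failure["index"], failure["time"], failure["error"])
print(dict(counts))
```

Note that every section ends in the same `RayActorError`, raised from `ray.get` on the driver side, so this traceback on its own does not show why the worker process exited.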

Yes, walker run is okay, but not cheetah run.