huggingface/deep-rl-class

[HANDS-ON BUG] Unit8 part2 - getting TypeError: cannot pickle 'TLSBuffer' object

TirumaleshT opened this issue · 8 comments

Describe the bug

When I run the cell below, I get a `TypeError: cannot pickle 'TLSBuffer' object` error:

```python
# Start the training, this should take around 15 minutes
register_vizdoom_components()

# The scenario we train on today is health gathering
# other scenarios include "doom_basic", "doom_two_colors_easy", "doom_dm", "doom_dwango5",
# "doom_my_way_home", "doom_deadly_corridor", "doom_defend_the_center", "doom_defend_the_line"
env = "doom_health_gathering_supreme"
cfg = parse_vizdoom_cfg(argv=[f"--env={env}", "--num_workers=8", "--num_envs_per_worker=4", "--train_for_env_steps=4000000"])

status = run_rl(cfg)
```

Error:

```
[2023-07-03 11:15:27,241][00301] register_encoder_factory: <function make_vizdoom_encoder at 0x7f91128f4f70>
[2023-07-03 11:15:27,254][00301] Saved parameter configuration for experiment default_experiment not found!
[2023-07-03 11:15:27,255][00301] Starting experiment from scratch!
[2023-07-03 11:15:27,266][00301] Experiment dir /content/train_dir/default_experiment already exists!
[2023-07-03 11:15:27,267][00301] Resuming existing experiment from /content/train_dir/default_experiment...
[2023-07-03 11:15:27,270][00301] Weights and Biases integration disabled
[2023-07-03 11:15:29,460][00301] Queried available GPUs: 0

[2023-07-03 11:15:29,461][00301] Environment var CUDA_VISIBLE_DEVICES is 0

[2023-07-03 11:15:31,676][00301] Starting experiment with the following configuration:
help=False
algo=APPO
env=doom_health_gathering_supreme
experiment=default_experiment
train_dir=/content/train_dir
restart_behavior=resume
device=gpu
seed=None
num_policies=1
async_rl=True
serial_mode=False
batched_sampling=False
num_batches_to_accumulate=2
worker_num_splits=2
policy_workers_per_policy=1
max_policy_lag=1000
num_workers=8
num_envs_per_worker=4
batch_size=1024
num_batches_per_epoch=1
num_epochs=1
rollout=32
recurrence=32
shuffle_minibatches=False
gamma=0.99
reward_scale=1.0
reward_clip=1000.0
value_bootstrap=False
normalize_returns=True
exploration_loss_coeff=0.001
value_loss_coeff=0.5
kl_loss_coeff=0.0
exploration_loss=symmetric_kl
gae_lambda=0.95
ppo_clip_ratio=0.1
ppo_clip_value=0.2
with_vtrace=False
vtrace_rho=1.0
vtrace_c=1.0
optimizer=adam
adam_eps=1e-06
adam_beta1=0.9
adam_beta2=0.999
max_grad_norm=4.0
learning_rate=0.0001
lr_schedule=constant
lr_schedule_kl_threshold=0.008
obs_subtract_mean=0.0
obs_scale=255.0
normalize_input=True
normalize_input_keys=None
decorrelate_experience_max_seconds=0
decorrelate_envs_on_one_worker=True
actor_worker_gpus=[]
set_workers_cpu_affinity=True
force_envs_single_thread=False
default_niceness=0
log_to_file=True
experiment_summaries_interval=10
flush_summaries_interval=30
stats_avg=100
summaries_use_frameskip=True
heartbeat_interval=20
heartbeat_reporting_interval=600
train_for_env_steps=4000000
train_for_seconds=10000000000
save_every_sec=120
keep_checkpoints=2
load_checkpoint_kind=latest
save_milestones_sec=-1
save_best_every_sec=5
save_best_metric=reward
save_best_after=100000
benchmark=False
encoder_mlp_layers=[512, 512]
encoder_conv_architecture=convnet_simple
encoder_conv_mlp_layers=[512]
use_rnn=True
rnn_size=512
rnn_type=gru
rnn_num_layers=1
decoder_mlp_layers=[]
nonlinearity=elu
policy_initialization=orthogonal
policy_init_gain=1.0
actor_critic_share_weights=True
adaptive_stddev=True
continuous_tanh_scale=0.0
initial_stddev=1.0
use_env_info_cache=False
env_gpu_actions=False
env_gpu_observations=True
env_frameskip=4
env_framestack=1
pixel_format=CHW
use_record_episode_statistics=False
with_wandb=False
wandb_user=None
wandb_project=sample_factory
wandb_group=None
wandb_job_type=SF
wandb_tags=[]
with_pbt=False
pbt_mix_policies_in_one_env=True
pbt_period_env_steps=5000000
pbt_start_mutation=20000000
pbt_replace_fraction=0.3
pbt_mutation_rate=0.15
pbt_replace_reward_gap=0.1
pbt_replace_reward_gap_absolute=1e-06
pbt_optimize_gamma=False
pbt_target_objective=true_objective
pbt_perturb_min=1.1
pbt_perturb_max=1.5
num_agents=-1
num_humans=0
num_bots=-1
start_bot_difficulty=None
timelimit=None
res_w=128
res_h=72
wide_aspect_ratio=False
eval_env_frameskip=1
fps=35
command_line=--env=doom_health_gathering_supreme --num_workers=8 --num_envs_per_worker=4 --train_for_env_steps=4000000
cli_args={'env': 'doom_health_gathering_supreme', 'num_workers': 8, 'num_envs_per_worker': 4, 'train_for_env_steps': 4000000}
git_hash=unknown
git_repo_name=not a git repository
train_script=.usr.local.lib.python3.10.dist-packages.ipykernel_launcher
[2023-07-03 11:15:31,681][00301] Saving configuration to /content/train_dir/default_experiment/config.json...
[2023-07-03 11:15:31,686][00301] Rollout worker 0 uses device cpu
[2023-07-03 11:15:31,687][00301] Rollout worker 1 uses device cpu
[2023-07-03 11:15:31,691][00301] Rollout worker 2 uses device cpu
[2023-07-03 11:15:31,692][00301] Rollout worker 3 uses device cpu
[2023-07-03 11:15:31,696][00301] Rollout worker 4 uses device cpu
[2023-07-03 11:15:31,697][00301] Rollout worker 5 uses device cpu
[2023-07-03 11:15:31,699][00301] Rollout worker 6 uses device cpu
[2023-07-03 11:15:31,700][00301] Rollout worker 7 uses device cpu
[2023-07-03 11:15:31,907][00301] Using GPUs [0] for process 0 (actually maps to GPUs [0])
[2023-07-03 11:15:31,911][00301] InferenceWorker_p0-w0: min num requests: 2
[2023-07-03 11:15:31,954][00301] Starting all processes...
[2023-07-03 11:15:31,959][00301] Starting process learner_proc0
[2023-07-03 11:15:31,962][00301] EvtLoop [Runner_EvtLoop, process=main process 301] unhandled exception in slot='_on_start' connected to emitter=Emitter(object_id='Runner_EvtLoop', signal_name='start'), args=()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 355, in _process_signal
slot_callable(*args)
File "/usr/local/lib/python3.10/dist-packages/sample_factory/algo/runners/runner_parallel.py", line 49, in _on_start
self._start_processes()
File "/usr/local/lib/python3.10/dist-packages/sample_factory/algo/runners/runner_parallel.py", line 56, in _start_processes
p.start()
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 515, in start
self._process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'TLSBuffer' object
[2023-07-03 11:15:31,968][00301] Unhandled exception cannot pickle 'TLSBuffer' object in evt loop Runner_EvtLoop
[2023-07-03 11:15:31,970][00301] Uncaught exception in Runner evt loop
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sample_factory/algo/runners/runner.py", line 770, in run
evt_loop_status = self.event_loop.exec()
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 403, in exec
raise exc
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 399, in exec
while self._loop_iteration():
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 383, in _loop_iteration
self._process_signal(s)
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 358, in _process_signal
raise exc
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 355, in _process_signal
slot_callable(*args)
File "/usr/local/lib/python3.10/dist-packages/sample_factory/algo/runners/runner_parallel.py", line 49, in _on_start
self._start_processes()
File "/usr/local/lib/python3.10/dist-packages/sample_factory/algo/runners/runner_parallel.py", line 56, in _start_processes
p.start()
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 515, in start
self._process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'TLSBuffer' object
[2023-07-03 11:15:31,972][00301] Runner profile tree view:
main_loop: 0.0185
[2023-07-03 11:15:31,973][00301] Collected {}, FPS: 0.0
```
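Why this blows up at `p.start()`: with the `spawn` start method (visible above via `popen_spawn_posix`), Python has to pickle the entire process object, the target and everything reachable from it, before launching the child, and that fails as soon as something unpicklable (here, apparently a TLS buffer held by the Colab kernel's streams) is reachable. Below is a minimal, self-contained sketch of the same failure mode, using a thread lock as the unpicklable stand-in; this is illustrative only, not Sample Factory code:

```python
import multiprocessing as mp
import threading


class ToyRunner:
    """Stand-in for a runner object that drags unpicklable state into a child process."""

    def __init__(self):
        # A thread lock is a classic unpicklable object; it plays the role of the
        # TLS/SSL buffer that ends up reachable from the real runner on Colab.
        self._lock = threading.Lock()

    def work(self):
        print("child process running")


if __name__ == "__main__":
    # The traceback above goes through popen_spawn_posix, i.e. the 'spawn' start method.
    mp.set_start_method("spawn", force=True)

    runner = ToyRunner()
    # With spawn, Process pickles its target; pickling the bound method pulls in
    # `runner` and its lock, so start() raises
    # "TypeError: cannot pickle '_thread.lock' object" -- same failure mode as above.
    p = mp.Process(target=runner.work)
    p.start()
    p.join()
```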

Material

  • Did you use Google Colab? Yes


I'm facing the same issue.

Me too

@simoninithomas, can you please help us resolve this issue?

Also had the same issue.

This isn't a proper fix, but it did let me get the model trained and results pushed.

  • run all cells up to and including the training cell
  • hit the TLSBuffer error
  • add a new cell and re-install Sample Factory: `!pip install sample-factory==2.0.2` (sketched below)
  • restart the runtime, but don't rerun any cells
  • run the cell that defines register_vizdoom_envs
  • now run the training cell again
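A sketch of the recovery cell itself, assuming the helper names from the Unit 8 part 2 notebook (`register_vizdoom_envs` and friends); only the pinned version comes from this thread:

```python
# New cell, added right below the failing training cell and run once:
!pip install sample-factory==2.0.2

# Then, in Colab: Runtime > Restart runtime (do NOT rerun the earlier cells).
# After the restart, rerun only the cell that defines register_vizdoom_envs
# (and its sibling helpers), then rerun the training cell unchanged.
```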

Thanks @joeADSP for the workaround. I was able to train the model and push the results.

wiss84 commented

@joeADSP Thanks for the solution :) I realized that if you just move the `!pip install sample-factory==2.0.2` to the next cell, you don't need to do any of the steps you mentioned. Still, thanks a lot, you inspired me.
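A rough sketch of that layout, for anyone following along (the contents of the notebook's original install cell aren't shown in this thread, so treat the first cell as a placeholder; only the pinned install line comes from the workaround above):

```python
# --- existing install cell: keep whatever the notebook already installs here ---
# !pip install <other notebook dependencies>   # placeholder, not a real command

# --- new, separate cell run immediately after it ---
!pip install sample-factory==2.0.2
```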

Ah great @wiss84 that's much simpler!

Hey there, thanks for the advice, we've updated the colab 🤗