[HANDS-ON BUG] Unit 8 Part 2 - getting TypeError: cannot pickle 'TLSBuffer' object
TirumaleshT opened this issue · 8 comments
Describe the bug
When I run the cell below, I get a `TypeError: cannot pickle 'TLSBuffer' object` error.
```python
# Start the training, this should take around 15 minutes
register_vizdoom_components()

# The scenario we train on today is health gathering
# other scenarios include "doom_basic", "doom_two_colors_easy", "doom_dm",
# "doom_dwango5", "doom_my_way_home", "doom_deadly_corridor",
# "doom_defend_the_center", "doom_defend_the_line"
env = "doom_health_gathering_supreme"
cfg = parse_vizdoom_cfg(argv=[f"--env={env}", "--num_workers=8", "--num_envs_per_worker=4", "--train_for_env_steps=4000000"])

status = run_rl(cfg)
```
Error:
```
[2023-07-03 11:15:27,241][00301] register_encoder_factory: <function make_vizdoom_encoder at 0x7f91128f4f70>
[2023-07-03 11:15:27,254][00301] Saved parameter configuration for experiment default_experiment not found!
[2023-07-03 11:15:27,255][00301] Starting experiment from scratch!
[2023-07-03 11:15:27,266][00301] Experiment dir /content/train_dir/default_experiment already exists!
[2023-07-03 11:15:27,267][00301] Resuming existing experiment from /content/train_dir/default_experiment...
[2023-07-03 11:15:27,270][00301] Weights and Biases integration disabled
[2023-07-03 11:15:29,460][00301] Queried available GPUs: 0
[2023-07-03 11:15:29,461][00301] Environment var CUDA_VISIBLE_DEVICES is 0
[2023-07-03 11:15:31,676][00301] Starting experiment with the following configuration:
help=False
algo=APPO
env=doom_health_gathering_supreme
experiment=default_experiment
train_dir=/content/train_dir
restart_behavior=resume
device=gpu
seed=None
num_policies=1
async_rl=True
serial_mode=False
batched_sampling=False
num_batches_to_accumulate=2
worker_num_splits=2
policy_workers_per_policy=1
max_policy_lag=1000
num_workers=8
num_envs_per_worker=4
batch_size=1024
num_batches_per_epoch=1
num_epochs=1
rollout=32
recurrence=32
shuffle_minibatches=False
gamma=0.99
reward_scale=1.0
reward_clip=1000.0
value_bootstrap=False
normalize_returns=True
exploration_loss_coeff=0.001
value_loss_coeff=0.5
kl_loss_coeff=0.0
exploration_loss=symmetric_kl
gae_lambda=0.95
ppo_clip_ratio=0.1
ppo_clip_value=0.2
with_vtrace=False
vtrace_rho=1.0
vtrace_c=1.0
optimizer=adam
adam_eps=1e-06
adam_beta1=0.9
adam_beta2=0.999
max_grad_norm=4.0
learning_rate=0.0001
lr_schedule=constant
lr_schedule_kl_threshold=0.008
obs_subtract_mean=0.0
obs_scale=255.0
normalize_input=True
normalize_input_keys=None
decorrelate_experience_max_seconds=0
decorrelate_envs_on_one_worker=True
actor_worker_gpus=[]
set_workers_cpu_affinity=True
force_envs_single_thread=False
default_niceness=0
log_to_file=True
experiment_summaries_interval=10
flush_summaries_interval=30
stats_avg=100
summaries_use_frameskip=True
heartbeat_interval=20
heartbeat_reporting_interval=600
train_for_env_steps=4000000
train_for_seconds=10000000000
save_every_sec=120
keep_checkpoints=2
load_checkpoint_kind=latest
save_milestones_sec=-1
save_best_every_sec=5
save_best_metric=reward
save_best_after=100000
benchmark=False
encoder_mlp_layers=[512, 512]
encoder_conv_architecture=convnet_simple
encoder_conv_mlp_layers=[512]
use_rnn=True
rnn_size=512
rnn_type=gru
rnn_num_layers=1
decoder_mlp_layers=[]
nonlinearity=elu
policy_initialization=orthogonal
policy_init_gain=1.0
actor_critic_share_weights=True
adaptive_stddev=True
continuous_tanh_scale=0.0
initial_stddev=1.0
use_env_info_cache=False
env_gpu_actions=False
env_gpu_observations=True
env_frameskip=4
env_framestack=1
pixel_format=CHW
use_record_episode_statistics=False
with_wandb=False
wandb_user=None
wandb_project=sample_factory
wandb_group=None
wandb_job_type=SF
wandb_tags=[]
with_pbt=False
pbt_mix_policies_in_one_env=True
pbt_period_env_steps=5000000
pbt_start_mutation=20000000
pbt_replace_fraction=0.3
pbt_mutation_rate=0.15
pbt_replace_reward_gap=0.1
pbt_replace_reward_gap_absolute=1e-06
pbt_optimize_gamma=False
pbt_target_objective=true_objective
pbt_perturb_min=1.1
pbt_perturb_max=1.5
num_agents=-1
num_humans=0
num_bots=-1
start_bot_difficulty=None
timelimit=None
res_w=128
res_h=72
wide_aspect_ratio=False
eval_env_frameskip=1
fps=35
command_line=--env=doom_health_gathering_supreme --num_workers=8 --num_envs_per_worker=4 --train_for_env_steps=4000000
cli_args={'env': 'doom_health_gathering_supreme', 'num_workers': 8, 'num_envs_per_worker': 4, 'train_for_env_steps': 4000000}
git_hash=unknown
git_repo_name=not a git repository
train_script=.usr.local.lib.python3.10.dist-packages.ipykernel_launcher
[2023-07-03 11:15:31,681][00301] Saving configuration to /content/train_dir/default_experiment/config.json...
[2023-07-03 11:15:31,686][00301] Rollout worker 0 uses device cpu
[2023-07-03 11:15:31,687][00301] Rollout worker 1 uses device cpu
[2023-07-03 11:15:31,691][00301] Rollout worker 2 uses device cpu
[2023-07-03 11:15:31,692][00301] Rollout worker 3 uses device cpu
[2023-07-03 11:15:31,696][00301] Rollout worker 4 uses device cpu
[2023-07-03 11:15:31,697][00301] Rollout worker 5 uses device cpu
[2023-07-03 11:15:31,699][00301] Rollout worker 6 uses device cpu
[2023-07-03 11:15:31,700][00301] Rollout worker 7 uses device cpu
[2023-07-03 11:15:31,907][00301] Using GPUs [0] for process 0 (actually maps to GPUs [0])
[2023-07-03 11:15:31,911][00301] InferenceWorker_p0-w0: min num requests: 2
[2023-07-03 11:15:31,954][00301] Starting all processes...
[2023-07-03 11:15:31,959][00301] Starting process learner_proc0
[2023-07-03 11:15:31,962][00301] EvtLoop [Runner_EvtLoop, process=main process 301] unhandled exception in slot='_on_start' connected to emitter=Emitter(object_id='Runner_EvtLoop', signal_name='start'), args=()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 355, in _process_signal
slot_callable(*args)
File "/usr/local/lib/python3.10/dist-packages/sample_factory/algo/runners/runner_parallel.py", line 49, in _on_start
self._start_processes()
File "/usr/local/lib/python3.10/dist-packages/sample_factory/algo/runners/runner_parallel.py", line 56, in _start_processes
p.start()
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 515, in start
self._process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'TLSBuffer' object
[2023-07-03 11:15:31,968][00301] Unhandled exception cannot pickle 'TLSBuffer' object in evt loop Runner_EvtLoop
[2023-07-03 11:15:31,970][00301] Uncaught exception in Runner evt loop
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sample_factory/algo/runners/runner.py", line 770, in run
evt_loop_status = self.event_loop.exec()
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 403, in exec
raise exc
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 399, in exec
while self._loop_iteration():
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 383, in _loop_iteration
self._process_signal(s)
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 358, in _process_signal
raise exc
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 355, in _process_signal
slot_callable(*args)
File "/usr/local/lib/python3.10/dist-packages/sample_factory/algo/runners/runner_parallel.py", line 49, in _on_start
self._start_processes()
File "/usr/local/lib/python3.10/dist-packages/sample_factory/algo/runners/runner_parallel.py", line 56, in _start_processes
p.start()
File "/usr/local/lib/python3.10/dist-packages/signal_slot/signal_slot.py", line 515, in start
self._process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'TLSBuffer' object
[2023-07-03 11:15:31,972][00301] Runner profile tree view:
main_loop: 0.0185
[2023-07-03 11:15:31,973][00301] Collected {}, FPS: 0.0
```
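If it helps with triage: the launch dies inside `reduction.dump`, i.e. while pickling the learner process for the spawn start method, so any unpicklable handle reachable from the runner's state (here a 'TLSBuffer', presumably pulled into the session by some other library in the Colab image) aborts the start. Below is a minimal sketch of the same failure mode, with a thread lock standing in for the unpicklable object:

```python
import multiprocessing as mp
import threading

class Runner:
    def __init__(self):
        # Stand-in for an unpicklable handle; in the notebook it is a
        # 'TLSBuffer' object reachable from the runner's state.
        self.handle = threading.Lock()

    def work(self):
        print("started")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")        # spawn pickles the target before starting it
    runner = Runner()
    p = ctx.Process(target=runner.work)  # bound method -> pickles `runner` too
    p.start()                            # TypeError: cannot pickle '_thread.lock' object
    p.join()
```

Running this raises the same class of error from the same `reduction.dump` frame as the traceback above, just with `'_thread.lock'` instead of `'TLSBuffer'`.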
Material
- Did you use Google Colab? Yes
I'm facing the same issue.
Me too
@simoninithomas - can you please help us resolve this issue?
Also had the same issue.
This isn't a proper fix, but it did let me get the model trained and the results pushed:
- run all cells up to and including the training cell
- get the TLSBuffer error
- add a new cell and re-install Sample Factory: `!pip install sample-factory==2.0.2`
- restart the runtime, don't rerun any cells
- run the cell that defines `register_vizdoom_envs`
- now run the training cell again (sketched cell-by-cell below)
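For reference, the workaround as notebook cells might look roughly like this sketch; `register_vizdoom_components`, `parse_vizdoom_cfg` and `run_rl` are the helpers the notebook already defines, and the cell split (install, restart, then train) is the important part:

```python
# --- new cell, added after hitting the error: reinstall the pinned version,
# --- then restart the runtime WITHOUT rerunning earlier cells
!pip install sample-factory==2.0.2

# --- after the restart: rerun only the cell defining the ViZDoom helpers,
# --- then the training cell from the top of this issue
register_vizdoom_components()

env = "doom_health_gathering_supreme"
cfg = parse_vizdoom_cfg(argv=[f"--env={env}", "--num_workers=8",
                              "--num_envs_per_worker=4",
                              "--train_for_env_steps=4000000"])
status = run_rl(cfg)
```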
Thanks @joeADSP for the workaround. I was able to train the model and push the results.
@joeADSP Thanks for the solution :) I realized that if you just move the `!pip install sample-factory==2.0.2` into its own cell, you won't need any of the steps you listed. Still, thanks a lot, you inspired me.
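(That is, roughly this layout — a sketch, assuming the install currently shares a cell with the rest of the setup:)

```python
# Cell 1: nothing but the install
!pip install sample-factory==2.0.2

# Cell 2 (separate cell): the notebook's existing setup, unchanged —
# imports and the register_vizdoom_components / parse_vizdoom_cfg definitions
```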
Hey there, thanks for the advice, we updated the Colab 🤗