Crashing with seccomp-bpf failure in syscall 0230
norbertkeresztes opened this issue · 2 comments
Describe the bug
I'm going through the guides on the website. Running dqn_cartpole in both dev and train mode results in slow runs and high resource usage, and the run ends with this error message repeated several times:
../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0230
To Reproduce
- OS and environment: Ubuntu 20.04.1
- SLM Lab git SHA (run git rev-parse HEAD to get it): faca82c
- spec file used: slm_lab/spec/demo.json
Additional context
Running on an AMD TR 3990X; all 128 logical CPU cores are above 90% utilization during the run (only checked for train, not dev). These are the training metrics logged during one of the runs:
[2021-07-24 10:34:11,503 PID:27435 INFO logger.py info] Running RL loop for trial 0 session 1
[2021-07-24 10:34:11,506 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 0 t: 0 wall_t: 0 opt_step: 0 frame: 0 fps: 0 total_reward: nan total_reward_ma: nan loss: nan lr: 0.02 explore_var: 1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 10:36:42,739 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 28 t: 4 wall_t: 151 opt_step: 18720 frame: 500 fps: 3.31126 total_reward: 9 total_reward_ma: 9 loss: 0.0168248 lr: 0.02 explore_var: 0.55 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 10:39:24,086 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 54 t: 1 wall_t: 312 opt_step: 38720 frame: 1000 fps: 3.20513 total_reward: 10 total_reward_ma: 9.5 loss: 0.0740117 lr: 0.02 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 10:39:24,095 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 9.5 strength: -12.36 max_strength: -11.86 final_strength: -11.86 sample_efficiency: 0.00152023 training_efficiency: 4.01807e-05 stability: 1
[2021-07-24 10:42:06,057 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 74 t: 72 wall_t: 474 opt_step: 58720 frame: 1500 fps: 3.16456 total_reward: 21 total_reward_ma: 13.3333 loss: 0.299671 lr: 0.018 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 10:42:06,069 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 13.3333 strength: -8.52667 max_strength: -0.860001 final_strength: -0.860001 sample_efficiency: 0.00149153 training_efficiency: 3.94024e-05 stability: 1
[2021-07-24 10:44:47,405 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 84 t: 68 wall_t: 635 opt_step: 78720 frame: 2000 fps: 3.14961 total_reward: 69 total_reward_ma: 27.25 loss: 0.231153 lr: 0.018 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 10:44:47,413 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 27.25 strength: 5.39 max_strength: 47.14 final_strength: 47.14 sample_efficiency: -0.000676407 training_efficiency: -1.89741e-05 stability: 1
[2021-07-24 10:47:30,196 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 90 t: 35 wall_t: 798 opt_step: 98720 frame: 2500 fps: 3.13283 total_reward: 138 total_reward_ma: 49.4 loss: 0.102941 lr: 0.0162 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 10:47:30,216 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 49.4 strength: 27.54 max_strength: 116.14 final_strength: 116.14 sample_efficiency: 0.000231464 training_efficiency: 5.57282e-06 stability: 1
[2021-07-24 10:50:12,254 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 95 t: 69 wall_t: 960 opt_step: 118720 frame: 3000 fps: 3.125 total_reward: 175 total_reward_ma: 70.3333 loss: 0.563105 lr: 0.0162 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 10:50:12,266 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 70.3333 strength: 48.4733 max_strength: 153.14 final_strength: 153.14 sample_efficiency: 0.000285103 training_efficiency: 7.07366e-06 stability: 1
[2021-07-24 10:52:52,672 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 98 t: 21 wall_t: 1121 opt_step: 138720 frame: 3500 fps: 3.12221 total_reward: 178 total_reward_ma: 85.7143 loss: 0.468518 lr: 0.01458 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 10:52:52,680 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 85.7143 strength: 63.8543 max_strength: 156.14 final_strength: 156.14 sample_efficiency: 0.000285316 training_efficiency: 7.12085e-06 stability: 1
[2021-07-24 10:55:34,404 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 100 t: 121 wall_t: 1282 opt_step: 158720 frame: 4000 fps: 3.12012 total_reward: 200 total_reward_ma: 100 loss: 0.259016 lr: 0.01458 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 10:55:34,417 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 100 strength: 78.14 max_strength: 178.14 final_strength: 178.14 sample_efficiency: 0.000275252 training_efficiency: 6.88705e-06 stability: 1
[2021-07-24 10:58:17,041 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 103 t: 188 wall_t: 1445 opt_step: 178720 frame: 4500 fps: 3.11419 total_reward: 154 total_reward_ma: 106 loss: 0.235752 lr: 0.013122 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 10:58:17,050 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 106 strength: 84.14 max_strength: 178.14 final_strength: 132.14 sample_efficiency: 0.000265999 training_efficiency: 6.66165e-06 stability: 0.926414
[2021-07-24 11:00:59,460 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 106 t: 141 wall_t: 1607 opt_step: 198720 frame: 5000 fps: 3.11139 total_reward: 178 total_reward_ma: 113.2 loss: 0.162558 lr: 0.013122 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:00:59,467 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 113.2 strength: 91.34 max_strength: 178.14 final_strength: 156.14 sample_efficiency: 0.000254717 training_efficiency: 6.38311e-06 stability: 0.939255
[2021-07-24 11:03:42,026 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 109 t: 117 wall_t: 1770 opt_step: 218720 frame: 5500 fps: 3.10734 total_reward: 179 total_reward_ma: 119.182 loss: 0.0619462 lr: 0.0118098 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:03:42,037 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 119.182 strength: 97.3218 max_strength: 178.14 final_strength: 157.14 sample_efficiency: 0.000244016 training_efficiency: 6.11727e-06 stability: 0.949639
[2021-07-24 11:06:24,799 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 112 t: 103 wall_t: 1933 opt_step: 238720 frame: 6000 fps: 3.10398 total_reward: 155 total_reward_ma: 122.167 loss: 2.09935 lr: 0.0118098 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:06:24,808 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 122.167 strength: 100.307 max_strength: 178.14 final_strength: 133.14 sample_efficiency: 0.000235461 training_efficiency: 5.90398e-06 stability: 0.934612
[2021-07-24 11:09:07,338 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 115 t: 133 wall_t: 2095 opt_step: 258720 frame: 6500 fps: 3.10263 total_reward: 169 total_reward_ma: 125.769 loss: 1.67609 lr: 0.0106288 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:09:07,346 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 125.769 strength: 103.909 max_strength: 178.14 final_strength: 147.14 sample_efficiency: 0.000226571 training_efficiency: 5.6819e-06 stability: 0.941845
[2021-07-24 11:11:50,149 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 118 t: 146 wall_t: 2258 opt_step: 278720 frame: 7000 fps: 3.10009 total_reward: 153 total_reward_ma: 127.714 loss: 2.25914 lr: 0.0106288 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:11:50,158 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 127.714 strength: 105.854 max_strength: 178.14 final_strength: 131.14 sample_efficiency: 0.000219163 training_efficiency: 5.4966e-06 stability: 0.936335
[2021-07-24 11:14:32,865 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 121 t: 91 wall_t: 2421 opt_step: 298720 frame: 7500 fps: 3.09789 total_reward: 177 total_reward_ma: 131 loss: 1.16432 lr: 0.00956594 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:14:32,873 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 131 strength: 109.14 max_strength: 178.14 final_strength: 155.14 sample_efficiency: 0.000211029 training_efficiency: 5.29295e-06 stability: 0.941969
[2021-07-24 11:17:15,534 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 124 t: 69 wall_t: 2584 opt_step: 318720 frame: 8000 fps: 3.09598 total_reward: 200 total_reward_ma: 135.312 loss: 3.20369 lr: 0.00956594 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:17:15,546 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 135.312 strength: 113.452 max_strength: 178.14 final_strength: 178.14 sample_efficiency: 0.000202587 training_efficiency: 5.08143e-06 stability: 0.947468
[2021-07-24 11:19:58,220 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 127 t: 106 wall_t: 2746 opt_step: 338720 frame: 8500 fps: 3.09541 total_reward: 152 total_reward_ma: 136.294 loss: 1.04083 lr: 0.00860934 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:19:58,234 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 136.294 strength: 114.434 max_strength: 178.14 final_strength: 130.14 sample_efficiency: 0.000196904 training_efficiency: 4.939e-06 stability: 0.926181
[2021-07-24 11:22:41,174 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 130 t: 35 wall_t: 2909 opt_step: 358720 frame: 9000 fps: 3.09385 total_reward: 183 total_reward_ma: 138.889 loss: 2.89101 lr: 0.00860934 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:22:41,183 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 138.889 strength: 117.029 max_strength: 178.14 final_strength: 161.14 sample_efficiency: 0.000190341 training_efficiency: 4.77443e-06 stability: 0.931119
[2021-07-24 11:25:24,265 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 133 t: 1 wall_t: 3072 opt_step: 378720 frame: 9500 fps: 3.09245 total_reward: 180 total_reward_ma: 141.053 loss: 5.7599 lr: 0.00774841 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:25:24,274 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 141.053 strength: 119.193 max_strength: 178.14 final_strength: 158.14 sample_efficiency: 0.000184401 training_efficiency: 4.62542e-06 stability: 0.934964
[2021-07-24 11:27:56,455 PID:27435 INFO __init__.py log_summary] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df] epi: 135 t: 156 wall_t: 3224 opt_step: 398720 frame: 10000 fps: 3.10174 total_reward: 174 total_reward_ma: 142.7 loss: 0.364511 lr: 0.00774841 explore_var: 0.1 entropy_coef: nan entropy: nan grad_norm: nan
[2021-07-24 11:27:56,466 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [train_df metrics] final_return_ma: 142.7 strength: 120.84 max_strength: 178.14 final_strength: 152.14 sample_efficiency: 0.000179087 training_efficiency: 4.49212e-06 stability: 0.936856
[2021-07-24 11:27:59,648 PID:27435 INFO __init__.py log_metrics] Trial 0 session 1 dqn_cartpole_t0_s1 [eval_df metrics] final_return_ma: 142.7 strength: 120.84 max_strength: 178.14 final_strength: 152.14 sample_efficiency: 0.000179087 training_efficiency: 4.49212e-06 stability: 0.936856
[2021-07-24 11:27:59,649 PID:27435 INFO logger.py info] Session 1 done
This is nearly one hour of running (wall_t: 3224 s, about 3.1 fps) on 128 cores (apparently all are used), and the session ultimately fails to reach the pass score of 195: the final total_reward_ma is 142.7. Could the slowness be explained by using all the CPUs and spending a lot of time on synchronization?
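One way to test the oversubscription hypothesis would be to cap the number of threads the numeric libraries may use and compare the fps. The sketch below is only an illustration, not something SLM Lab provides: limit_threads is a hypothetical helper, and whether thread contention is actually the cause of the slowness here is unconfirmed. It assumes the backends honor the usual OMP_NUM_THREADS, MKL_NUM_THREADS and OPENBLAS_NUM_THREADS variables, and it calls torch.set_num_threads only if PyTorch has already been imported.

```python
import os
import sys

# Environment variables read by common numeric backends (OpenMP, MKL, OpenBLAS).
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_threads(n: int) -> dict:
    """Cap worker threads for numeric libraries (hypothetical helper, not part of SLM Lab)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # These variables are normally read when a library initializes,
    # so set them before importing numpy/torch where possible.
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n)
    # If PyTorch is already loaded, the variables may have been read too late,
    # so also set its intra-op thread count directly.
    torch = sys.modules.get("torch")
    if torch is not None:
        torch.set_num_threads(n)
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


if __name__ == "__main__":
    print(limit_threads(4))
```

Calling this at the very top of the entry script, before any other imports, and re-running the same spec would show whether fps changes noticeably with a small thread count.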
Error logs
../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0230
This seems like a glibc problem. Could you try upgrading or downgrading to a different version? A similar issue is discussed here: rstudio/rstudio#6379 (comment)
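To help narrow this down, a small script can report the installed glibc version and pull the syscall number out of the error line. This is a rough sketch under stated assumptions: the number in the message is taken to be decimal, and the one-entry lookup table assumes x86_64 Linux numbering, where 230 is clock_nanosleep. The names parse_syscall_number and X86_64_SYSCALLS are made up for this example.

```python
import platform
import re
from typing import Optional

# Assumption: x86_64 Linux syscall numbering, where 230 is clock_nanosleep.
X86_64_SYSCALLS = {230: "clock_nanosleep"}


def parse_syscall_number(message: str) -> Optional[int]:
    """Extract the syscall number from a seccomp-bpf crash line (assumed to be decimal)."""
    match = re.search(r"seccomp-bpf failure in syscall (\d+)", message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    line = (
        "../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:"
        "**CRASHING**:seccomp-bpf failure in syscall 0230"
    )
    number = parse_syscall_number(line)
    print("syscall:", number, X86_64_SYSCALLS.get(number, "unknown"))
    # platform.libc_ver() returns a (lib, version) tuple, e.g. the glibc version on Linux.
    print("libc:", platform.libc_ver())
```

Posting the reported libc version alongside the error would make it easier to compare against the versions mentioned in the linked discussion.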
closing issue as stale