facebookresearch/dcd

Cannot reproduce results on benchmark maze

Closed this issue · 11 comments

Hey,

I trained a model with the official hyperparameters you provided, using the command returned by python train_scripts/make_cmd.py --json minigrid/60_blocks_uniform/mg_60b_uni_accel_empty.json. After training finished without any errors, I evaluated the model on the maze benchmark. The table below shows the results for the checkpoint after 20000 steps (model_20000.tar) and for the final model (model.tar). ACCEL 1 and ACCEL 2 are the models reported in the appendix of the published paper (page 25, Table 5):

| Level | ACCEL 1 | ACCEL 2 | Model_20000.tar | Model.tar |
| --- | --- | --- | --- | --- |
| Maze2 | 0.93 | 1.0 | 0.45 | 0.75 |
| LargeCorridor | 0.94 | 1.0 | 0.72 | 0.48 |
| PerfectMazeMedium | 0.88 | 0.93 | 0.42 | 0.46 |

When I use the checkpoint after 20000 steps, which is also what the paper uses (if I understood it correctly), the results are only about half as good as the published ones. With the fully trained model the results are better, but there is still a big gap.

Do you know what could be the reason for that? I thought the use of UED would make the RL policy more stable and the results easier to reproduce.

I will also re-train the agent from scratch to see if the results are then reproducible, but this will take one week.

Thanks in advance!

minqi commented

Hi Manuel, can you share the output of your meta.json for this training run, generated in the results folder specified by --log_dir?

Just so I understand correctly: the model_20000.tar and model.tar columns are results from models produced by your own training run, started from scratch?

Can you also confirm whether the values you show for these two columns are the solved_rate and not the test_return?

Oh okay, I reported the test_returns that were printed by eval.py (see the output below). How can I compute the solved_rate then? I didn't find any argument for the solved_rate in the eval.py arguments.

Yes, model_20000.tar and model.tar are from the same model I trained from scratch. This was the output for both models:

#python -m eval  --base_path workspaces/logs/accel --model_tar model_20000 --benchmark maze

test_returns:MultiGrid-Maze2-v0: 0.45 +/- 0.32
test_returns:MultiGrid-LargeCorridor-v0: 0.72 +/- 0.15
test_returns:MultiGrid-PerfectMazeMedium-v0: 0.42 +/- 0.39
test_returns:MultiGrid-PerfectMazeLarge-v0: 0.14 +/- 0.30
test_returns:MultiGrid-PerfectMazeXL-v0: 0.02 +/- 0.11
-----------------------------------------------------
| iq_test_returns:MultiGrid-L... | 0.71--0.75--0.77 |
| iq_test_returns:MultiGrid-M... | 0.11--0.50--0.76 |
| iq_test_returns:MultiGrid-P... | 0.00--0.00--0.00 |
| test_returns:MultiGrid-Larg... | 0.72 +/- 0.15    |
| test_returns:MultiGrid-Maze... | 0.45 +/- 0.32    |
| test_returns:MultiGrid-Perf... | 0.02 +/- 0.11    |
-----------------------------------------------------
#python -m eval  --base_path /home/ma/e/eberhardinger/workspaces/logs/accel --model_tar model --benchmark maze

test_returns:MultiGrid-Maze2-v0: 0.75 +/- 0.14
test_returns:MultiGrid-LargeCorridor-v0: 0.48 +/- 0.43
test_returns:MultiGrid-PerfectMazeMedium-v0: 0.46 +/- 0.37
test_returns:MultiGrid-PerfectMazeLarge-v0: 0.20 +/- 0.32
test_returns:MultiGrid-PerfectMazeXL-v0: 0.08 +/- 0.21
-----------------------------------------------------
| iq_test_returns:MultiGrid-L... | 0.00--0.87--0.90 |
| iq_test_returns:MultiGrid-M... | 0.77--0.78--0.78 |
| iq_test_returns:MultiGrid-P... | 0.00--0.00--0.00 |
| test_returns:MultiGrid-Larg... | 0.48 +/- 0.43    |
| test_returns:MultiGrid-Maze... | 0.75 +/- 0.14    |
| test_returns:MultiGrid-Perf... | 0.08 +/- 0.21    |
-----------------------------------------------------

Here is the meta.json file. I also see that successful is false here, even though the training command finished and no exception was thrown, so I don't know why it is false:

{
    "args": {
        "adv_clip_reward": null,
        "adv_entropy_coef": 0.0,
        "adv_max_grad_norm": 0.5,
        "adv_normalize_returns": false,
        "adv_num_mini_batch": 1,
        "adv_ppo_epoch": 5,
        "adv_use_popart": false,
        "algo": "ppo",
        "alpha": 0.99,
        "antagonist_plr": false,
        "archive_interval": 5000,
        "base_levels": "easy",
        "checkpoint": true,
        "checkpoint_basis": "student_grad_updates",
        "checkpoint_interval": 100,
        "choose_start_pos": false,
        "clip_param": 0.2,
        "clip_reward": null,
        "clip_value_loss": true,
        "crop_frame": false,
        "disable_checkpoint": false,
        "entropy_coef": 0.0,
        "env_name": "MultiGrid-GoalLastEmptyAdversarialEnv-Edit-v0",
        "eps": 1e-05,
        "frame_stack": 1,
        "gae_lambda": 0.95,
        "gamma": 0.995,
        "grayscale": false,
        "handle_timelimits": true,
        "level_editor_method": "random",
        "level_editor_prob": 1.0,
        "level_replay_alpha": 1.0,
        "level_replay_eps": 0.05,
        "level_replay_prob": 0.8,
        "level_replay_rho": 0.5,
        "level_replay_schedule": "proportionate",
        "level_replay_score_transform": "rank",
        "level_replay_seed_buffer_priority": "replay_support",
        "level_replay_seed_buffer_size": 4000,
        "level_replay_strategy": "positive_value_loss",
        "level_replay_temperature": 0.3,
        "log_action_complexity": true,
        "log_dir": "workspaces/logs/accel",
        "log_grad_norm": false,
        "log_interval": 25,
        "log_plr_buffer_stats": true,
        "log_replay_complexity": true,
        "lr": 0.0001,
        "max_grad_norm": 0.5,
        "max_rad_ratio": 1.0,
        "min_rad_ratio": 0.333333333,
        "model_finetune": "model",
        "no_cuda": false,
        "no_exploratory_grad_updates": true,
        "normalize_returns": false,
        "num_action_repeat": 1,
        "num_control_points": 12,
        "num_edits": 5,
        "num_env_steps": 250000000,
        "num_goal_bins": 1,
        "num_mini_batch": 1,
        "num_processes": 32,
        "num_steps": 256,
        "ppo_epoch": 5,
        "protagonist_plr": false,
        "recurrent_adversary_env": false,
        "recurrent_agent": true,
        "recurrent_arch": "lstm",
        "recurrent_hidden_size": 256,
        "reject_unsolvable_seeds": false,
        "render": false,
        "reward_shaping": false,
        "screenshot_batch_size": 1,
        "screenshot_interval": 1000,
        "seed": 88,
        "singleton_env": false,
        "sparse_rewards": false,
        "staleness_coef": 0.3,
        "staleness_temperature": 1.0,
        "staleness_transform": "power",
        "test_env_names": "MultiGrid-SixteenRooms-v0,MultiGrid-Maze-v0,MultiGrid-Labyrinth-v0",
        "test_interval": 250,
        "test_num_episodes": 10,
        "test_num_processes": 2,
        "train_full_distribution": true,
        "ued_algo": "domain_randomization",
        "use_categorical_adv": false,
        "use_editor": true,
        "use_gae": true,
        "use_global_critic": false,
        "use_global_policy": false,
        "use_plr": true,
        "use_popart": false,
        "use_reset_random_dr": false,
        "use_sketch": true,
        "use_skip": false,
        "value_loss_coef": 0.5,
        "verbose": false,
        "weight_log_interval": 0,
        "xpid": "ued-MultiGrid-GoalLastEmptyAdversarialEnv-Edit-v0-domain_randomization-noexpgrad-lstm256a-lr0.0001-epoch5-mb1-v0.5-gc0.5-henv0.0-ha0.0-plr0.8-rho0.5-n4000-st0.3-positive_value_loss-rank-t0.3-editor1.0-random-n5-baseeasy-tl_0",
        "xpid_finetune": null
    },
    "date_end": null,
    "date_start": "2022-11-06 18:13:33.385019",
    "env": {
        "CAML_LD_LIBRARY_PATH": "/container/.opam/4.06.1+flambda/lib/stublibs:/container/.opam/4.06.1+flambda/lib/ocaml/stublibs:/container/.opam/4.06.1+flambda/lib/ocaml",
        "CONDA_AUTO_ACTIVATE_BASE": "false",
        "CONDA_EXE": "/opt/miniconda3/bin/conda",
        "CONDA_PYTHON_EXE": "/opt/miniconda3/bin/python",
        "CONDA_SHLVL": "0",
        "CUDA_VISIBLE_DEVICES": "0",
        "CURL_CA_BUNDLE": "/usr/local/conda/lib/python3.7/site-packages/certifi/cacert.pem",
        "HOME": "/container/",
        "KMP_DUPLICATE_LIB_OK": "True",
        "KMP_INIT_AT_FORK": "FALSE",
        "LC_CTYPE": "C.UTF-8",
        "LD_LIBRARY_PATH": ":/opt/miniconda3/lib/:/opt/miniconda3/envs/my_new_env/lib/",
        "LESSCLOSE": "/usr/bin/lesspipe %s %s",
        "LESSOPEN": "| /usr/bin/lesspipe %s",
        "LOGNAME": "eberhardinger",
        "LS_COLORS": "rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.webp=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:",
        "MANPATH": ":/container/.opam/4.06.1+flambda/man",
        "NUM_GPUS": "",
        "NVIDIA_DRIVER_CAPABILITIES": "compute,utility,video",
        "NVIDIA_REQUIRE_CUDA": "cuda>=7.5",
        "NVIDIA_VISIBLE_DEVICES": "all",
        "OCAML_TOPLEVEL_PATH": "/container/.opam/4.06.1+flambda/lib/toplevel",
        "OLDPWD": "/home/ma/e/eberhardinger/workspaces",
        "OMP_NUM_THREADS": "1",
        "OPAM_SWITCH_PREFIX": "/container/.opam/4.06.1+flambda",
        "PATH": "/opt/miniconda3/condabin:/container/.opam/4.06.1+flambda/bin:/usr/local/conda/bin:/container/pypy3.6-v7.3.3-linux64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "PKG_CONFIG_PATH": "/container/.opam/4.06.1+flambda/lib/pkgconfig",
        "PWD": "/home/ma/e/eberhardinger/workspaces/dcd",
        "SHELL": "/usr/bin/bash",
        "SHLVL": "1",
        "SLURMD_NODENAME": "tardis",
        "SLURM_CPUS_ON_NODE": "4",
        "SLURM_JOBID": "49083",
        "SLURM_JOB_PARTITION": "gpu",
        "SLURM_JOB_USER": "eberhardinger",
        "START_TIME": "",
        "TERM": "screen",
        "TF2_BEHAVIOR": "1",
        "TF_CPP_MIN_LOG_LEVEL": "2",
        "TIME_LIMIT": "",
        "USER": "root",
        "_": "/usr/local/conda/bin/python",
        "_CE_CONDA": "",
        "_CE_M": ""
    },
    "git": null,
    "slurm": null,
    "successful": false,
    "xpid": "ued-MultiGrid-GoalLastEmptyAdversarialEnv-Edit-v0-domain_randomization-noexpgrad-lstm256a-lr0.0001-epoch5-mb1-v0.5-gc0.5-henv0.0-ha0.0-plr0.8-rho0.5-n4000-st0.3-positive_value_loss-rank-t0.3-editor1.0-random-n5-baseeasy-tl_0"
}
minqi commented

Thanks for sharing the extra context! eval.py should create an output .csv file inside a folder called results in your project's root directory. That results file should include the solved rates.

Make sure to pass in the argument --accumulator=mean when running eval.py. I have updated the README.md with this information.
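For example, re-running your earlier command with the accumulator flag added (same paths as in your output above) would look like:

```
python -m eval --base_path workspaces/logs/accel --model_tar model_20000 --benchmark maze --accumulator=mean
```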

I think I found the reason why I didn't have any solved rates reported... I need to set the accumulator to mean.

Yes, that was the problem. I was just evaluating with the default arguments on the maze benchmark.

Hey Minqi,

I have one more question about the level sampler, as I still have trouble understanding it properly. I attached a screenshot of the level images saved every 1000 iterations, and somehow all images with n_edits > 0 also have replay=True, while all levels without edits are never replayed. Can this be correct?

I thought that when ACCEL edits a level, replay should be False, since after editing it is a new level rather than one replayed from the LevelStore. Am I wrong in my assumption?
I also restarted training and took a screenshot every 100 iterations, and I can observe the same behaviour there.

Thanks in advance!

[screenshot of the saved level images attached]

minqi commented

Hi Manuel,

By default, the current screenshot logic simply saves an image of the first level in a rollout batch of num_processes levels every screenshot_interval rollouts (each rollout being a PPO update cycle). That means the screenshot can land on a rollout batch that uses either replay levels or new levels.
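Roughly, that behaviour looks like the sketch below. The helper and variable names here are illustrative rather than the repo's actual identifiers, and I'm assuming a vectorized env that exposes one rendered frame per parallel process via get_images():

```python
import os
import imageio

def maybe_save_screenshot(venv, screenshot_dir, num_updates,
                          screenshot_interval, is_replay, num_edits):
    """Save an image of the first level in the current rollout batch every
    `screenshot_interval` PPO rollouts (hypothetical helper for illustration)."""
    if num_updates % screenshot_interval != 0:
        return
    # Only the first of the num_processes parallel levels is captured, so the
    # saved level can come from either a replay batch or a batch of new levels.
    frame = venv.get_images()[0]
    fname = f"update_{num_updates}_replay_{is_replay}_nedits_{num_edits}.png"
    imageio.imwrite(os.path.join(screenshot_dir, fname), frame)
```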

This implementation of ACCEL edits levels after every replay rollout. The edited levels are then immediately evaluated, and only the edited levels whose PLR score is higher than the lowest score currently in the level buffer are added to the buffer. No screenshot logic applies to these edit-evaluation rollouts. That means the only rollouts on edited levels to which screenshotting applies are replay rollouts, i.e. when an edited level is later sampled from the level buffer, since that is the only way the agent revisits an edited level.
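To make the insertion rule concrete, here is a minimal toy sketch of a PLR-style buffer (not the repo's actual LevelStore/LevelSampler API): after the edit-evaluation rollout is scored, the edited level only enters the buffer if its score beats the current lowest-priority entry.

```python
import heapq

class TinyLevelBuffer:
    """Toy priority buffer for illustration: keeps the `capacity` highest-scoring
    levels, so a new level is only kept if it beats the current minimum."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (score, level)

    def maybe_insert(self, level, score):
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (score, level))
            return True
        if score > self._heap[0][0]:           # beats the lowest-scoring entry
            heapq.heapreplace(self._heap, (score, level))
            return True
        return False                           # discarded, never replayed

# Usage: score the edited level from its evaluation rollout (e.g. positive
# value loss) and try to insert it.
buf = TinyLevelBuffer(capacity=2)
buf.maybe_insert("level_a", score=0.8)    # accepted (buffer not full)
buf.maybe_insert("level_b", score=0.1)    # accepted (buffer not full)
buf.maybe_insert("edited_c", score=0.5)   # accepted, evicts level_b (0.1)
buf.maybe_insert("edited_d", score=0.05)  # rejected, below current minimum
```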

Thanks for the quick reply! Okay, that makes sense now!

But does that also mean that ACCEL has problems with catastrophic forgetting, given the many n_edits=0 screenshots, since it has to keep revisiting levels without edits? Or is my interpretation wrong?

minqi commented

n_edits=0 can result from screenshots of brand-new levels sampled from the simulator (i.e. not a replay rollout), or from the agent replaying a level from the level buffer that has not yet been edited. The latter case, if it repeatedly occurs for the same level, would point to that level being especially challenging for the agent.

In general, I don't think you can read into the existence of n_edits=0 screenshots as a sign of catastrophic forgetting.

Thanks for the clarification!

I also came to that conclusion when I evaluated the 20000-step and 25000-step models, because the solved rate decreased significantly on the large corridor environment. I thought one reason for this could be the structural difference between that environment and the maze environments, and also that ACCEL creates more environments that look like mazes than like the large corridor. So I took the n_edits=0 screenshots as a sign of this as well.

| Level | Model 20000 steps | Model 25000 steps |
| --- | --- | --- |
| Maze2 | 0.82 | 1.00 |
| LargeCorridor | 0.99 | 0.38 |
| PerfectMazeMedium | 0.59 | 0.78 |
minqi commented

Yes, that is a plausible cause for this kind of instability in performance. The policy is also stochastic, so performance on many of the environments can have significant variance.

You are right that the UED problem setting is a kind of continual learning, so catastrophic forgetting will play a role here. We briefly mentioned this connection in the concluding discussion of the Robust PLR paper: https://arxiv.org/abs/2110.02439

Thank you very much! I will read the Robust PLR paper too. Now it all makes much more sense to me :)