RyanNavillus/reward-surfaces

Reproducing the results of the paper

Closed this issue · 6 comments

Hi,

I tried reproducing the results of the paper on HalfCheetah-v2. I ran the following commands.

python3 scripts/train_agent.py "./runs/halfcheetah_checkpoints" SB3_ON HalfCheetah-v2 cuda '{"ALGO": "PPO"}' --save_freq=10000
python3 scripts/generate_plane_jobs.py --grid-size=31 --magnitude=1.0 --num-steps=200000 "runs/halfcheetah_checkpoints/0200000" "runs/halfcheetah_surface"
python3 scripts/run_jobs_multiproc.py --num-cpus=14 "runs/halfcheetah_surface/jobs.sh"
python3 scripts/job_results_to_csv.py "runs/halfcheetah_surface"
python3 scripts/plot_plane.py "runs/halfcheetah_surface/results.csv" --outname="runs/halfcheetah" --env_name="HalfCheetah-v2"

These are the commands from the README, except that I loaded the checkpoint at 200,000 steps, which was also done in the paper according to Table 2. I ran the experiment twice, and the results of both runs look quite different from those reported in Figure 11.

[Plot: halfcheetah_episoderewards_3dsurface_1]
[Plot: halfcheetah_episoderewards_3dsurface_2]

So I was wondering whether there are any hyperparameters that I missed or whether I need to do anything else different to obtain results similar to those in the paper.

Hi, I believe our results look different because we used the checkpoint trained for 1 million timesteps, not 200,000. I assume the agents you show here have not yet converged to the same policy that we found in our paper. Table 2 in our paper refers to the number of evaluation steps used to evaluate each point in the plot. The correct number of training steps for each environment can be found in the hyperparameters folder. If you don't change any settings, the training script will use the correct number (according to RL Zoo) by default.
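For reference, a sketch of the plane-generation command from above pointed at the later checkpoint instead (the 1000000 directory name is an assumption here, following the 0200000 naming pattern used earlier; everything else is unchanged):

python3 scripts/generate_plane_jobs.py --grid-size=31 --magnitude=1.0 --num-steps=200000 "runs/halfcheetah_checkpoints/1000000" "runs/halfcheetah_surface"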

Ok, then I misinterpreted Table 2. Thanks for clarifying.
Just to be sure: if I don't change any settings, the script uses the best-performing checkpoint (runs/halfcheetah_checkpoints/best). Is this how the plots in the paper were generated?

Yes, that is correct. We plotted the best checkpoint for each environment after training with the hyperparameters found in the hyperparameters folder. The settings in Table 2 are for the generate_plane_jobs.py script.
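To be explicit, the remaining steps with the best checkpoint would then look something like the README commands above, only with the checkpoint path swapped (all paths taken from this thread):

python3 scripts/generate_plane_jobs.py --grid-size=31 --magnitude=1.0 --num-steps=200000 "runs/halfcheetah_checkpoints/best" "runs/halfcheetah_surface"
python3 scripts/run_jobs_multiproc.py --num-cpus=14 "runs/halfcheetah_surface/jobs.sh"
python3 scripts/job_results_to_csv.py "runs/halfcheetah_surface"
python3 scripts/plot_plane.py "runs/halfcheetah_surface/results.csv" --outname="runs/halfcheetah" --env_name="HalfCheetah-v2"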

I ran the code with runs/halfcheetah_checkpoints/best, but the results still look different from those in the paper. Below are the results of two runs in this configuration.
[Plot: halfcheetah_best_00_episoderewards_3dsurface]
[Plot: halfcheetah_best_01_episoderewards_3dsurface]

I believe those are almost correct; for the paper I just chose to plot these on a linear scale instead of the logarithmic scale (which is the default for reward scales this large). You can disable the logarithmic scale by passing the flag --logscale off to the plot_plane.py script.
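For example, only the plotting step should need to be re-run with the flag appended (a sketch based on the plot command from above, assuming the existing results.csv can be reused):

python3 scripts/plot_plane.py "runs/halfcheetah_surface/results.csv" --outname="runs/halfcheetah" --env_name="HalfCheetah-v2" --logscale off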

Ah, I missed that. With the --logscale off argument, the plots look similar to those in the paper.

[Plot: halfcheetah_best_00_linear_episoderewards_3dsurface]
[Plot: halfcheetah_best_01_linear_episoderewards_3dsurface]

Thanks for your help!