Reproducing the results of the paper
Closed this issue · 6 comments
Hi,
I tried reproducing the results of the paper on HalfCheetah-v2. I ran the following commands.
python3 scripts/train_agent.py "./runs/halfcheetah_checkpoints" SB3_ON HalfCheetah-v2 cuda '{"ALGO": "PPO"}' --save_freq=10000
python3 scripts/generate_plane_jobs.py --grid-size=31 --magnitude=1.0 --num-steps=200000 "runs/halfcheetah_checkpoints/0200000" "runs/halfcheetah_surface"
python3 scripts/run_jobs_multiproc.py --num-cpus=14 "runs/halfcheetah_surface/jobs.sh"
python3 scripts/job_results_to_csv.py "runs/halfcheetah_surface"
python3 scripts/plot_plane.py "runs/halfcheetah_surface/results.csv" --outname="runs/halfcheetah" --env_name="HalfCheetah-v2"
These are the commands from the README, except that I loaded the checkpoint at 200,000 steps, which was also done in the paper according to Table 2. I ran the experiment twice and found that the results of both runs look quite different from those reported in Figure 11.
So I was wondering whether there are any hyperparameters that I missed or whether I need to do anything else different to obtain results similar to those in the paper.
Hi, I believe our results look different because we used the trained checkpoint after 1 million timesteps, not 200,000. I assume the agents you show here have not yet converged to the policy we found in our paper. Table 2 in our paper refers to the number of evaluation steps used to evaluate each point in the plot. The correct number of training steps for each environment can be found in the hyperparameters folder; if you don't change any settings, it will use the correct number (according to RL Zoo) by default.
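Concretely, the only command that would change from the sequence above is the checkpoint path passed to generate_plane_jobs.py. A sketch (the checkpoint directory name here is an assumption; substitute whatever directory your fully trained run actually produced):

```shell
# Point surface generation at the fully trained checkpoint
# instead of the 200,000-step snapshot:
python3 scripts/generate_plane_jobs.py --grid-size=31 --magnitude=1.0 --num-steps=200000 \
    "runs/halfcheetah_checkpoints/best" "runs/halfcheetah_surface"
```

The remaining commands (run_jobs_multiproc.py, job_results_to_csv.py, plot_plane.py) stay the same.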
Ok, then I misinterpreted Table 2. Thanks for clarifying.
Just to be sure: If I don't change any settings, the algorithm chooses the best performing checkpoint (runs/halfcheetah_checkpoints/best). Is this how the plots in the paper were generated?
Yes, that is correct. We plotted the best checkpoint for each environment after training with the hyperparameters found in the hyperparameters folder. The settings in Table 2 are for the generate_plane_jobs.py script.
I believe those are almost correct; I just plotted them on a linear scale in the paper instead of the logarithmic scale (which is the default for reward scales this large). You can disable the logarithmic scale by passing the flag --logscale off to the plot_plane.py script.
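Putting that together with the plotting command from the top of the thread, re-plotting the existing results on a linear reward scale would look something like this (a sketch reusing the earlier paths; no regeneration of the surface data should be needed):

```shell
# Re-plot the already-computed surface on a linear scale:
python3 scripts/plot_plane.py "runs/halfcheetah_surface/results.csv" \
    --outname="runs/halfcheetah" --env_name="HalfCheetah-v2" --logscale off
```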