pat-coady/trpo

Roboschool issue (dimensionality of `action` in train.py:105)

pender opened this issue · 2 comments

Hi! I love your repo (and your blog, and your suggestions for an ML intro curriculum of MOOCs) -- thank you!

Submitting this as an issue rather than a PR because I'm not sure if I fixed the issue in the best way.

I am having an issue trying to run `train.py` on a Roboschool environment. I added `import roboschool` to the top of `train.py` (which registers the Roboschool environments) and had the following result:

$ python3 train.py RoboschoolReacher-v1 -n 60000 -b 50
[2017-07-30 16:39:35,372] Making new env: RoboschoolReacher-v1
Value Params -- h1: 100, h2: 22, h3: 5, lr: 0.00213
[bunch of TF initialization...]
Traceback (most recent call last):
  File "train.py", line 325, in <module>
    main(**vars(args))
  File "train.py", line 285, in main
    run_policy(env, policy, scaler, logger, episodes=5)
  File "train.py", line 135, in run_policy
    observes, actions, rewards, unscaled_obs = run_episode(env, policy, scaler)
  File "train.py", line 105, in run_episode
    obs, reward, done, _ = env.step(action)
  File "/mnt/brian/gym/gym/core.py", line 99, in step
    return self._step(action)
  File "/mnt/brian/gym/gym/wrappers/time_limit.py", line 36, in _step
    observation, reward, done, info = self.env.step(action)
  File "/mnt/brian/gym/gym/core.py", line 99, in step
    return self._step(action)
  File "/home/pender/roboschool/roboschool/gym_reacher.py", line 53, in _step
    self.apply_action(a)
  File "/home/pender/roboschool/roboschool/gym_reacher.py", line 27, in apply_action
    self.central_joint.set_motor_torque( 0.05*float(np.clip(a[0], -1, +1)) )
TypeError: only length-1 arrays can be converted to Python scalars

I used some debug statements to determine that line 105 of `train.py` is calling `env.step(action)` when the value of `action` is `[[-0.70904064 -0.71731383]]` -- i.e. an array of shape (1, 2) rather than a one-dimensional array of length 2. The action space for the environment is `Box(2,)`, so I think it should just be a flat array of two floats.

I tried changing line 105 to `obs, reward, done, _ = env.step(action[0])` to eliminate the degenerate dimension, and it seems to work after that change.
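For context, the relevant part of `run_episode` with that change looks roughly like this (a sketch from memory, not the exact code; `policy.sample` is just my shorthand for however the policy produces an action):

```python
# Sketch of the loop body in run_episode (train.py around line 105).
# The policy returns an action of shape (1, act_dim), per the debug output above.
action = policy.sample(obs)                   # shape (1, act_dim)
obs, reward, done, _ = env.step(action[0])    # pass a flat (act_dim,) array to the env
```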

I'm on Ubuntu 16.04.2 LTS, TF v1.2.1, gym v0.9.1, fresh install of roboschool as of 5 minutes ago.

@pender - Sorry for the slow reply; I need to change my settings so I receive notifications.

I'll look at your solution today. Even within the MuJoCo tasks there was inconsistency in how observations were returned. I think I even ran into mixed types within a single observation. As you can see, I had to do a lot of brute-force casting.

I think your solution was good, but I decided to use `np.squeeze` to remove the extra dimension. I'll push this to the master branch. I'm going to keep the `aigym_evaluation` branch frozen where it was when I ran all 10 MuJoCo environments. (The fix doesn't seem to cause a problem with the MuJoCo environments; they were just forgiving of the extra dimension.)
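Concretely, the change I have in mind is roughly this (a sketch; the actual line in master may look a bit different):

```python
import numpy as np

# Drop the extra leading dimension before handing the action to the env:
# (1, act_dim) -> (act_dim,).
obs, reward, done, _ = env.step(np.squeeze(action, axis=0))
```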

I'm glad you posted; I'm looking forward to trying the Roboschool environments. I'm curious how the simulation speed compares.

OpenAI just posted a short PPO paper and they use a different loss function. I'll probably give that a try soon.
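For reference, the loss they describe is the clipped surrogate objective. A minimal NumPy sketch of it (not code from this repo, just the idea from the paper) looks something like:

```python
import numpy as np

def ppo_clip_loss(ratio, advantages, clip_eps=0.2):
    """Clipped surrogate loss from the PPO paper (negated so it can be minimized).

    ratio      -- pi_new(a|s) / pi_old(a|s) for each sampled action
    advantages -- advantage estimates for the same samples
    """
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```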