ShangtongZhang/reinforcement-learning-an-introduction

Question about batch_updating function in chapter06/random_walk.py

hitblackjack opened this issue · 1 comment

149             while True:
150                 # keep feeding our algorithm with trajectories seen so far until state value function converges
151                 updates = np.zeros(7)
152                 for trajectory_, rewards_ in zip(trajectories, rewards):
153                     for i in range(0, len(trajectory_) - 1):
154                         if method == 'TD':
155                             updates[trajectory_[i]] += rewards_[i] + current_values[trajectory_[i + 1]] - current_values[trajectory_[i]]
156                         else:
157                             updates[trajectory_[i]] += rewards_[i] - current_values[trajectory_[i]]
158                 updates *= alpha
159                 if np.sum(np.abs(updates)) < 1e-3:
160                     break
161                 # perform batch updating
162                 current_values += updates
163             # calculate rms error
164             errors.append(np.sqrt(np.sum(np.power(current_values - TRUE_VALUE, 2)) / 5.0))
  1. Since the "updates" array is not indexed per trajectory in the for loop over trajectories (line 152), it seems to me that the previous trajectories don't count if the last trajectory contains all the states. Doesn't this loop just overwrite "updates"? If so, what does the "batch" mean?

  2. For the TD method, does the inner loop over a single trajectory/episode (line 153) mean that we can't update until the whole episode terminates? If so, what's the difference from MC?

Forgive my ignorance, thanks!

I think the updates for different trajectories are accumulated via "+=": the "updates" array is zeroed once per sweep (line 151), and every trajectory then adds its own increments into it, so earlier trajectories are not overwritten.
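
To make the accumulation concrete, here is a minimal sketch of the batch-updating loop (an illustrative rewrite, not the repo's exact code; the helper names batch_sweep and batch_update are made up). It assumes the same data layout as the snippet above: trajectories holds every episode seen so far as a list of state indices, and for the MC case each rewards_[i] is assumed to already hold the return that follows trajectory_[i].

import numpy as np

def batch_sweep(values, trajectories, rewards, method='TD'):
    """One sweep over every episode seen so far; returns the summed increments."""
    updates = np.zeros_like(values)
    for trajectory_, rewards_ in zip(trajectories, rewards):
        for i in range(len(trajectory_) - 1):
            if method == 'TD':
                # TD(0) error: one-step reward plus the current estimate of the
                # successor state, minus the current estimate of this state
                updates[trajectory_[i]] += (rewards_[i]
                                            + values[trajectory_[i + 1]]
                                            - values[trajectory_[i]])
            else:
                # constant-alpha MC error: rewards_[i] is assumed to be the
                # return following trajectory_[i] in this episode
                updates[trajectory_[i]] += rewards_[i] - values[trajectory_[i]]
    return updates

def batch_update(values, trajectories, rewards, method='TD', alpha=1e-3):
    """Re-sweep the whole batch of episodes until the increments vanish."""
    values = values.copy()
    while True:
        updates = alpha * batch_sweep(values, trajectories, rewards, method)
        if np.sum(np.abs(updates)) < 1e-3:
            break
        values += updates  # applied once per sweep, after every episode has contributed
    return values

So a later trajectory adds its increments on top of the earlier ones rather than overwriting them, and the "batch" refers to re-sweeping the whole set of episodes collected so far until the value estimates stop moving. As for question 2: in batch updating both methods wait until an episode has terminated and been added to the batch; the difference is only the target used in the error, a one-step bootstrapped estimate (reward plus the current value of the successor state) for TD versus the full observed return for MC.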