Question about batch_updating function in chapter06/random_walk.py
hitblackjack opened this issue · 1 comment
hitblackjack commented
149 while True:
150     # keep feeding our algorithm with trajectories seen so far until state value function converges
151     updates = np.zeros(7)
152     for trajectory_, rewards_ in zip(trajectories, rewards):
153         for i in range(0, len(trajectory_) - 1):
154             if method == 'TD':
155                 updates[trajectory_[i]] += rewards_[i] + current_values[trajectory_[i + 1]] - current_values[trajectory_[i]]
156             else:
157                 updates[trajectory_[i]] += rewards_[i] - current_values[trajectory_[i]]
158     updates *= alpha
159     if np.sum(np.abs(updates)) < 1e-3:
160         break
161     # perform batch updating
162     current_values += updates
163 # calculate rms error
164 errors.append(np.sqrt(np.sum(np.power(current_values - TRUE_VALUE, 2)) / 5.0))
Since there is no per-trajectory handling of the "updates" array in the for loop over trajectories (line 152), it seems to me that all the previous trajectories don't count if the last trajectory contains all the states. Doesn't this loop just overwrite "updates"? If so, what does the "batch" mean?
For the TD method, does the inner loop over one trajectory/episode (line 153) mean that we can't update until the whole episode terminates? If so, what is the difference from MC?
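To make that second question concrete, here is a small standalone sketch of how I read the two targets. The values, trajectory, and rewards below are made up (with terminal values set to zero, textbook style, rather than this file's convention); only the shape of the two update rules comes from the snippet above.

import numpy as np

V = np.array([0.0, 0.2, 0.3, 0.5, 0.8, 0.9, 0.0])  # made-up stand-in for current_values
trajectory = [3, 4, 5, 6]  # one finished episode, ending in the right terminal
rewards = [0, 0, 1]        # made-up reward observed after each transition

for i in range(len(trajectory) - 1):
    # TD(0): bootstraps from the current estimate of the next state
    td_error = rewards[i] + V[trajectory[i + 1]] - V[trajectory[i]]
    # constant-alpha MC: waits for the actual return from step i onward
    mc_error = sum(rewards[i:]) - V[trajectory[i]]
    print(trajectory[i], td_error, mc_error)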
Forgive my ignorance, thanks!
ShangtongZhang commented
I think updates for different trajectories are accumulated via +=.
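A quick standalone check of that point (the values and the two trajectories below are made up; the loop mirrors the TD branch of the snippet):

import numpy as np

current_values = np.array([0.0, 0.2, 0.3, 0.5, 0.8, 0.9, 0.0])
# two made-up trajectories that both visit state 3
trajectories = [[3, 4, 5, 6], [3, 2, 1, 0]]
rewards = [[0, 0, 1], [0, 0, 0]]

updates = np.zeros(7)
for trajectory_, rewards_ in zip(trajectories, rewards):
    for i in range(len(trajectory_) - 1):
        # TD branch of the snippet; += adds on top of earlier trajectories
        updates[trajectory_[i]] += (rewards_[i]
                                    + current_values[trajectory_[i + 1]]
                                    - current_values[trajectory_[i]])

# State 3 receives 0.3 from the first trajectory and -0.2 from the second,
# so this prints ~0.1, not just the last trajectory's -0.2.
print(updates[3])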