philtabor/Reinforcement-Learning-In-Motion

Wrong array dimension/index in tdZeroApprox.py

vdouet opened this issue · 0 comments

Hi,

For the code for TD(0) approximation 'tdZeroApprox.py' in Unit-8-The-Mountaincar.

Line 19 when the numpy array 'tiledState' is created:

tiledState = np.zeros(nTiles*nTiles*nTiles)

Shouldn't it be:

tiledState = np.zeros((nTiles - 1)*(nTiles - 1)*nLayers)

Because if the number of layers is different from the number of tiles then the "tiledState" array will not have the right dimension.

Also, here nTiles = nBins = 8. So if we have 8 bins per axis we have 7 actual tiles per axis and np.digitize will return a number between 1 and 7 (because the if condition is strictly inferior/superior) so a total of 49 tiles.

Furthermore, I think idx should be idx = x * y + row * (nTiles-1)**2 - 1 because of the fact that np.digitize returns values between [1, 7] and not between [0, 7]. The values will then be for each row [0, 48], [49, 97], [98, 146], etc. Because with the original idx: idx = (x + 1) * (y + 1) + row * nTiles**2 - 1 there will be some index values that can never be used in tiledState (for example, it will always start at index n°3: (1+1) * (1+1) + 0 - 1 = 3).

I think it is because in the slide when you introduce the index equation during the chapter "Linear methods and tiling" the first possible value for x and y is (x,y) = (0,0). But the first value possible with np.digitize is (x,y) = (1,1). If we want the index equation to still be true we should maybe write:

x = np.digitize(position, pos_bins[row]) - 1
y = np.digitize(velocity, vel_bins[row]) - 1

and then we can write:

 idx = (x + 1) * (y + 1) + (row * n_tiles**2) - 1

But here is the modified code I used:

def tile_state(pos_bins, vel_bins, obs, n_bins=8, n_layers=8):

    position, velocity = obs

    # The number of tiles per axis is the number of bins per axis - 1
    n_tiles = n_bins - 1
    tiled_state = np.zeros(n_tiles * n_tiles * n_layers)

    for row in range(n_layers):

        if position > pos_bins[row][0] and \
                position < pos_bins[row][n_bins - 1]:

            if velocity > vel_bins[row][0] and \
                    velocity < vel_bins[row][n_bins - 1]:

                x = np.digitize(position, pos_bins[row])
                y = np.digitize(velocity, vel_bins[row])
                idx = (x * y) + (row * n_tiles**2) - 1

                tiled_state[idx] = 1.0

            else:
                break
        else:
            break

    return tiled_state

And I find the following result:

Update rule

But maybe there is something I did not understand?

Best regards,
Victor Douet