chuyangliu/snake

Questions about the local state usage and PER updates

voaneves opened this issue · 9 comments

@chuyangliu Could you help me, please, with some questions?

  1. I read in your docs, about the algorithms, that you're using a local state vector:

The second part is the local state vector, which tells the snake its surrounding situation. The vector contains 3 values (0 or 1), indicating whether the point in front/left/right of the snake head is dangerous (i.e., wall or body in the direction).

I'm achieving more or less the same performance as you with the global state alone, but the local state brought, as you showed, huge improvements. In your code, you check which of those (3 or 4) positions are safe, then stack that vector horizontally with the flattened visual state (np.hstack), right? Is that enough to make it usable by the model? (A quick sketch of what I mean follows after my code below.) Also, everything that is food, body, or otherwise important gets assigned a 1. But does the network understand by itself which is which (so it can seek the food, avoid the body, and so on)?

Your code

    def _state(self):
        """Return a vector indicating current state."""

        # Visual state
        visual_state = np.zeros(self._SHAPE_VISUAL_STATE, dtype=np.int32)
        for i in range(1, self.map.num_rows - 1):
            for j in range(1, self.map.num_cols - 1):

                pos = Pos(i, j)
                # Rotate the board coordinates so the state is always expressed
                # relative to the snake's current heading.
                if self._USE_RELATIVE:
                    if self.snake.direc == Direc.LEFT:
                        pos = Pos(self.map.num_rows - 1 - j, i)
                    elif self.snake.direc == Direc.UP:
                        pos = Pos(i, j)
                    elif self.snake.direc == Direc.RIGHT:
                        pos = Pos(j, self.map.num_cols - 1 - i)
                    elif self.snake.direc == Direc.DOWN:
                        pos = Pos(self.map.num_rows - 1 - i, self.map.num_cols - 1 - j)

                t = self.map.point(pos).type
                if t == PointType.EMPTY:
                    visual_state[i - 1][j - 1][0] = 1
                elif t == PointType.FOOD:
                    visual_state[i - 1][j - 1][1] = 1
                elif t == PointType.HEAD_L or t == PointType.HEAD_U or \
                     t == PointType.HEAD_R or t == PointType.HEAD_D:
                    visual_state[i - 1][j - 1][2] = 1
                elif t == PointType.BODY_LU  or t == PointType.BODY_UR or \
                     t == PointType.BODY_RD  or t == PointType.BODY_DL or \
                     t == PointType.BODY_HOR or t == PointType.BODY_VER:
                    visual_state[i - 1][j - 1][3] = 1
                else:
                    raise ValueError("Unsupported PointType: {}".format(t))

        if self._USE_VISUAL_ONLY:
            return visual_state.flatten()
        else:
            # Important state
            important_state = np.zeros(self._NUM_IMPORTANT_FEATURES, dtype=np.int32)
            head = self.snake.head()

            if self._USE_RELATIVE:
                for i, action in enumerate([SnakeAction.LEFT, SnakeAction.FORWARD, SnakeAction.RIGHT]):
                    direc = SnakeAction.to_direc(action, self.snake.direc)
                    if not self.map.is_safe(head.adj(direc)):
                        important_state[i] = 1
            else:
                for i, direc in enumerate([Direc.LEFT, Direc.UP, Direc.RIGHT, Direc.DOWN]):
                    if not self.map.is_safe(head.adj(direc)):
                        important_state[i] = 1

            return np.hstack((visual_state.flatten(), important_state))

My code

    def state(self):
        """Create a matrix of the current state of the game."""
        body = self.snake.return_body()
        # 'zeros' here is numpy.zeros; var.BOARD_SIZE comes from my constants module.
        canvas = zeros((var.BOARD_SIZE, var.BOARD_SIZE))

        # Body cells are marked 1.0 and the food cell 0.5.
        for part in body:
            canvas[part[0] - 1, part[1] - 1] = 1.

        canvas[self.food_pos[0] - 1, self.food_pos[1] - 1] = .5

        return canvas
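
For illustration, here is a minimal sketch of the stacking asked about in question 1: flatten the board matrix and append a 3-value local danger vector before feeding the model. The function name and the boolean inputs are hypothetical placeholders, not code from either project.

    import numpy as np

    def combined_state(canvas, danger_left, danger_forward, danger_right):
        """Sketch: flatten the 2-D board and append a 3-value local danger vector."""
        local = np.array([danger_left, danger_forward, danger_right], dtype=np.float32)
        # np.hstack concatenates 1-D arrays, so the 2-D canvas is flattened first.
        return np.hstack((canvas.flatten().astype(np.float32), local))

    # Example: a 12x12 canvas plus the local vector gives a 147-value input.
    state = combined_state(np.zeros((12, 12)), False, False, True)
    print(state.shape)   # (147,)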
  2. For implementing PER, what do you use as priorities? The absolute difference between Q(s,a) and Q(s',a'), or the MSE? And do you always update the PER priorities of the observations you batched?

Originally posted by @Neves4 in 5a18244#commitcomment-30772870

Follow-up questions:

  1. Do you think it's possible to use the distance to objects in the local state vector, instead of a safe/unsafe boolean?

  2. About the memory as well: do you think other types of trees could improve prioritization, like radix trees or heap queues?

  1. That's an interesting thought. However, I think it would not be better than the boolean local state vector, since you would lose the direction information (i.e., which direction is dangerous). This is just a guess, though; why not implement it and see the results?

  2. I don't think heap queues can meet the requirement, because what we want is to sample experiences according to their priority: the higher an experience's priority, the more likely it is to be chosen. A heap queue only lets us pick the experience with the highest priority, with a probability of 100%. Also, I have no idea how to implement this with a radix tree. The time complexity of the sum tree is O(log n), where n is the total number of samples, which I think is fast enough. If you are interested in more about sum trees, please feel free to read my article.
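
For reference, here is a minimal sum-tree sketch (a generic illustration, not the repository's implementation) showing why proportional sampling takes O(log n): leaves hold priorities, internal nodes hold subtree sums, and both updating and sampling walk a single root-to-leaf path.

    import random

    class SumTree:
        """Minimal sum tree: leaves hold priorities, internal nodes hold sums."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.tree = [0.0] * (2 * capacity)  # 1-based layout: node i has children 2i, 2i+1
            self.next_leaf = 0

        def add(self, priority):
            self.update(self.next_leaf, priority)
            self.next_leaf = (self.next_leaf + 1) % self.capacity  # overwrite oldest when full

        def update(self, leaf_index, priority):
            node = leaf_index + self.capacity
            change = priority - self.tree[node]
            while node >= 1:                     # propagate the change up to the root
                self.tree[node] += change
                node //= 2

        def sample(self):
            """Pick a leaf index with probability proportional to its priority."""
            value = random.uniform(0, self.tree[1])  # tree[1] holds the total priority
            node = 1
            while node < self.capacity:              # descend until a leaf is reached
                left = 2 * node
                if value <= self.tree[left]:
                    node = left
                else:
                    value -= self.tree[left]
                    node = left + 1
            return node - self.capacity              # index back into the replay buffer

    tree = SumTree(capacity=8)
    for p in [1.0, 5.0, 2.0]:
        tree.add(p)
    print(tree.sample())  # index 1 (priority 5.0) comes out most often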

Thanks a lot for the answers! Btw, very informative article; it helped me understand the algorithm.
Throughout the learning process, it's possible that the earliest experiences are not as good as the latest ones, right? In my current implementation, I was thinking about renewing the beginning of the experience replay once it is about 90% finished, since batching those poor experiences would actually worsen the model. Does PER, since it samples based on priorities, solve this problem?

Another question: have you heard about Hindsight Experience Replay, and do you know of any implementation? How do you think it compares to PER?
@chuyangliu

You are welcome! PER considers an experience "good" if it produces a large error between the actual value and the predicted value of Q under the current model. Personally, I think that is more reliable than just assuming "the latest experiences may be better than the earliest ones". But again, experiments are necessary to confirm which is better. 😃
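
As a sketch of the usual convention (following the PER paper, not necessarily this repository's exact code), the priority is the absolute TD error, and only the priorities of the experiences sampled in the batch are updated afterwards:

    def priority(reward, gamma, q_next_max, q_current, epsilon=1e-6, alpha=0.6):
        """|TD error|^alpha, with a small epsilon so no experience gets probability zero."""
        # Use q_next_max = 0 for terminal states.
        td_error = reward + gamma * q_next_max - q_current
        return (abs(td_error) + epsilon) ** alpha

    # After learning on a sampled batch, recompute the TD errors of exactly those
    # sampled experiences with the updated network and write the new priorities
    # back into the sum tree.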

And sorry, I haven't heard of Hindsight Experience Replay, since I haven't been focusing on reinforcement learning recently.

Thanks again! I agree with you, but since I'm still improving my old model, I'm trying to extract the most out of it. Two more questions:

  1. When you say you trained for 3,000,000 iterations, is that the memory size? The number of episodes?
  2. I noticed that you don't stack your states for, say, 4 frames. Instead you just learn every _FREQ_LEARN steps, right? Do you take the current motion into account anywhere else? I'm having problems with the dimensions while implementing the important spots we discussed and stacking them... Just to justify the question:

We have all these frames, but what gets fed as input to the “Q-Network”? The NATURE and NIPS paper used a sequence of four game frames stacked together, making the data dimension (4,84,84). (This parameter is called the “phi length”) The idea is that the action agents choose depends on the prior sequence of game frames. Imagine playing Breakout, for instance. Is the ball moving up or down? If the ball is moving down, you better get the paddle in position to bounce it back up. If the ball is moving up, you can wait a little longer or try to move in the opposite direction as needed if you think the ball will eventually reach there. -> Link to article
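
For reference, here is a minimal sketch of the 4-frame stacking described in that quote (generic code, not from this repository):

    from collections import deque
    import numpy as np

    PHI_LENGTH = 4                      # "phi length": number of stacked frames
    frames = deque(maxlen=PHI_LENGTH)   # keeps only the newest PHI_LENGTH frames

    def push_frame(frame):
        """Add the newest frame and return the stacked (PHI_LENGTH, H, W) input."""
        if not frames:
            # On the first frame, fill the window with copies so the shape is valid.
            frames.extend([frame] * PHI_LENGTH)
        else:
            frames.append(frame)
        return np.stack(frames, axis=0)  # e.g. (4, 84, 84) as in the DQN papers

    stacked = push_frame(np.zeros((84, 84)))
    print(stacked.shape)                 # (4, 84, 84)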

  1. It's the number of episodes.
  2. Yes, I didn't stack the states for 4 frames like the paper suggests. Instead I just treat a training sample as a single experience/frame. And I don't think I take the current action into account anywhere else.

Follow-up on question number 1 (thanks for all the answers though! sorry for being a noob haha):

  1. How big is your final memory (in number of observations)? Where are you training it (how big is the GPU, how much RAM, etc.)? I'm currently using Google Colab (Tesla K80 GPU, 12 GB RAM) with 10,000 episodes, and sometimes I feel that I run OOM...

The size of the memory is 100,000, if that's what you are looking for. I trained on my laptop with an NVIDIA GeForce GTX 960M; 3,000,000 iterations took about 24 hours.

@chuyangliu Thanks a lot for all the answers. They helped me a lot while implementing the improvements, and now I'm moving on to A3C and IMPALA.

All the best,
Victor.