JuliaML/Reinforce.jl

Taking action based on set of states

Closed this issue · 3 comments

When using epsilon-greedy methods to take an action, the neural network predicts which action to take based on the input state. Recently there have been developments where, instead of one state (i.e., the current state), the neural network takes the difference between the current state and the state one timestep before (s_t - s_t-1), or accepts a set of states as input and predicts an action.

I am guessing that if such an action function is to be implemented, we need to modify the call to action. I am interested in developing this functionality. Can anyone point me in the right direction?

Is there any paper that I can consult?

neural network takes the difference between current state and the state one timestep before (s_t - s_t-1).

I would consider this part of feature extraction, or an internal state of the policy.
Here is my idea:

memory_buffer = []
ϕ!(s) = ...  # store state into memory_buffer and do feature extraction
ep = Episode(env, π)

for (s, a, r, s′) in ep
  ϕ!(s)
  ...
end

or

ep = Episode(env, π)

for (s, a, r, s′) in ep
  ...
  π.last_state = s
end
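A minimal sketch of what the ϕ! in the first variant could do, assuming a global memory_buffer and states represented as Float64 vectors (these representation choices are my assumptions, not Reinforce.jl API):

```julia
# Hypothetical sketch: a stateful feature extractor that returns s_t - s_{t-1}.
# The very first call has no previous state, so a zero vector stands in for it.
memory_buffer = Vector{Vector{Float64}}()

function ϕ!(s::Vector{Float64})
    prev = isempty(memory_buffer) ? zeros(length(s)) : memory_buffer[end]
    push!(memory_buffer, s)          # remember the raw state for the next call
    return s .- prev                 # difference feature fed to the network
end
```

For example, ϕ!([1.0, 2.0]) returns [1.0, 2.0] on the first call, and a following ϕ!([3.0, 5.0]) returns [2.0, 3.0].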

the difference between current state and the state one timestep before.

I have a time-series application that needs to roll out a time window for neural nets. My neural nets need
s_t - s_t-1, s_t - s_t-2, ..., s_t - s_t-n as input. In this case, I did feature extraction first, created a
larger table, and made this new table my environment. Thus, the state from the new environment is a complete time window.
In case we cannot determine s_t - s_t-1 first, I think my previous snippets are okay.
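The rolling-window preprocessing described above might, for a scalar series, look like the following sketch. The name rolled_diffs and the matrix layout are illustrative assumptions, not the author's actual code:

```julia
# Hypothetical sketch: from a raw series, build a table whose row for time t
# holds s_t - s_{t-1}, ..., s_t - s_{t-n}. Each row is one "state" of the
# new environment described above.
function rolled_diffs(series::Vector{Float64}, n::Int)
    T = length(series)
    # (T - n) × n matrix: rows index t = n+1 .. T, columns index the lag k.
    return [series[t] - series[t-k] for t in (n+1):T, k in 1:n]
end
```

For instance, rolled_diffs([1.0, 2.0, 4.0, 7.0], 2) yields the 2×2 matrix [2.0 3.0; 3.0 5.0].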

But I'm not sure about the design philosophy in the original paper; maybe we need to change the design of action.

Thanks for the reply!
I am not aware of a paper using s_t - s_t-1, but the 'Pong from Pixels' code does. I wanted to implement such a design in my implementation.

ep = Episode(env, π)
for (s, a, r, s′) in ep
  ...
  π.last_state = s
end

In this case isn't the episode already over, where only s_t was taken into account to predict action?

In this case isn't the episode already over, where only s_t was taken into account to predict action?

Well, inside the for loop the episode isn't over yet. The underlying implementation of Episode supports the iteration protocol, so you can manually iterate it to check what's going on via start(...); next(...); next(...), etc.
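Stepping an iterator by hand can be sketched like this. Note that start/next above is the pre-1.0 spelling; in Julia ≥ 1.0 the same protocol is a single iterate function, and a toy zip iterable stands in for Episode(env, π) here:

```julia
# Sketch: drive an iterator manually, as a for loop does under the hood.
function step_through(it)
    seen = Any[]
    next = iterate(it)               # (element, state), or nothing when done
    while next !== nothing
        (x, state) = next
        push!(seen, x)               # here you could inspect (s, a, r, s′)
        next = iterate(it, state)    # advance to the next element
    end
    return seen
end

step_through(zip(1:3, 10:12))        # → [(1, 10), (2, 11), (3, 12)]
```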

I think my snippet can be refined as follows.

function action(π::MyPolicy, r, s)
   a = π(s .- π.last_s)  # do action selection stuff
   π.last_s = s
   return a
end

ep = Episode(env, π)

for (s, a, r, s′) in ep
    # ...
end
# episode end
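A self-contained sketch of what such a policy could look like. MyPolicy, its net field, and the zero initialization of last_s are my assumptions for illustration, not Reinforce.jl's actual interface:

```julia
# Hypothetical policy type that carries its previous state, so that
# action selection can operate on the difference s .- last_s.
mutable struct MyPolicy
    net                              # anything callable on a feature vector
    last_s::Vector{Float64}
end

MyPolicy(net, n::Int) = MyPolicy(net, zeros(n))   # zero "previous state" at start

(π::MyPolicy)(x) = π.net(x)          # make the policy callable, as in the snippet

function action(π::MyPolicy, r, s)
    a = π(s .- π.last_s)             # select an action from the state difference
    π.last_s = s                     # remember this state for the next step
    return a
end
```

With π = MyPolicy(sum, 2), the call action(π, 0.0, [1.0, 2.0]) returns 3.0, and a second call action(π, 0.0, [2.0, 2.0]) returns 1.0, since only the difference [1.0, 0.0] reaches the network.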