cts198859/deeprl_network

Whether manual parallel sampling will cause problems with the LSTM design

Closed this issue · 1 comment

Hi, since SUMO is slow and, as far as we know, does not support parallel sampling, we are trying to manually construct several parallel environments during training, each with SUMO as its core, and step them in a serial manner. This seems to turn training into an off-policy process, since samples are collected from several environments. My concern is whether this will disturb the LSTM, since it records the global hidden state of a single environment.
If we want to end up with a parallel sampling scheme, is asynchronous sampling necessary? A rough sketch of what we have in mind follows below.
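For concreteness, here is a minimal sketch (not this repo's code) of the setup we are considering: each environment keeps its own LSTM hidden state, and the environments are stepped serially. `make_env` and `policy` are hypothetical placeholders, not names from this repo.

```python
# Sketch: serial round-robin sampling over several SUMO-backed envs,
# keeping one LSTM hidden state per env so each trajectory stays
# internally consistent. make_env/policy are hypothetical placeholders.
n_envs = 4
rollout_len = 128
envs = [make_env(seed=i) for i in range(n_envs)]           # hypothetical env constructor
obs = [env.reset() for env in envs]
hidden = [policy.initial_state() for _ in range(n_envs)]   # one recurrent state per env
rollouts = [[] for _ in range(n_envs)]

for step in range(rollout_len):
    for i, env in enumerate(envs):                         # serial round-robin over envs
        action, hidden[i] = policy.act(obs[i], hidden[i])  # state i only ever sees env i
        next_obs, reward, done, info = env.step(action)
        rollouts[i].append((obs[i], action, reward, done))
        obs[i] = env.reset() if done else next_obs
        if done:
            hidden[i] = policy.initial_state()             # reset hidden state at episode end
```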

Thanks for the insightful question. I think the LSTM should be fine as long as we feed it MDP trajectories; the concern is more on the off-policy side. For example, A2C follows on-policy training, and the current policy is directly improved based on its performance in the last MDP trajectory. If there is a delay so that the target policy != the behavior policy used for experience collection, the update may be off. Some adjustment (e.g., importance sampling) is needed to update A2C in an off-policy way.
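For illustration only, here is a minimal sketch of an importance-sampling-corrected A2C-style loss. It is written in PyTorch purely as an example and is not this repo's code; the names, shapes, and the clipping constant are assumptions.

```python
# Sketch: A2C-style loss with an importance-sampling ratio so the update
# remains valid when samples were collected by a slightly stale behavior policy.
import torch

def off_policy_a2c_loss(logp_target, logp_behavior, values, returns, rho_clip=1.0):
    """logp_target:   log pi_theta(a|s) under the current (target) policy
       logp_behavior: log mu(a|s) recorded when the sample was collected
       values:        critic estimates V(s)
       returns:       empirical (or bootstrapped) returns"""
    advantages = (returns - values).detach()
    # importance ratio rho = pi_theta / mu, clipped to bound variance
    rho = torch.exp(logp_target - logp_behavior).clamp(max=rho_clip)
    policy_loss = -(rho.detach() * logp_target * advantages).mean()
    value_loss = 0.5 * (returns - values).pow(2).mean()
    return policy_loss + value_loss
```

Clipping the ratio (in the spirit of V-trace/Retrace) keeps the variance bounded when the behavior policy has drifted from the target policy.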