xphongvn/rlcomp2020

Misleading implementation of soft update for the target network

Closed this issue · 1 comment

I saw the implementation of the soft update (the original idea is from the DDPG paper) for the target network in the target_train() function (DQNModel.py). However, unlike the original idea, which updates the target network at every training step with tau = 0.05, the code (TrainingClient.py) calls target_train() inside the save_model loop. This means the soft update of the target network is performed only once every 100 episodes, which can lead to a delayed target network and an inaccurate evaluation of future Q-values. I might be wrong about this because I am new to RL. Thank you.
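To make the suggestion concrete, here is a minimal sketch of a per-step soft (Polyak) target update in the DDPG style. Only target_train(), DQNModel.py, and tau = 0.05 come from this issue; the DQN class, its attribute names, and the Keras layer sizes are assumptions for illustration, not the repository's actual code.

```python
from tensorflow import keras


class DQN:
    """Minimal Q-network pair with a DDPG-style soft target update (sketch)."""

    def __init__(self, state_dim: int, action_dim: int, tau: float = 0.05):
        self.tau = tau
        self.model = self._build(state_dim, action_dim)          # online network
        self.target_model = self._build(state_dim, action_dim)   # target network
        self.target_model.set_weights(self.model.get_weights())  # start in sync

    @staticmethod
    def _build(state_dim: int, action_dim: int) -> keras.Model:
        return keras.Sequential([
            keras.Input(shape=(state_dim,)),
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dense(action_dim, activation="linear"),
        ])

    def target_train(self) -> None:
        # Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target
        online = self.model.get_weights()
        target = self.target_model.get_weights()
        mixed = [self.tau * w + (1.0 - self.tau) * tw for w, tw in zip(online, target)]
        self.target_model.set_weights(mixed)
```

In the DDPG paper this target_train() step would be called after every gradient update of the online network, rather than only inside the save_model loop.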

Hi superuser992002,
Thank you for your comment! You are right about the idea of the soft update for the target network in DQNModel.py. In DQNModel.py we apply the DDPG idea of updating the target network with tau = 0.05; however, we keep the original Deep Q-learning schedule, in which the target network is updated every C steps. This approach should make the training process more stable, even though learning may be slower. Therefore, the delayed update of the target network should not lead to an imprecise evaluation of the next Q-value.
For this MinerCraft game, you might try other techniques for updating the target network, or keep the original ideas of DQN or DDPG, to find the best approach. Good luck! Feel free to continue the discussion.
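For concreteness, the two update schedules discussed in this thread can be contrasted as below. This sketch reuses the hypothetical DQN class from the earlier comment; the dummy batch data, step counts, and the value of C are illustrative only and are not taken from the repository.

```python
import numpy as np

# Assumes the DQN sketch above is defined in the same session.
agent = DQN(state_dim=4, action_dim=3, tau=0.05)
agent.model.compile(optimizer="adam", loss="mse")
states = np.random.rand(32, 4).astype("float32")
q_targets = np.random.rand(32, 3).astype("float32")

# (a) DDPG-style schedule: soft update after every gradient step.
for step in range(10):
    agent.model.train_on_batch(states, q_targets)
    agent.target_train()                  # tau = 0.05 applied every step

# (b) Schedule kept in this repository: the same soft update, but applied only
#     every C steps (e.g. once per save_model cycle), as in the original DQN.
C = 5
for step in range(10):
    agent.model.train_on_batch(states, q_targets)
    if step % C == 0:
        agent.target_train()              # delayed but more stable target updates
```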