DQN_Cartpole Refer to https://zhuanlan.zhihu.com/p/21477488 (1)动作响应: state + action = reward + next_state (2)DQN input: state, action output: reward + γ•max(Q(next state, all actions)) Q_action为动作价值函数