Wheeler - Reinforcement Learning experimental framework ( Policy Gradients )

  • WIP ~ work in progress, implementing + testing ideas

  • Notes on Policy Gradient methods related to this project : https://medium.com/reinforcement-learning-w-policy-gradients

  • I focus mainly on Policy Gradients for now; the main engines implemented are DDPG + PPO via Actor-Critic ( the full list of features can be found below )

  • Key features of this framework are :

    • multi-agent + multi-PG-methods cooperation

      • e.g. 3 agents, each running a different RL method : DDPG, PPO, DQN ( the latter not implemented yet )
      • forms of cooperation :
        • sharing experiences ( local + central priority replay buffers )
        • 1 target model ( DDPG ), the others explorers ( PPO ) :
          • same-sized function approximator models ( a big part of them at least )
          • every X time steps, sync the explorers ( PPO ) with the target ( DDPG ) - copy the target agent's weights to the explorer agents ( most of the model, except the last layers )
          • this way, sample-efficient approaches like PPO can take the current main agent and explore the environment by copying its behaviour ( while preserving some part of their own specifics ~ the last layers )
          • therefore the replay buffer receives experiences that are more relevant to the current main agent ( DDPG ), gathered via a sample-efficient method ( PPO ) ~ theory : fast + efficient exploration of the environment + more robust and independent experiences - illustration code below
        from multiprocessing import Process, Queue
        import numpy as np

        stop_q = Queue() # signals when to stop training
        memory = coss_buffer(CFG) # our shared replay buffer

        worker = Process( # our sample-efficient explorer
            target = agent_launch,
            args = (1, PPO_CFG, ppo_callback, memory, ...))
        worker.start() # we can have multiple explorers ( or even targets )
        # now the main agent, run it in the main process
        agent_launch(0, DDPG_CFG, ddpg_callback, memory, ...)
        print("learning finished")
        ...
        def ppo_callback(_, agent, __): # every round we sync with the main agent
            agent.bot.alpha_sync(["neslayer_2_"])
            return False # we don't evaluate here
        def ddpg_callback(task, _, scores): # just evaluate
            return task.goal_met(np.mean(scores, 0))
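
        Below is a hedged sketch of what the partial weight sync behind a call like alpha_sync might look like in plain PyTorch; the helper name, the blacklist argument and the tau factor are illustrative assumptions, not the framework's actual API :

        import torch

        def partial_sync(explorer, target, blacklist, tau=1.):
            """Copy target weights into the explorer, skipping blacklisted layers.

            Hypothetical helper : parameters whose names contain any blacklist
            substring ( e.g. the explorer's own last layers ) stay untouched,
            the rest are ( soft- )copied from the target network.
            """
            t_params = dict(target.named_parameters())
            with torch.no_grad():
                for name, e_param in explorer.named_parameters():
                    if any(skip in name for skip in blacklist):
                        continue # preserve explorer-specific last layers
                    e_param.copy_(tau * t_params[name] + (1. - tau) * e_param)

        # usage : every X steps the PPO explorer mirrors most of the DDPG target
        # partial_sync(ppo_actor, ddpg_actor, blacklist=["neslayer_2_"])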
    • multi-task

      • e.g. separate 3D navigation into 3+ different reward functions
      • every sub-task has a different critic ( and reward function ) but a common actor ( can be detached ) ~ see the sketch below
        • detached : in SoftActorCritic fashion, with a target + explorer; while the target is common, the explorer can be sub-task specific
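
        A minimal sketch of this multi-task setup, assuming one PyTorch critic per sub-task reward function and a common actor; the reward functions, sizes and names below are illustrative assumptions, not Wheeler's actual interfaces :

        import torch
        import torch.nn as nn

        # e.g. 3D navigation split into one reward function per axis
        def reward_x(next_state, goal): return -abs(next_state[0] - goal[0])
        def reward_y(next_state, goal): return -abs(next_state[1] - goal[1])
        def reward_z(next_state, goal): return -abs(next_state[2] - goal[2])

        actor = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 3)) # common actor
        critics = [ # one critic per sub-task ( own reward function + environment copy )
            nn.Sequential(nn.Linear(6 + 3, 64), nn.ReLU(), nn.Linear(64, 1))
            for _ in range(3)]

        def critic_losses(states, actions, returns_per_task):
            # returns_per_task : one ( batch, 1 ) tensor of returns per reward function;
            # every critic learns from its own returns, while all of them score
            # ( and later update ) the same shared actor : actions = actor(states)
            sa = torch.cat([states, actions], dim=-1)
            return [nn.functional.mse_loss(critic(sa), ret)
                    for critic, ret in zip(critics, returns_per_task)]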
  • From a design perspective, these are some things that may be interesting to look at :

  • DISCLAIMER : the framework is currently quite slow ( especially with the replay buffer and reanalyzing ), and most of the features mentioned {RNN, multi-*, ...} are implemented but not properly tested; all benchmark implementations were tested on a simple DDPG setup with NoisyNetwork, N-step/GAE and SoftActorCritic ( I just refactored the whole design, so making clean benchmarks was the first milestone, next is work on features; first to go : RNN, second the Reacher benchmark implementation, third Reacher + HER, fourth Reacher + HER + multi bot, fifth Reacher + HER + multi bot + multi task, and so on )

  • RESEARCH vs overkill : given the number of features ( see the list below ) and its performance, it may look like a more proper name for the framework would be "overkiller". The main goal of this project is to test ideas and build an environment where I can turn them on or off interchangeably while the core keeps working. On the other hand, my benchmark implementation of Reacher ( 20 arms, no HER, just advantage, n-step, DDPG with a shared replay buffer ~ and therefore fast ) can be found at : https://github.com/rezer0dai/ReacherBenchmarkDDPG

Benchmark Implementations

Features & Ideas of the Wheeler experiment :

  1. Soft Actor-Critic, Off-Policy, Asynchronous, DDPG, PPO, vanilla PG, td-lambda, GAE, GRU/LSTM, Noisy Networks, HER, RBF, state stacking, OpenAi-Gym + Unity-MlAgents, PyTorch

  2. multiple agents ( with potentially different algorithms, DDPG + PPO ) working together ( envs/experimental/reacher_her.py )

  3. multiple critics for one actor; every critic ( task / simulation ) has its own environment to test the actor in parallel

  4. MULTIPLE GOALS : by separating critics, you are able to define multiple reward functions ( preferably as sparse as possible )

    • imagine 3D navigation : you can separate it into 3 reward functions, each rewarding progress along a particular dimension
    • or a task where the end of the episode matters more than the start, while the start is also important ( lunar lander )
    • reach + fetch + move tasks
  5. Noisy Networks for exploration ( as a head after the RNN )

  6. Replay buffer with support for HER by default ( and curiosity as priority weights + a decider of what to forget ~ better said, whether a new experience is worth remembering ) ~ see the sketch after this list

  7. Attention mechanism ( experimental + not properly tested yet ) on reward gradients ( for multiple simulations with different reward functions ), applied when learning the actor

  8. RNN as the actor network - GRU and also LSTM ( can also be done for the critic ) for enriching the Markovian state for RL

  9. easy to adapt encodings on top of it : RBF sampler, pre-trained CNN, normalizers

  10. built-in support for state normalization ( see the OpenAI implementation of Reacher )
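
A hedged sketch of the HER relabeling idea from point 6, assuming goal-conditioned transitions and a sparse reward function; the buffer layout and names below are illustrative assumptions, not Wheeler's actual replay-buffer API :

        import random

        def her_relabel(episode, reward_fn, k=4):
            """Hindsight Experience Replay, 'future' strategy ( sketch ).

            episode   : list of ( state, action, reward, next_state, goal ) tuples
            reward_fn : sparse reward, recomputed against the substituted goal
            """
            relabeled = []
            for t, (s, a, r, s2, g) in enumerate(episode):
                relabeled.append((s, a, r, s2, g)) # keep the original transition
                for _ in range(k): # plus k hindsight copies with achieved goals
                    future = random.randint(t, len(episode) - 1)
                    new_goal = episode[future][3] # pretend we wanted to end up there
                    relabeled.append((s, a, reward_fn(s2, new_goal), s2, new_goal))
            return relabeled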

X. other experiments :

  • default SoftMax Policy implementation for discrete actions ( see the sketch after this list )
  • MonteCarlo search configuration ( how many times to replay with the same seed )
  • immediate learning from the latest experience ( can be postponed : learn every x time steps, learn y times )
  • postponed learning at the end of the episode, from the replay buffer only ( you can select how much in the config )
  • select the steps of an episode you consider important ( good in task.py, and good_reach in cfg.toml ) to save to the replay buffer
  • originally supposed to be ML-framework independent ( PyTorch + Keras as backends, but when I moved to DDPG I moved towards PyTorch; in the future I may reintroduce Keras/TensorFlow )
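
A minimal sketch of what a default softmax policy for discrete actions might look like, using torch.distributions.Categorical; this is an illustrative assumption about the shape of such a policy, not the framework's actual implementation :

        import torch
        import torch.nn as nn

        class SoftMaxPolicy(nn.Module):
            """Maps a state to a categorical ( softmax ) distribution over discrete actions."""
            def __init__(self, state_size, n_actions, hidden=64):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(state_size, hidden), nn.ReLU(),
                    nn.Linear(hidden, n_actions)) # logits ~ the softmax is implicit below

            def forward(self, state):
                dist = torch.distributions.Categorical(logits=self.net(state))
                action = dist.sample()
                return action, dist.log_prob(action) # the log-prob feeds the PG loss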

...the framework works against the classic four ( Acrobot, Pendulum, MountainCar, CartPole ), which serve as its baseline

Requirements :

  • pytorch
  • openai gym : pip install gym
  • toml : pip install toml
  • jupyter notebook : cd wheeler, then test envs/*.ipynb

TODO :

Blog on what I have learned from the OpenAI Gym set : Acrobot, Pendulum, MountainCar, CartPole, and also from Reacher :

  1. Vanilla policy gradients on CartPole ( 1 - action ), log vs distributions
  2. DDPG vs vanilla PG on continuous control : MountainCar and Pendulum ( problems with approach 1 when there are multiple choices, PG distributions sigma+mu, and continuous spaces in general )
  3. RBF encoding for better feature capturing
  4. what's wrong with manual engineering of reward functions -> aka networks likely do not overestimate at all...
  5. HER on acrobot
  6. a minute of thinking about the ReplayBuffer ( with curiosity, with HER, .. )
  7. my view on why ReLU in NN layers is likely to be enough ( introducing non-linearity while preserving fair-enough gradients .. an idea from thinking about softmax on the last layer ~ logits vs output )
  8. MDP vs history of states ( multiple frames vs/combined with RNN-like memory )
  9. multiple tasks vs multiple critics and different reward functions
  10. multiple agents cooperating ( with potentially different algorithms )

important references ( more to add ) :