This project is an implementation of the Soft Actor-Critic algorithm with a series of advanced features, built with TensorFlow 2.x. It can be used to train agents in Gym environments and in Unity environments built with ML-Agents.
- N-step
- V-trace (IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures)
- Prioritized Experience Replay (100% NumPy sum tree)
- *Episode Experience Replay
- R2D2 (Recurrent Experience Replay In Distributed Reinforcement Learning)
- *Representation model, Q function and policy structures
- *Recurrent Prediction Model
- Noisy networks (Noisy Networks for Exploration)
- Distributed training (Distributed Prioritized Experience Replay)
- Discrete action (Soft Actor-Critic for Discrete Action Settings)
- Curiosity mechanism (Curiosity-driven Exploration by Self-supervised Prediction)
- *Large-scale Distributed Evolutionary Reinforcement Learning
Features marked with * are ones we implemented ourselves.
Both Gym environments and Unity environments built with ML-Agents are supported.
Observations can be any combination of vectors and images, which means an agent can have multiple sensors and each image can have a different resolution.
The action space can be either continuous or discrete, but mixing continuous and discrete action outputs is not supported.
Multi-agent environments are not supported.
All neural network models should be in a .py file (default `nn.py`). All training configurations should be specified in `config.yaml`. Both neural network models and training configurations should be placed in the same folder under `envs`.
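For example, a single environment might be organized as follows (the folder name `my_env` is just a placeholder):

```
envs/
└── my_env/
    ├── config.yaml   # training configuration for this environment
    └── nn.py         # neural network models (referenced by the `nn` option)
```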
All default training configurations are listed below. They can also be found in `algorithm/default_config.yaml`.
base_config:
env_type: UNITY # UNITY or GYM
scene: scene # The scene name.
# If in Unity envs, it indicates the specific scene.
# If in Gym envs, it is just a readable name displayed in TensorBoard
no_graphics: true # For Unity envs, if an env does not need pixel input, set true
# Only for Unity Environments
build_path: # Unity executable path
win32: path_win32
linux: path_linux
port: 5005
# Only for Gym Environments
# build_path: GymEnv # Like CartPole-v1
name: "{time}" # Training name. Placeholder "{time}" will be replaced with the time that training begins
nn: nn # Neural network models file
n_agents: 1 # N agents running in parallel
max_iter: -1 # Max iteration
max_step: -1 # Max step. Training terminates when either max_iter or max_step is reached
max_step_each_iter: -1 # Max step in each iteration
reset_on_iteration: true # Whether to force reset the agent if an episode terminated
reset_config: null # Reset parameters sent to Unity
replay_config:
batch_size: 256
capacity: 524288
alpha: 0.9 # [0~1] convert the importance of TD error to priority. If 0, PER will reduce to vanilla replay buffer
beta: 0.4 # Importance-sampling, from initial value increasing to 1
beta_increment_per_sampling: 0.001 # Increment step
td_error_min: 0.01 # Small amount to avoid zero priority
td_error_max: 1. # Clipped abs error
sac_config:
seed: null # Random seed
write_summary_per_step: 1000 # Write summaries in TensorBoard every N steps
save_model_per_step: 100000 # Save model every N steps
save_model_per_minute: 5 # Save model every N minutes
ensemble_q_num: 2 # Number of Qs
ensemble_q_sample: 2 # Number of min Qs
burn_in_step: 0 # Burn-in steps in R2D2
n_step: 1 # Update Q function by N steps
use_rnn: false # Whether to use an RNN
tau: 0.005 # Coefficient of updating target network
update_target_per_step: 1 # Update target network every N steps
init_log_alpha: -2.3 # The initial log_alpha
use_auto_alpha: true # Whether to use automatic entropy adjustment
learning_rate: 0.0003 # Learning rate of all optimizers
gamma: 0.99 # Discount factor
v_lambda: 1.0 # Discount factor for V-trace
v_rho: 1.0 # Rho for V-trace
v_c: 1.0 # C for V-trace
clip_epsilon: 0.2 # Epsilon for q and policy clip
discrete_dqn_like: false # Whether to use only the Q network (DQN-like) rather than a policy when the action space is discrete
use_priority: true # Whether to use PER importance ratio
use_n_step_is: true # Whether to use N-step importance sampling
use_prediction: false # Whether to train a transition model
transition_kl: 0.8 # Coefficient of the KL divergence between the transition model and a standard normal
use_extra_data: true # Whether to use extra data to train the prediction model
use_curiosity: false # Whether to use curiosity
curiosity_strength: 1 # Curiosity strength, if curiosity is used
use_rnd: false # Whether to use RND
rnd_n_sample: 10 # Number of RND samples
use_normalization: false # Whether to use observation normalization
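As an illustration, an environment-specific `config.yaml` might override only a few of the defaults above; the folder name and values below are hypothetical, and this assumes that keys not listed fall back to `algorithm/default_config.yaml`:

```yaml
# envs/my_gym_env/config.yaml (hypothetical)
base_config:
  env_type: GYM
  scene: cartpole            # readable name shown in TensorBoard
  build_path: CartPole-v1    # Gym environment id
  name: "cartpole_{time}"
sac_config:
  n_step: 3                  # use 3-step returns instead of the default 1
replay_config:
  batch_size: 512
```

It could then presumably be trained with `python main.py my_gym_env`, following the command-line usage shown later.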
All default distributed training configurations are listed below. They can also be found in `ds/default_config.yaml`.
base_config:
env_type: UNITY # UNITY or GYM
scene: scene # The scene name.
# If in Unity envs, it indicates the specific scene.
# If in Gym envs, it is just a readable name displayed in TensorBoard
no_graphics: true # For Unity envs, if an env does not need pixel input, set true
# Only for Unity Environments
build_path: # Unity executable path
win32: path_win32
linux: path_linux
port: 5005
# Only for Gym Environments
# build_path: GymEnv # Like CartPole-v1
name: "{time}" # Training name. Placeholder "{time}" will be replaced with the time that training begins
nn: nn # Neural network models file
update_policy_mode: true # If true, actors update policy variables every `update_policy_variables_per_step` steps; if false, actors get actions from the learner each step
update_policy_variables_per_step: -1 # -1 means policy variables are updated every iteration
update_sac_bak_per_step: 200 # Update sac_bak every N steps
noise_increasing_rate: 0 # Noise = N * number of actors
noise_max: 0.1 # Max noise for actors
n_agents: 1 # N agents running in parallel
max_step_each_iter: -1 # Max step in each iteration
reset_on_iteration: true # Whether to force reset the agent if an episode terminated
max_actors_each_learner: 10 # The max number of actors of each learner
evolver_cem_length: 50 # Start CEM if all learners have evaluated `evolver_cem_length` times
evolver_cem_best: 0.3 # The ratio of best learners
evolver_cem_min_length: 2 # Start CEM if all learners have evaluated `evolver_cem_min_length` times,
# and it has been more than `evolver_cem_time` minutes since the last update
evolver_cem_time: 3
evolver_remove_worst: 4
net_config:
evolver_host: 127.0.0.1
evolver_port: 61000
learner_host: 127.0.0.1
learner_port: 61001
replay_host: 127.0.0.1
replay_port: 61002
reset_config: null # Reset parameters sent to Unity
replay_config:
batch_size: 256
capacity: 524288
alpha: 0.9 # [0~1] convert the importance of TD error to priority. If 0, PER will reduce to vanilla replay buffer
beta: 0.4 # Importance-sampling, from initial value increasing to 1
beta_increment_per_sampling: 0.001 # Increment step
td_error_min: 0.01 # Small amount to avoid zero priority
td_error_max: 1. # Clipped abs error
sac_config:
seed: null # Random seed
write_summary_per_step: 1000 # Write summaries in TensorBoard every N steps
save_model_per_step: 100000 # Save model every N steps
save_model_per_minute: 5 # Save model every N minutes
ensemble_q_num: 2 # Number of Qs
ensemble_q_sample: 2 # Number of min Qs
burn_in_step: 0 # Burn-in steps in R2D2
n_step: 1 # Update Q function by N steps
use_rnn: false # Whether to use an RNN
tau: 0.005 # Coefficient of updating target network
update_target_per_step: 1 # Update target network every N steps
init_log_alpha: -2.3 # The initial log_alpha
use_auto_alpha: true # Whether to use automatic entropy adjustment
learning_rate: 0.0003 # Learning rate of all optimizers
gamma: 0.99 # Discount factor
v_lambda: 1.0 # Discount factor for V-trace
v_rho: 1.0 # Rho for V-trace
v_c: 1.0 # C for V-trace
clip_epsilon: 0.2 # Epsilon for q and policy clip
discrete_dqn_like: false # Whether to use only the Q network (DQN-like) rather than a policy when the action space is discrete
use_prediction: false # Whether to train a transition model
use_extra_data: true # Whether to use extra data to train the prediction model
use_curiosity: false # Whether to use curiosity
curiosity_strength: 1 # Curiosity strength, if curiosity is used
use_rnd: false # Whether to use RND
rnd_n_sample: 10 # Number of RND samples
use_normalization: false # Whether to use observation normalization
# random_params:
# param_name:
# in: [n1, n2, n3]
# truncated: [n1 ,n2]
# std: n
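For distributed training, the `net_config` block above tells each component where to find the others. A minimal sketch of overriding it so that the evolver, learner, and replay processes run on separate machines might look like this (the hosts below are placeholders):

```yaml
# Hypothetical override of the distributed defaults
net_config:
  evolver_host: 192.168.1.10   # machine running the evolver process
  evolver_port: 61000
  learner_host: 192.168.1.11   # machine running the learner process
  learner_port: 61001
  replay_host: 192.168.1.12    # machine running the replay process
  replay_port: 61002
```

Actors started on other machines can point at the same hosts via the `--evolver_host`, `--learner_host`, and `--replay_host` options of `main_ds.py`, shown below.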
usage: main.py [-h] [--config CONFIG] [--run] [--render]
[--editor] [--logger_in_file] [--name NAME]
[--port PORT] [--nn NN] [--ckpt CKPT]
[--agents AGENTS] [--repeat REPEAT]
env
positional arguments:
env
optional arguments:
-h, --help show this help message and exit
--config CONFIG, -c CONFIG
config file
--run inference mode
--render render
--editor running in Unity Editor
--logger_in_file logging into a file
--name NAME, -n NAME training name
--port PORT, -p PORT communication port
--nn NN neural network model
--ckpt CKPT checkpoint to restore
--agents AGENTS number of agents
--repeat REPEAT number of repeated experiments
examples:
# Train the Gym environment mountain_car with the name "test_{time}", 10 agents, and the training repeated twice
python main.py gym/mountain_car -n "test_{time}" --agents=10 --repeat=2
# Train the Unity environment roller with the vanilla config and port 5006
python main.py roller -c vanilla -p 5006
# Run inference on the Unity environment roller with the model "nowall_202003251644192jWy"
python main.py roller -c vanilla -n nowall_202003251644192jWy --run --agents=1
usage: main_ds.py [-h] [--config CONFIG] [--run]
[--evolver_host EVOLVER_HOST]
[--evolver_port EVOLVER_PORT]
[--learner_host LEARNER_HOST]
[--learner_port LEARNER_PORT]
[--replay_host REPLAY_HOST]
[--replay_port REPLAY_PORT] [--in_k8s]
[--render] [--editor] [--logger_in_file]
[--name NAME] [--build_port BUILD_PORT]
[--nn NN] [--ckpt CKPT] [--agents AGENTS]
[--noise NOISE]
env {replay,r,learner,l,actor,a,evolver,e}
positional arguments:
env
{replay,r,learner,l,actor,a,evolver,e}
optional arguments:
-h, --help show this help message and exit
--config CONFIG, -c CONFIG
config file
--run inference mode
--evolver_host EVOLVER_HOST
evolver host
--evolver_port EVOLVER_PORT
evolver port
--learner_host LEARNER_HOST
learner host
--learner_port LEARNER_PORT
learner port
--replay_host REPLAY_HOST
replay host
--replay_port REPLAY_PORT
replay port
--in_k8s
--render render
--editor running in Unity Editor
--logger_in_file logging into a file
--name NAME, -n NAME training name
--build_port BUILD_PORT, -p BUILD_PORT
communication port
--nn NN neural network model
--ckpt CKPT checkpoint to restore
--agents AGENTS number of agents
--noise NOISE additional noise for actor
examples:
python main_ds.py bullet/walker evolver --evolver_host=127.0.0.1 --learner_host=127.0.0.1 --replay_host=127.0.0.1 --logger_in_file
python main_ds.py bullet/walker learner --evolver_host=127.0.0.1 --learner_host=127.0.0.1 --replay_host=127.0.0.1 --logger_in_file
python main_ds.py bullet/walker replay --evolver_host=127.0.0.1 --learner_host=127.0.0.1 --replay_host=127.0.0.1 --logger_in_file
python main_ds.py bullet/walker actor --evolver_host=127.0.0.1 --learner_host=127.0.0.1 --replay_host=127.0.0.1 --logger_in_file