AI611_review : 10 reviews and some simple reviews

Writer : sejik park

Review 1 : Proximal Policy Optimization Algorithms

Proximal policy optimization (PPO) is an on-policy algorithm that optimizes a clipped surrogate objective. It retains some of the benefits of trust region policy optimization (TRPO), which comes with a proof of monotonic improvement, because PPO's clipping ignores changes in the probability ratio when they would improve the objective beyond the clip range and includes them when they would worsen it. In other words, PPO removes the incentive for moving the ratio outside of the defined interval. The paper tests not only the clipped objective but also an adaptive KL penalty, but penalizing the KL divergence performs worse than the clipped method. Overall, PPO outperforms A2C, A2C with trust region, the cross-entropy method, vanilla policy gradient, and TRPO on the continuous control benchmark. As a discussion point, finding a good entropy coefficient without an exhaustive search could be interesting, and meta-optimization could be one way to solve this problem.
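
The clipped surrogate objective described above can be written as a short loss function. This is a minimal sketch assuming log-probabilities and advantages are already computed as tensors; it is not the authors' reference implementation.

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective, negated so it can be minimized."""
    ratio = torch.exp(log_prob_new - log_prob_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # The element-wise minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps] when doing so would increase the objective.
    return -torch.min(unclipped, clipped).mean()
```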

Review 2 : Soft Actor Critic Algorithms and Applications

Soft Actor-Critic (SAC) is an off-policy actor-critic algorithm based on the maximum entropy RL framework, which aims to learn the task while acting as randomly as possible. In other words, SAC maximizes policy entropy, which encourages the policy to explore more widely. As a result, it can capture multiple modes of near-optimal behavior, which helps with the brittleness to hyperparameters that troubles model-free reinforcement learning. To accomplish this goal, the authors theoretically show the convergence of soft policy iteration and then formulate the practical SAC algorithm. In detail, SAC trains separate policy and value function networks so that it can learn off-policy, and it learns the policy by minimizing the expected KL divergence to the distribution induced by the soft Q-function. In addition, a variant of SAC that adjusts the temperature automatically is also suggested. In conclusion, they show better performance than the off-policy TD3 and the on-policy PPO on the OpenAI Gym benchmarks and the rllab implementation of the Humanoid environment. Also, two real-world robot tasks (a quadrupedal robot and a dexterous hand) can be learned directly end-to-end with SAC. As a discussion point, although the partition function that normalizes the soft Q-value distribution is dropped because it does not affect the gradient, I think keeping a normalization step, for example with a bounded discrete distributional value function, could help stabilize training.
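
The policy and temperature updates described above can be summarized in a few lines. This is a minimal sketch assuming a reparameterized `policy(states)` that returns actions and their log-probabilities, two Q-networks `q1` and `q2`, and a learnable `log_alpha`; the names are placeholders, not the authors' code.

```python
import torch

def sac_actor_and_alpha_loss(policy, q1, q2, log_alpha, states, target_entropy):
    actions, log_probs = policy(states)                     # reparameterized sample
    q_min = torch.min(q1(states, actions), q2(states, actions))
    alpha = log_alpha.exp()
    # Policy loss: KL to the soft-Q induced distribution, up to the partition
    # function, whose gradient with respect to the policy is zero.
    policy_loss = (alpha.detach() * log_probs - q_min).mean()
    # Temperature loss: push the policy entropy toward the target entropy.
    alpha_loss = -(log_alpha * (log_probs.detach() + target_entropy)).mean()
    return policy_loss, alpha_loss
```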

Review 3 : Prioritized Experience Replay

Prioritized Experience Replay (PER) is an algorithm that samples transitions according to a priority, which makes learning more efficient and effective than uniform sampling. The authors use the magnitude of the temporal-difference error as the priority. This priority is then used in several prioritization schemes: greedy, oracle, proportional, and rank-based. Because greedy prioritization only updates priorities for transitions that are actually replayed, the other schemes work better. However, prioritized sampling also leads to a loss of diversity, which introduces bias, so PER corrects the bias with importance sampling. Overall, the experimental results are as follows. In a toy example, the oracle, which samples the transition that maximally reduces the global loss, improves efficiency on the Blind Cliffwalk environment, a sparse-reward setting. Experiments on Atari show that proportional and rank-based prioritization improve training efficiency over Double DQN by a factor of two. PER is also effective at reducing the delay until the first reward in some Atari games. As a discussion point, there may be other reasons why a transition should be prioritized; for example, some transitions can help exploration, while others can be used to avoid forgetting. So I am curious about the pros and cons of a learnable prioritization, if such a thing exists, or whether this could be solved with meta-gradients.
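
As a concrete illustration of proportional prioritization with importance-sampling correction, here is a minimal sketch; a practical implementation would use a sum-tree for efficient sampling, and all names here are illustrative.

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current maximum priority so each is replayed at least once.
        self.data.append(transition)
        self.priorities.append(max(self.priorities, default=1.0))
        if len(self.data) > self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size):
        scaled = np.asarray(self.priorities) ** self.alpha
        probs = scaled / scaled.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct the bias from non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```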

Review 4 : Dream to Control: Learning Behaviors by Latent Imagination

Dreamer shows how to use a world model with latent imagination to improve model-based reinforcement learning. Dreamer encodes states with a representation model, a transition model, and a reward model. In detail, the authors also consider maximizing the mutual information between model states and observations instead of predicting observations from states. Dreamer then generates imagined trajectories of latent states and actions for learning behaviors: from these imagined trajectories, it predicts state values and learns actions that maximize the value predictions. In conclusion, the Dreamer agent acts in the environment by encoding the history into a latent state and predicting the next action from that state. As a result, the authors show that Dreamer learns efficiently and is robust to long horizons. As a discussion point, I think the model prediction could be enhanced with a transformer architecture.
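
The imagination step described above can be sketched as a simple rollout in latent space. This is a minimal sketch under the assumption that `transition_model(state, action)` returns the next latent state and `action_model(state)` returns an action; these names are placeholders rather than the paper's interface.

```python
import torch

def imagine_rollout(transition_model, action_model, start_state, horizon):
    states, actions = [start_state], []
    state = start_state
    for _ in range(horizon):
        action = action_model(state)             # act purely in latent space
        state = transition_model(state, action)  # no decoding back to pixels
        states.append(state)
        actions.append(action)
    # The value and action models are then trained on these imagined trajectories.
    return torch.stack(states), torch.stack(actions)
```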

Review 5 : IQ-Learn: Inverse soft-Q Learning for Imitation

Inverse soft-Q learning (IQ-Learn) is a framework for imitation learning. It handles the setting of very sparse expert data by learning a single Q-function instead of running an adversarial optimization process, which can suffer from high-variance gradient estimators. In other words, IQ-Learn is dynamics-aware imitation learning that implicitly represents both the reward and the policy. These implicit functions are learned through a modified Q-function objective that integrates adversarial learning's two-step process: 1) reward learning, which assigns high reward to the expert policy and low reward to others, and 2) policy improvement, which searches for the best policy under that reward function. As a result, with regularization, IQ-Learn can recover a reward that is highly correlated with the real reward. In experiments with sparse expert data, it outperforms prior methods on offline control tasks, Atari, and MuJoCo. As a discussion point, the implicit reward sometimes cannot be learned flexibly; for example, using only expert data can achieve higher performance than using all of the data. So I think it would be helpful to use an algorithm that prioritizes expert data while still making use of the other data instead of excluding it.
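
The implicit reward mentioned above comes from the inverse soft Bellman operator, which can be written in one line. This is a minimal sketch assuming learned soft Q and value functions `q_fn` and `value_fn`; the names are illustrative.

```python
import torch

def iq_implicit_reward(q_fn, value_fn, states, actions, next_states, gamma=0.99):
    # Inverse soft Bellman operator: r(s, a) = Q(s, a) - gamma * V(s'),
    # so the reward never has to be represented by a separate network.
    return q_fn(states, actions) - gamma * value_fn(next_states)
```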

Review 6 : Generative Adversarial Imitation Learning

Generative adversarial imitation learning (GAIL) is a framework that combines adversarial learning over state-action pairs with a KL-constrained natural gradient step for the policy. The authors show the relationship between GAIL and inverse reinforcement learning, and then explain GAIL as a direct policy learning algorithm that bypasses the intermediate inverse reinforcement learning step. In conclusion, GAIL improves efficiency compared to behavioral cloning and the FEM and GTAL baselines on nine physics-based control tasks with expert data generated by TRPO, and it performs better than behavioral cloning especially when the expert data is not rich. As a discussion point, I think generating trajectories instead of only sampling existing ones could be helpful for adversarial training, because the discriminator could be biased toward the existing trajectories, and a generator could provide more diverse trajectories from the given data.
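
The adversarial part can be illustrated with a discriminator loss and the surrogate reward fed to the policy step. This is a minimal sketch assuming `disc` maps concatenated (state, action) pairs to a logit and that expert pairs are labeled 1; it is one common sign convention, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gail_discriminator_loss(disc, expert_sa, policy_sa):
    expert_logits = disc(expert_sa)
    policy_logits = disc(policy_sa)
    # Classify expert pairs as 1 and policy pairs as 0.
    return (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
            + F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits)))

def gail_surrogate_reward(disc, policy_sa):
    # Higher when the discriminator believes the pair looks like expert data;
    # this reward is then maximized with the KL-constrained (TRPO) policy step.
    return -F.logsigmoid(-disc(policy_sa))   # = -log(1 - D(s, a)) with D = sigmoid(logit)
```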

Review 7 : Conservative Q-Learning for Offline Reinforcement Learning

Conservative Q-learning (CQL) is an algorithm that avoids the failures induced by value overestimation under the distribution shift caused by out-of-distribution actions. To achieve this goal, CQL minimizes Q-values under an appropriately chosen action distribution while tightening the bound using the actions in the dataset. In other words, it updates itself toward a lower bound obtained by penalizing Q-values under the chosen distribution and complementing them with Q-values under the current behavior policy. As a result, the CQL Q-function can discourage out-of-distribution actions through its gap-expanding property, which means the difference between Q-values of in-distribution and out-of-distribution actions is expanded beyond the actual difference. In conclusion, CQL shows significant performance improvements over SAC on continuous control datasets from the D4RL benchmark. As a discussion point, I wonder about the performance of CQL in multi-task environments, because conservative estimation does not take into account factors that could generalize. If this concern is valid, I think adversarial learning between conservatism and generalization could be a solution.
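
The conservative penalty described above can be sketched as a regularizer added to the standard Bellman error. This is a minimal sketch of a CQL(H)-style term, assuming `sampled_actions` holds candidate actions of shape [B, N, act_dim] (for example uniform and policy samples); the interface is illustrative, not the authors' implementation.

```python
import torch

def cql_penalty(q_net, states, dataset_actions, sampled_actions):
    B, N, _ = sampled_actions.shape
    flat_actions = sampled_actions.reshape(B * N, -1)
    rep_states = states.repeat_interleave(N, dim=0)
    q_sampled = q_net(rep_states, flat_actions).reshape(B, N)
    push_down = torch.logsumexp(q_sampled, dim=1)   # soft maximum over candidate actions
    push_up = q_net(states, dataset_actions)        # actions actually in the dataset
    # Added (with a weight alpha) to the usual Bellman loss; it widens the gap
    # between in-distribution and out-of-distribution Q-values.
    return (push_down - push_up).mean()
```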

Review 8 : Offline Reinforcement Learning with Implicit Q-Learning

Implicit Q-learning (IQL) is an offline reinforcement learning method that avoids value function overestimation by approximating an upper expectile of the value distribution. The expectile is estimated with an asymmetric squared loss, and the degree of asymmetry defines an entire spectrum of methods between SARSA and Q-learning. However, a large target value might come from a single lucky transition rather than from a genuinely good action, so IQL resolves this by treating the state value function implicitly with a separate network. IQL then estimates the Q-function and extracts the policy via advantage-weighted regression, whose temperature hyperparameter defines an algorithm between behavior cloning and taking the maximum of the Q-function. In conclusion, under certain assumptions, IQL indeed approximates the optimal state-action value function and performs multi-step dynamic programming. As a discussion point, I think IQL defines a general algorithm family that can cover diverse algorithms by controlling its hyperparameters, but the results seem complicated to interpret. It would be good to find the best configuration through learning (i.e., meta-learning) and present a cleaner results table.
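
The expectile regression at the heart of IQL reduces to a short asymmetric loss. This is a minimal sketch, assuming `diff = Q(s, a) - V(s)` computed on dataset actions; the function name is illustrative.

```python
import torch

def expectile_loss(diff, tau=0.7):
    # Asymmetric squared loss: tau = 0.5 recovers ordinary least squares
    # (SARSA-like), while tau -> 1 approaches a maximum over dataset actions
    # (Q-learning-like) without querying out-of-distribution actions.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()
```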

Review 9 : Decision Transformer: Reinforcement Learning via Sequence Modeling

Decision Transformer is a causally masked transformer that predicts actions by conditioning on the desired return, past states, and past actions. The transformer architecture is well suited to sparse-reward environments because self-attention can attend to every previous state and action. In conclusion, its performance is comparable with or better than previous best methods on offline RL benchmarks in Atari, OpenAI Gym, and Key-to-Door. In addition, the authors show that the transformer performs better than behavior cloning in both the plentiful-data and the low-data regimes. Decision Transformer can also generate trajectories whose achieved return is correlated with the target return, and this relationship holds approximately linearly up to and beyond the best trajectory's return, which suggests generalization to return distributions not seen in the data. As a discussion point, it uses causal masking for autoregressive generation, but I think some additional information could be delivered in a bidirectional (or reverse) way. For example, since Decision Transformer is conditioned on an inflated maximum return because it is difficult to specify the best achievable return for a task directly, it might be possible to predict diverse future trajectories and derive a value from them as additional information for correcting the overly optimistic target into a precisely achievable return.
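
The return conditioning works by feeding a return-to-go token before each state and action. This is a minimal sketch of how the conditioning sequence could be assembled at evaluation time; the function and argument names are illustrative, not the authors' API.

```python
import numpy as np

def build_conditioning_sequence(states, actions, rewards, target_return):
    # At evaluation the first return-to-go is the desired target return,
    # and it is decremented by each reward observed so far.
    returns_to_go = [target_return]
    for r in rewards[:-1]:
        returns_to_go.append(returns_to_go[-1] - r)
    # Tokens are interleaved as (R_1, s_1, a_1, R_2, s_2, a_2, ...); the causal
    # mask lets the model predict a_t from everything up to (R_t, s_t).
    return np.asarray(returns_to_go), np.asarray(states), np.asarray(actions)
```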

Review 10 : Data-Efficient Hierarchical Reinforcement Learning

Hierarchical Reinforcement learning with Off-policy correction (HIRO) solves the non-stationary training problem of a two-level hierarchical off-policy setup in which lower-level controllers are supervised with goals generated by the higher-level controller, where a goal represents a desired relative change in the state observation. To solve this non-stationarity, HIRO uses an off-policy correction that re-labels a past transition's goal with a new goal, chosen from candidates sampled from a Gaussian centered on the observed state change, that would have been most likely to produce the transition's actions. As a result, HIRO can use past experience more stably, and it performs better than FuN, SNN4HRL, and VIME on the Ant-based environments. As a discussion point, combining goal relabeling with correcting the action to match the goal could help the agent learn more about difficult goals, because relabeling the goal alone can be biased toward easier targets that correspond to the current actions; a combined algorithm would give more opportunities to learn about difficult goals. In addition, it would be nice to see experiments on environments that require less hierarchy, because I want to know whether the hierarchical algorithm interferes with learning other, less hierarchical information.
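
The off-policy correction can be sketched as a small relabeling routine. This is a minimal sketch assuming `low_policy(state, goal)` returns the lower-level action and a fixed-variance Gaussian action model, so maximizing log-likelihood reduces to minimizing squared action error; the names and candidate count are illustrative.

```python
import numpy as np

def relabel_goal(low_policy, states, actions, orig_goal, num_candidates=8, sigma=1.0):
    observed_change = states[-1] - states[0]
    # Candidates: the original goal, the observed state change, and Gaussian samples around it.
    candidates = [orig_goal, observed_change]
    candidates += [observed_change + sigma * np.random.randn(*orig_goal.shape)
                   for _ in range(num_candidates)]

    def action_error(goal):
        # Squared error between stored low-level actions and the actions the
        # current low-level policy would take under this candidate goal.
        return sum(np.sum((low_policy(s, goal) - a) ** 2)
                   for s, a in zip(states[:-1], actions))

    # Keep the goal that best explains the actions actually taken.
    return min(candidates, key=action_error)
```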

Simple Review : Playing hard exploration games by watching YouTube

This paper uses video information and self-supervised learning to excel in sparse-reward environments. Learning consists of two stages: self-supervised representation learning and agent learning. In the self-supervised stage, the model learns the temporal gap between frames from images and sounds, which bridges the domain gap between YouTube videos and Atari. The agent is then trained with an imitation learning reward defined on the learned representation. Finally, the method shows good results on Montezuma's Revenge, a hard exploration game.

Simple Review : Mean Field Multi-Agent Reinforcement Learning

This paper designs a Q-function that considers only the agents in the neighborhood of each agent, in order to reduce the problem complexity in a multi-agent environment.

Simple Review : Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

This paper suggests a method for learning in multi-agent environments using centralized Q-learning and an ensemble of policies. The authors show that multi-agent performance improves over DDPG in toy environments such as cooperative communication and predator-prey.

Simple Review : Discovering faster matrix multiplication algorithms with reinforcement learning

This paper trains an agent to find efficient matrix multiplication algorithms by framing the search over addition and multiplication operations as a game.

Simple Review : Grandmaster level in StarCraft II using multi-agent reinforcement learning

This paper applies reinforcement learning to a complex strategy game. The authors utilize clipped importance sampling and self-imitation learning for agent training, and develop a league-based training method that creates competition with continuous performance improvement. For example, through fictitious self-play, an agent can learn against strategies that have mutual advantages and disadvantages over each other. In addition, for the competitive learning process, the authors place constraints on opponent selection based on performance and the overall league status. They also modify multi-agent learning so that performance increases steadily. Finally, the system shows good performance even under settings designed to be fair with respect to human operating ability.