RL - Introduction

18 Mar 2019

들어가며

이 시리즈는 OpenAI 에서 나온 spinning up 에 소개된 key papers 를 중심으로 최근 RL 에서 연구되고 있는 여러 분야들을 알아봅니다. 논문을 디테일하고 엄밀하게 살펴보기보다는 각 논문에서 제안하는 핵심적인 아이디어들을 직관적으로 설명하는 것에 중점을 두었습니다. 논문에 따라 디테일한 부분이나 이론적인 부분을 조금 자세히 설명하는 경우도 있지만, 모든 디테일들이 궁금하다면 논문을 함께 참조하기 바랍니다.

이후의 내용들은 편의상 평어로 작성되었습니다.

Overview

taxonomy OpenAI spinning up 의 RL 알고리즘 분류

모든 분류가 그렇듯이 RL 알고리즘도 위와 같은 트리 구조로 분류하기에는 어려움이 있다. 전체적인 분류 형태를 참조만 하도록 하자.

Spinning up 에서 소개하고 있는 모든 key papers 를 여기서 다루는 것은 아니며, 반대로 여기에서 소개하는 논문이 key papers 에 없는 경우도 있다. 이 시리즈에서 다룰 논문들은 다음과 같다:

참고)

아래 논문 리스트 및 순서는 확정된 것이 아니며, 작성 과정에서 수정될 수 있음
일부 2번씩 등장하는 논문들이 있는데, 추후 해당 섹션까지 글이 작성될 때 분류를 재조정할 예정인 논문들임

Deep Q-learning
- DQN (Deep Q-Networks)
- DDQN (Double DQN)
- Dueling DQN
- PER (Priortized Experience Replay)
- C51
- NoisyNet
- Rainbow
Policy Gradients
- REINFORCE
- Actor-Critic
- Off-policy Actor-Critic
- A3C (Asynchronous Advantage Actor-Critic)
- GAE (Generalized Advantages Estimation)
- NPG (Natural Policy Gradient)
- TRPO (Trust Region Policy Optimization)
- PPO (Proximal Policy Optimization)
Deterministic Policy Gradients
- DPG (Deterministic Policy Gradients)
- DDPG (Deep DPG)
- TD3 (Twin Delayed DDPG)
Entropy Regularization
- SAC (Soft Actor-Critic)
- TAC (Tsallis Actor-Critic)
Path-Consistency Learning
- PCL (Path Consistency Learning)
Intrinsic Motivation
- CTS-based Pseudocounts
- ICM (Intrinsic Curiosity Module)
- RND (Random Network Distillation)
Hierarchical RL
- FuN (Feudal Networks)
- HIRO (Hierarchical RL with Off-policy correction)
- STRAW (Strategic Attentive Writer)
Transfer and Multitask RL
- HER (Hindsight Experience Replay)
- PathNet
Meta-RL
- RL^2
- SNAIL (Simple Neural Attentive Learner)
- MAML (Model-Agnostic Meta-Learning)
from Demonstration
- DDPGfD (DDPG from Demonstration)
- DQfD (Deep Q-learning from Demonstration)
- Overcoming Exploration with Demonstrations
- Learning from Youtube
Model-based RL
- I2A (Imagination-Augmented Agents)
- World models
- AGZ (AlphaGo Zero) / AZ (Alpha Zero) / EXIT (Expert Iteration)
Scaling RL
- A3C (Asynchronous Advantage Actor-Critic)
- IMPALA (Importance Weighted Actor-Learner Architecture)
- Ape-X
IRL
- GAIL (Generative Adversarial Imitation Learning)

DQN 이나 REINFORCE 등 시작점이 되는 알고리즘들은 간단하게 다루기는 하나 기본적으로 알고 있다고 가정한다.