Reinforcement Learning

This is an introduction to Reinforcement Learning.


Introduction to Reinforcement Learning


What is Reinforcement Learning?

An Overview of Reinforcement Learning (RL), Part 1


Policy Gradient

Blog: DRL Lecture 1: Policy Gradient (Review)


Actor-Critic


Reward Shaping


Algorithms

Taxonomy of RL Algorithms

Blog: Kinds of RL Algorithms

  • Value-based methods: Deep Q-Learning
    • Learn a value function that maps each state-action pair to a value (see the update-rule sketch after this list).
  • Policy-based methods: REINFORCE with Policy Gradients
    • Directly optimize the policy without using a value function.
    • Useful when the action space is continuous or stochastic.
    • Uses the total reward of the episode.
  • Hybrid methods: Actor-Critic
    • A Critic that measures how good the action taken is (value-based).
    • An Actor that controls how the agent behaves (policy-based).
  • Model-based methods: Partially Observable Markov Decision Process (POMDP)
    • State-transition models
    • Observation models
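
For illustration, here is a minimal sketch of the two basic update rules; the environment, states, and hyperparameters are placeholders, not taken from the blog above.

    import numpy as np

    # Value-based: tabular Q-learning update for one transition (s, a, r, s').
    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        td_target = r + gamma * np.max(Q[s_next])    # bootstrap from the best next action
        Q[s, a] += alpha * (td_target - Q[s, a])     # move Q(s,a) toward the TD target
        return Q

    # Policy-based (REINFORCE): weight each log pi(a_t|s_t) gradient by the return G_t.
    def discounted_returns(rewards, gamma=0.99):
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.append(G)
        return list(reversed(returns))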

List of RL Algorithms

  1. Q-Learning
  2. A2C (Advantage Actor-Critic): Actor-Critic Algorithms
  3. DQN (Deep Q-Networks): 1312.5602
  4. TRPO (Trust Region Policy Optimization): 1502.05477
  5. DDPG (Deep Deterministic Policy Gradient): 1509.02971
  6. DDQN (Deep Reinforcement Learning with Double Q-learning): 1509.06461
  7. DD-Qnet (Double Dueling Q Net): 1511.06581
  8. A3C (Asynchronous Advantage Actor-Critic): 1602.01783
  9. ICM (Intrinsic Curiosity Module): 1705.05363
  10. I2A (Imagination-Augmented Agents): 1707.06203
  11. PPO (Proximal Policy Optimization): 1707.06347
  12. C51 (Categorical 51-Atom DQN): 1707.06887
  13. HER (Hindsight Experience Replay): 1707.01495
  14. MBMF (Model-Based RL with Model-Free Fine-Tuning): 1708.02596
  15. Rainbow (Combining Improvements in Deep Reinforcement Learning): 1710.02298
  16. QR-DQN (Quantile Regression DQN): 1710.10044
  17. AlphaZero : 1712.01815
  18. SAC (Soft Actor-Critic): 1801.01290
  19. TD3 (Twin Delayed DDPG): 1802.09477
  20. MBVE (Model-Based Value Expansion): 1803.00101
  21. World Models: 1803.10122
  22. IQN (Implicit Quantile Networks for Distributional Reinforcement Learning): 1806.06923
  23. SHER (Soft Hindsight Experience Replay): 2002.02089
  24. LAC (Actor-Critic with Stability Guarantee): 2004.14288
  25. AGAC (Adversarially Guided Actor-Critic): 2102.04376
  26. TATD3 (Twin actor twin delayed deep deterministic policy gradient learning for batch process control): 2102.13012
  27. SACHER (Soft Actor-Critic with Hindsight Experience Replay Approach): 2106.01016
  28. MHER (Model-based Hindsight Experience Replay): 2107.00306

Open Environments

Best Benchmarks for Reinforcement Learning: The Ultimate List

  • AI Habitat – Virtual embodiment; Photorealistic & efficient 3D simulator;
  • Behaviour Suite – Test core RL capabilities; Fundamental research; Evaluate generalization;
  • DeepMind Control Suite – Continuous control; Physics-based simulation; Creating environments;
  • DeepMind Lab – 3D navigation; Puzzle-solving;
  • DeepMind Memory Task Suite – Require memory; Evaluate generalization;
  • DeepMind Psychlab – Require memory; Evaluate generalization;
  • Google Research Football – Multi-task; Single-/Multi-agent; Creating environments;
  • Meta-World – Meta-RL; Multi-task;
  • MineRL – Imitation learning; Offline RL; 3D navigation; Puzzle-solving;
  • Multiagent emergence environments – Multi-agent; Creating environments; Emergence behavior;
  • OpenAI Gym – Continuous control; Physics-based simulation; Classic video games; RAM state as observations;
  • OpenAI Gym Retro – Classic video games; RAM state as observations;
  • OpenSpiel – Classic board games; Search and planning; Single-/Multi-agent;
  • Procgen Benchmark – Evaluate generalization; Procedurally-generated;
  • PyBullet Gymperium – Continuous control; Physics-based simulation; MuJoCo unpaid alternative;
  • Real-World Reinforcement Learning – Continuous control; Physics-based simulation; Adversarial examples;
  • RLCard – Classic card games; Search and planning; Single-/Multi-agent;
  • RL Unplugged – Offline RL; Imitation learning; Datasets for the common benchmarks;
  • Screeps – Compete with others; Sandbox; MMO for programmers;
  • Serpent.AI – Game Agent Framework – Turn ANY video game into the RL env;
  • StarCraft II Learning Environment – Rich action and observation spaces; Multi-agent; Multi-task;
  • The Unity Machine Learning Agents Toolkit (ML-Agents) – Create environments; Curriculum learning; Single-/Multi-agent; Imitation learning;
  • WordCraft – Test core capabilities; Commonsense knowledge;

OpenAI Gym

Reinforcement Learning Gym (健身房)
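
A minimal random-agent loop, assuming the classic Gym API (gym < 0.26, where reset() returns only the observation and step() returns a 4-tuple):

    import gym

    env = gym.make("CartPole-v1")
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()           # random action
        obs, reward, done, info = env.step(action)   # one environment transition
        total_reward += reward
    env.close()
    print("episode reward:", total_reward)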


Stable Baselines 3

RL Algorithms in PyTorch: A2C, DDPG, DQN, HER, PPO, SAC, TD3.
QR-DQN, TQC, and Maskable PPO are in SB3 Contrib.
SB3 examples
pip install stable-baselines3
For Ubuntu: pip install gym[atari]
For Win10: pip install --no-index -f https://github.com/Kojoley/atari-py/releases atari-py
Download and install Visual Studio 2015-2019 x86 and x64 from here
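
A minimal Stable-Baselines3 sketch (PPO on CartPole; the timestep count is arbitrary and the classic Gym API is assumed):

    import gym
    from stable_baselines3 import PPO

    env = gym.make("CartPole-v1")
    model = PPO("MlpPolicy", env, verbose=1)   # A2C, DQN, SAC, ... follow the same pattern
    model.learn(total_timesteps=10_000)        # train
    model.save("ppo_cartpole")                 # writes ppo_cartpole.zip

    obs = env.reset()
    for _ in range(200):                       # run the trained policy
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()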


Q Learning

Blog: A Hands-On Introduction to Deep Q-Learning using OpenAI Gym in Python


Blog: An introduction to Deep Q-Learning: let’s play Doom
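
A tabular Q-learning sketch on FrozenLake (classic Gym API; the hyperparameters are illustrative):

    import gym
    import numpy as np

    env = gym.make("FrozenLake-v1")
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    alpha, gamma, eps = 0.1, 0.99, 0.1

    for episode in range(5000):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next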


DQN

Paper: Playing Atari with Deep Reinforcement Learning

PyTorch Tutorial
Gym Cartpole: dqn.py
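
A sketch of the two core DQN pieces (not the tutorial's dqn.py): an MLP Q-network and the TD loss on a minibatch sampled from a replay buffer; the layer sizes and names are illustrative.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        def __init__(self, obs_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, 128), nn.ReLU(),
                nn.Linear(128, n_actions))
        def forward(self, x):
            return self.net(x)                       # Q(s, .) for every action

    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        s, a, r, s_next, done = batch                # tensors sampled from the replay buffer
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_next = target_net(s_next).max(dim=1).values
            target = r + gamma * (1 - done) * q_next
        return nn.functional.mse_loss(q_sa, target)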


DQN RoboCar

Blog: Deep Reinforcement Learning on ESP32
Code: Policy-Gradient-Network-Arduino


DQN for MPPT control

Paper: A Deep Reinforcement Learning-Based MPPT Control for PV Systems under Partial Shading Condition


DDQN

Paper: Deep Reinforcement Learning with Double Q-learning
Tutorial: Train a Mario-Playing RL Agent
Code: MadMario


Duel DQN

Paper: Dueling Network Architectures for Deep Reinforcement Learning

Double Duel Q Net

Code: mattbui/dd_qnet


A2C

Paper: Actor-Critic Algorithms

  • The “Critic” estimates the value function. This could be the action-value (the Q value) or state-value (the V value).
  • The “Actor” updates the policy distribution in the direction suggested by the Critic (such as with policy gradients).
  • A2C: Instead of having the Critic learn the Q values, we have it learn the Advantage values (see the loss sketch after this list).
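
A sketch of the A2C losses for one batch (the names are illustrative): the Critic regresses toward the return, and the Actor is updated with the advantage A = R − V(s).

    import torch
    import torch.nn.functional as F

    def a2c_losses(log_probs, values, returns, entropy, value_coef=0.5, entropy_coef=0.01):
        # log_probs: log pi(a_t|s_t);  values: V(s_t);  returns: bootstrapped returns R_t
        advantages = returns - values.detach()             # A_t = R_t - V(s_t)
        actor_loss = -(log_probs * advantages).mean()      # policy gradient weighted by the advantage
        critic_loss = F.mse_loss(values, returns)          # value regression
        return actor_loss + value_coef * critic_loss - entropy_coef * entropy.mean()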

A3C

Paper: Asynchronous Methods for Deep Reinforcement Learning
Blog: The idea behind Actor-Critics and how A2C and A3C improve them
Blog: 李宏毅_ATDL_Lecture_23


DDPG

Paper: Continuous control with deep reinforcement learning
Blog: Deep Deterministic Policy Gradients Explained
Blog: Artificial Intelligence - Deep Deterministic Policy Gradient (DDPG)
DDPG adds an experience replay buffer to A2C. Experiences are collected continuously during training, and a buffer size determines how many experiences are stored; once the buffer is full, each new experience evicts the oldest one, so the buffer stays full without letting unbounded data collection exhaust the computer's memory (a minimal buffer sketch follows below).
During learning, batches of experiences are randomly sampled from this buffer to train the DDPG networks; repeating this cycle eventually drives the networks to convergence (see the DDPG architecture diagram).
Code: End to end motion planner using Deep Deterministic Policy Gradient (DDPG) in gazebo
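
A minimal sketch of the replay buffer described above: fixed capacity, the oldest experience is dropped when full, and random minibatches are sampled for training.

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, buffer_size=100_000):
            self.buffer = deque(maxlen=buffer_size)   # the deque drops the oldest item automatically
        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))
        def sample(self, batch_size=64):
            return random.sample(self.buffer, batch_size)
        def __len__(self):
            return len(self.buffer)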


Intrinsic Curiosity Module (ICM)

Paper: Curiosity-driven Exploration by Self-supervised Prediction
Code: pathak22/noreward-rl


PPO

Paper: Proximal Policy Optimization
On-policy vs Off-policy
On-policy: the agent that learns and the agent that interacts with the environment are the same, so the parameters θ stay consistent (learning by doing).
Off-policy: the agent that learns and the agent that interacts with the environment are different, so their parameters θ may differ (learning by watching others act).
For example, in Go, on-policy means the agent plays and learns from its own games, while off-policy means one agent watches other agents play and learns from them. (A sketch of PPO's clipped objective follows below.)
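
PPO stays close to the data-collecting (old) policy by clipping the probability ratio between the new and old policies; a minimal sketch of the clipped surrogate loss (names are illustrative):

    import torch

    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
        ratio = torch.exp(log_probs_new - log_probs_old)   # pi_new(a|s) / pi_old(a|s)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()       # maximize the clipped surrogate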


TRPO

Paper: Trust Region Policy Optimization
Blog: Trust Region Policy Optimization explained
Both TRPO (Trust Region Policy Optimization) and PPO (Proximal Policy Optimization) belong to the family of MM (Minorize-Maximization) algorithms.


HER

Paper: Hindsight Experience Replay
Code: OpenAI HER


MBMF

Paper: Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning


SAC

Paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor


TD3

Paper: Addressing Function Approximation Error in Actor-Critic Methods
Code: sfujim/TD3
TD3 with RAMDP


POMDP (Partially-Observable Markov Decision Process)

Paper: Planning and acting in partially observable stochastic domains


SHER

Paper: Soft Hindsight Experience Replay


Exercises: RL-gym

Download and install Visual Studio 2015-2019 x86 and x64 from here

sudo apt-get install ffmpeg freeglut3-dev xvfb
pip install tensorflow
pip install pyglet==1.5.27
pip install stable_baselines3[extra]
pip install gym[all]
pip install autorom[accept-rom-license]
git clone https://github.com/rkuo2000/RL-gym
cd RL-gym
cd cartpole

~/RL-gym/cartpole

python3 random_action.py
python3 q_learning.py
python3 dqn.py


~/RL-gym/sb3/

algorithm = A2C, output = xxx.zip
python3 train.py LunarLander-v2 640000
python3 enjoy.py LunarLander-v2
python3 enjoy_gif.py LunarLander-v2


Atari

The env_name values are listed in Env_Name.txt
You can train on Kaggle, then download the .zip to play on a PC

python3 train_atari.py Pong-v0 1000000
python3 enjoy_atari.py Pong-v0
python3 enjoy_atari_gif.py Pong-v0


RL Baselines3 Zoo

PyBulletEnv
python enjoy.py --algo a2c --env AntBulletEnv-v0 --folder rl-trained-agents/ -n 5000
python enjoy.py --algo a2c --env HalfCheetahBulletEnv-v0 --folder rl-trained-agents/ -n 5000
python enjoy.py --algo a2c --env HopperBulletEnv-v0 --folder rl-trained-agents/ -n 5000
python enjoy.py --algo a2c --env Walker2DBulletEnv-v0 --folder rl-trained-agents/ -n 5000


Pybullet - Bullet Real-Time Physics Simulation


PyBullet-Gym

Code: rkuo2000/pybullet-gym

  • installation
    pip install gym
    pip install pybullet
    pip install stable-baselines3
    git clone https://github.com/rkuo2000/pybullet-gym
    export PYTHONPATH=$PYTHONPATH:/home/yourname/pybullet-gym
    

gym

Env names: Ant, Atlas, HalfCheetah, Hopper, Humanoid, HumanoidFlagrun, HumanoidFlagrunHarder, InvertedPendulum, InvertedDoublePendulum, InvertedPendulumSwingup, Reacher, Walker2D

Blog:
Creating OpenAI Gym Environments with PyBullet (Part 1)
Creating OpenAI Gym Environments with PyBullet (Part 2)


OpenAI Gym Environments for Donkey Car


Google Dopamine

Dopamine is a research framework for fast prototyping of reinforcement learning algorithms.
Dopamine supports the following agents, implemented with JAX: DQN, C51, Rainbow, IQN, SAC.


ViZDoom

sudo apt install cmake libboost-all-dev libsdl2-dev libfreetype6-dev libgl1-mesa-dev libglu1-mesa-dev libpng-dev libjpeg-dev libbz2-dev libfluidsynth-dev libgme-dev libopenal-dev zlib1g-dev timidity tar nasm
pip install vizdoom


AI in Games

Paper: AI in Games: Techniques, Challenges and Opportunities


AlphaGo

In March 2016, AlphaGo, an AI-powered machine, challenged the world Go champion Lee Sedol and won 4 to 1, overwhelmingly defeating the strongest human Go player in the world.
Paper: Mastering the game of Go with deep neural networks and tree search
Paper: Mastering the game of Go without human knowledge
Blog: Day 27 / DL x RL / The AlphaGo that amazed the world

The AlphaGo model consists of three main components:

  • Policy network: predicts the probability of each possible next move given the board position.
  • Value network: predicts the probability of eventually winning from the current board position, i.e., how favorable the position is for each side.
  • Monte Carlo tree search (MCTS): similar to reading several moves ahead in one's head, it estimates the value of each candidate move from the outcomes a few moves later.

  • Policy Networks: given an input state, output a probability for each action.
    AlphaGo contains three kinds of policy networks:
  • Supervised learning (SL) policy network
  • Reinforcement learning (RL) policy network
  • Rollout policy network

  • Value Network: predicts the win rate; the input is a state and the output is a win-rate value.
    This network can also be trained with supervised learning; the data are state-outcome pairs from historical games, and the loss is the mean squared error (MSE).

  • Monte Carlo Tree Search (MCTS): combines these networks to do planning and decide the next move during play (a selection-rule sketch follows this list).
    1. Selection: starting from the root, use the policy network's predicted move probabilities to choose which branch to explore. The selection also accounts for how many times each state-action pair has been visited, to avoid repeatedly following the same path and to balance exploration and exploitation. Repeat until the tree reaches max depth L.
    2. Expansion: at the leaf node sL reached at max depth, we want to estimate this node's chance of winning, so we first expand one more level below sL.
    3. Evaluation: each child node of sL runs a rollout, i.e., follows the actions predicted by the rollout policy network for a while to obtain an outcome z. The child node's win estimate is then a combination of the value network's prediction for that node and z.
    4. Backup: sL updates its win estimate from its children's estimates, and the update is propagated back so that every node from the root to sL updates its win estimate.
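
A minimal sketch of a PUCT-style selection rule for the Selection step above; the constant c_puct and the dictionary layout are illustrative, not AlphaGo's exact implementation.

    import math

    def select_action(children, c_puct=1.0):
        # children: {action: {"P": prior from the policy network, "N": visit count, "Q": mean value}}
        total_visits = sum(child["N"] for child in children.values())
        def puct(child):
            # exploitation (Q) plus an exploration bonus that decays with the visit count
            u = c_puct * child["P"] * math.sqrt(total_visits + 1) / (1 + child["N"])
            return child["Q"] + u
        return max(children, key=lambda a: puct(children[a]))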

AlphaZero

In October 2017, AlphaGo Zero defeated AlphaGo 100 to 0.
Blog: AlphaGo beat the world’s best Go player. He helped engineer the program that whipped AlphaGo.
Paper: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
AlphaGo uses two separate neural networks to estimate the policy function and the value function, whereas AlphaZero uses a single neural network with multiple outputs.
AlphaZero trains its policy by directly reducing the gap between the network's output and the move probabilities πₜ produced by the MCTS search, which makes it a regression problem, whereas the original AlphaGo trained its policy with an RL algorithm (policy gradient). (πₜ: the action probabilities after MCTS at that time step; a loss sketch follows below.)
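
A sketch of the AlphaZero-style training loss implied above: the policy head is regressed toward the MCTS search probabilities πₜ (cross-entropy) and the value head toward the game outcome z (MSE); the names are illustrative, and L2 regularization is left to the optimizer.

    import torch
    import torch.nn.functional as F

    def alphazero_loss(policy_logits, value, mcts_pi, outcome_z):
        # value, outcome_z: shape (batch,) or (batch, 1);  policy_logits, mcts_pi: shape (batch, num_moves)
        value_loss = F.mse_loss(value.squeeze(-1), outcome_z)                           # (z - v)^2
        policy_loss = -(mcts_pi * F.log_softmax(policy_logits, dim=-1)).sum(-1).mean()  # -pi^T log p
        return value_loss + policy_loss
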
Blog: 優拓 Paper Note ep.13: AlphaGo Zero
Blog: Monte Carlo Tree Search (MCTS) in AlphaGo Zero
Blog: The 3 Tricks That Made AlphaGo Zero Work

  1. MCTS with intelligent lookahead search
  2. Two-headed Neural Network Architecture
  3. Using residual neural network architecture 


AlphaZero with a Learned Model

Paper: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
RL can be divided into Model-Based RL (MBRL) and Model-Free RL (MFRL). Model-based RL uses an environment model for planning, whereas model-free RL learns the optimal policy directly from interactions. Model-based RL has achieved a superhuman level of performance in Chess, Go, and Shogi, where the model is given and the game requires sophisticated lookahead. However, model-free RL performs better in environments with high-dimensional observations where the model must be learned.


Minigo

Code: tensorflow minigo


ELF OpenGo

Code: https://github.com/pytorch/ELF
Blog: A new ELF OpenGo bot and analysis of historical Go games


Chess Zero

Code: Zeta36/chess-alpha-zero


AlphaStar

Blog: AlphaStar: Mastering the real-time strategy game StarCraft II
Blog: AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning
Code: PySC2 - StarCraft II Learning Environment


OpenAI Five at Dota2


DeepMind FTW


Texas Hold’em Poker

Code: fedden/poker_ai
Code: Pluribus Poker AI + poker table
Blog: Artificial Intelligence Masters The Game of Poker – What Does That Mean For Humans?


Suphx

Paper: 2003.13590
Blog: Microsoft's super Mahjong AI Suphx paper is released; the research team reveals the technical details in depth


DouZero

Paper: 2106.06135
Code: kwai/DouZero
Demo: douzero.org/


JueWu

Paper: Supervised Learning Achieves Human-Level Performance in MOBA Games: A Case Study of Honor of Kings
Blog: Tencent AI ‘Juewu’ Beats Top MOBA Gamers


StarCraft Commander

启元世界
Paper: SCC: an efficient deep reinforcement learning agent mastering the game of StarCraft II


Hanabi ToM

Paper: Theory of Mind for Deep Reinforcement Learning in Hanabi
Code: mwalton/ToM-hanabi-neurips19
Hanabi (from Japanese 花火, fireworks) is a cooperative card game created by French game designer Antoine Bauza and published in 2010.


MARL (Multi-Agent Reinforcement Learning)

Neural MMO

Paper: The Neural MMO Platform for Massively Multiagent Research
Blog: User Guide


Multi-Agent Locomotion

Paper: Emergent Coordination Through Competition
Code: Locomotion task library
Code: DeepMind MuJoCo Multi-Agent Soccer Environment


Unity ML-agents Toolkit

Code: Unity ML-Agents Toolkit

Blog: A hands-on introduction to deep reinforcement learning using Unity ML-Agents


DDPG Actor-Critic Reinforcement Learning Reacher Environment

Code: https://github.com/Remtasya/DDPG-Actor-Critic-Reinforcement-Learning-Reacher-Environment


Multi-Agent Mobile Manipulation

Paper: Spatial Intention Maps for Multi-Agent Mobile Manipulation
Code: jimmyyhwu/spatial-intention-maps


DeepMind Cultural Transmission

Paper: Learning few-shot imitation as cultural transmission
Blog: DeepMind introduces GoalCycle3D into agent training
Starting from imitation, deep reinforcement learning then continues to optimize and can even find solutions that surpass the demonstrator, showing that an AI agent can learn by observing and imitating other agents.
This ability to acquire and exploit information in real time, starting from zero-shot, comes very close to the way humans accumulate and refine knowledge.


Imitation Learning

Blog: A brief overview of Imitation Learning


Self-Imitation Learning

Directly use past good experiences to train the current policy (a loss sketch follows below).
Paper: Self-Imitation Learning
Code: junhyukoh/self-imitation-learning
Blog: [Paper Notes 2] Self-Imitation Learning
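
A sketch of the self-imitation losses described in the paper: only transitions whose return R exceeds the current value estimate V contribute, via the clipped advantage (R − V)₊; the coefficient and names are illustrative.

    import torch

    def sil_losses(log_probs, values, returns, value_coef=0.01):
        advantage = torch.clamp(returns - values, min=0.0)        # (R - V)+ : keep only good experiences
        policy_loss = -(log_probs * advantage.detach()).mean()    # imitate actions that did better than expected
        value_loss = 0.5 * (advantage ** 2).mean()                # push V(s) up toward the good returns
        return policy_loss + value_coef * value_loss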


Self-Imitation Learning by Planning

Paper: Self-Imitation Learning by Planning


Surgical Robotics

Paper: Open-Sourced Reinforcement Learning Environments for Surgical Robotics
Code: RL Environments for the da Vinci Surgical System


Meta Learning (Learning to Learn)

Blog: Meta-Learning: Learning to Learn Fast

Meta-Learning Survey

Paper: Meta-Learning in Neural Networks: A Survey


MAML (Model-Agnostic Meta-Learning)

Paper: Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Code: cbfinn/maml_rl


Reptile

Paper: On First-Order Meta-Learning Algorithms
Code: openai/supervised-reptile


MAML++

Paper: How to train your MAML
Code: AntreasAntoniou/HowToTrainYourMAMLPytorch
Blog: Meta-learning: from MAML to MAML++


Paper: First-order Meta-Learned Initialization for Faster Adaptation in Deep Reinforcement Learning


FAMLE (Fast Adaption by Meta-Learning Embeddings)

Paper: Fast Online Adaptation in Robotics through Meta-Learning Embeddings of Simulated Priors


Bootstrapped Meta-Learning

Paper: Bootstrapped Meta-Learning
Blog: DeepMind’s Bootstrapped Meta-Learning Enables Meta Learners to Teach Themselves


Unsupervised Learning

Understanding the World Through Action

Blog: Understanding the World Through Action: RL as a Foundation for Scalable Self-Supervised Learning
Paper: Understanding the World Through Action
Actionable Models: a self-supervised real-world robotic manipulation system trained with offline RL, performing various goal-reaching tasks. Actionable Models can also serve as general pretraining that accelerates the acquisition of downstream tasks specified via conventional rewards.


RL-Stock

Kaggle: https://www.kaggle.com/rkuo2000/stock-lstm
Kaggle: https://kaggle.com/rkuo2000/stock-dqn


Stock Trading

Blog: Predicting Stock Prices using Reinforcement Learning (with Python Code!)

Code: DQN-DDPG_Stock_Trading
Code: FinRL
Blog: Automated stock trading using Deep Reinforcement Learning with Fundamental Indicators


FinRL

Papers:
2010.14194: Learning Financial Asset-Specific Trading Rules via Deep Reinforcement Learning
2011.09607: FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance
2101.03867: A Reinforcement Learning Based Encoder-Decoder Framework for Learning Stock Trading Rules
2106.00123: Deep Reinforcement Learning in Quantitative Algorithmic Trading: A Review
2111.05188: FinRL-Podracer: High Performance and Scalable Deep Reinforcement Learning for Quantitative Finance
2112.06753: FinRL-Meta: A Universe of Near-Real Market Environments for Data-Driven Deep Reinforcement Learning in Quantitative Finance

Blog: FinRL-Meta: A Universe of Near Real-Market Environments for Data-Driven Financial Reinforcement Learning
Code: DQN-DDPG_Stock_Trading
Code: FinRL


