Course Overview
Deep Reinforcement Learning: Principles, Algorithms, and Applications
Course Objectives
- Slide-based explanations of the algorithms, combined with code analysis
- In-depth coverage of the design, characteristics, similarities, and differences of reinforcement learning algorithms
- Real-world application examples and analysis of industry trends
- Walkthroughs of reinforcement learning demo code
Target Audience
1. Engineers interested in the principles and applications of reinforcement learning, with a working knowledge of programming (Python) and mathematics (linear algebra, probability theory).
2. Some familiarity with deep learning models is preferred.
Course Content
Environment requirements:
- Python 3.5 or later
- GPU: Nvidia GTX 960 or better
Course Outline
1. Introduction to Reinforcement Learning
- Characteristics of Reinforcement Learning
- Reinforcement Learning Examples
- Components of Reinforcement Learning
  * Rewards
  * Environment
  * History and State
  * Observation
  * Agent: Policy, Value, Model
- Case study: learning a maze
- A Taxonomy of Reinforcement Learning
  * Value Based
  * Policy Based
  * Actor-Critic
  * Model Free vs. Model Based
- Sequential Decision Making in Reinforcement Learning
- Learning and Planning
- Case study: Atari video games
- Exploration and Exploitation
- Prediction and Control
2. Markov Decision Processes (MDP)
- Markov Processes
- Markov Reward Processes
- Markov Decision Processes
- Extensions to MDPs
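The central quantity in a Markov reward process is the discounted return. A minimal sketch of how it is computed (the reward sequence and discount factor below are illustrative, not from the course materials):

```python
def discounted_return(rewards, gamma):
    """Compute G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for a reward sequence."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: a three-step episode with discount factor 0.9
# 1 + 0.9*0 + 0.81*2 = 2.62
g = discounted_return([1.0, 0.0, 2.0], gamma=0.9)
```

The backwards recursion is the same Bellman-style decomposition the module builds on: the return from a state is the immediate reward plus the discounted return from the successor.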
3. Planning by Dynamic Programming
- Policy Evaluation
- Policy Iteration
- Value Iteration
- Extensions to DP
- Contraction Mapping
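Value iteration, one of the module's core algorithms, can be sketched in a few lines. The two-state MDP below is an illustrative toy example (not from the course materials); the stopping rule relies on the contraction-mapping property covered in this module:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration on a small finite MDP.

    P: transition probabilities, shape (A, S, S)
    R: expected immediate rewards, shape (A, S)
    Returns the optimal state values V and a greedy policy.
    """
    V = np.zeros(P.shape[1])
    while True:
        # Bellman optimality backup: Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] V[s']
        Q = R + gamma * P @ V
        V_new = Q.max(axis=0)
        # The backup is a gamma-contraction, so this loop converges
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# Toy 2-state MDP: action 1 in state 0 moves to the rewarding state 1
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay put
              [[0.0, 1.0], [0.0, 1.0]]])  # action 1: go to state 1
R = np.array([[0.0, 1.0],                 # rewards for action 0 in states 0, 1
              [0.0, 1.0]])                # rewards for action 1 in states 0, 1
V, policy = value_iteration(P, R)
```

Here state 1 pays reward 1 forever, so V(1) = 1/(1-gamma) = 10 and V(0) = gamma * V(1) = 9, with the greedy policy choosing action 1 in state 0.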
4. Model-Free Prediction
- Monte-Carlo Learning
- Temporal-Difference Learning
- TD(λ) Learning
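The TD(0) update at the heart of temporal-difference learning fits in a few lines. A minimal tabular sketch on an illustrative three-state chain (environment and hyperparameters are assumptions for the demo, not from the course):

```python
def td0_evaluate(episodes, alpha=0.1, gamma=1.0, n_states=3):
    """Tabular TD(0) policy evaluation.

    episodes: list of episodes, each a list of (state, reward, next_state_or_None)
    transitions. Applies V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    """
    V = [0.0] * n_states
    for episode in episodes:
        for s, r, s_next in episode:
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])  # TD error times step size
    return V

# Deterministic chain 0 -> 1 -> 2 (terminal), reward 1 on the final step
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
V = td0_evaluate([episode] * 200)
```

Unlike Monte-Carlo learning, which waits for the full return, TD(0) bootstraps from the current estimate of the next state; after enough sweeps all three values approach 1.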
5. Model-Free Control
- On-Policy Monte-Carlo Control
- On-Policy Temporal-Difference Learning
- Off-Policy Learning
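Q-learning is the canonical off-policy control method from this module: the behaviour policy is epsilon-greedy, while the update bootstraps from the greedy (max) action. A minimal tabular sketch on an illustrative corridor environment (states, rewards, and hyperparameters are assumptions for the demo):

```python
import random

def q_learning(n_states=4, n_actions=2, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a corridor: action 1 moves right, action 0 moves left.
    Reaching the right end (state n_states - 1) yields reward 1 and ends the episode."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy behaviour policy
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[s][a_])
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # off-policy target: max over next-state actions (Q-learning)
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(3)]
```

Replacing the max in the target with the action actually taken next would give SARSA, the on-policy TD control method; comparing the two updates is a useful exercise for this module.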
6. Value Function Approximation
- Incremental Methods
- Batch Methods
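The simplest incremental method is semi-gradient TD(0) with a linear value function v(s) = w·x(s). A minimal sketch with illustrative one-hot features (the chain and step sizes are assumptions for the demo):

```python
import numpy as np

def semi_gradient_td(transitions, n_features, alpha=0.05, gamma=1.0, sweeps=500):
    """Semi-gradient TD(0) with a linear value function v(s) = w . x(s).

    transitions: list of (features, reward, next_features_or_None).
    Update: w <- w + alpha * (r + gamma * w.x' - w.x) * x,
    since the gradient of v with respect to w is just the feature vector x.
    """
    w = np.zeros(n_features)
    for _ in range(sweeps):
        for x, r, x_next in transitions:
            v_next = w @ x_next if x_next is not None else 0.0
            td_error = r + gamma * v_next - w @ x
            w += alpha * td_error * x  # semi-gradient: target treated as constant
    return w

# Two-state chain with one-hot features; reward 1 on the terminal transition
x0, x1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
transitions = [(x0, 0.0, x1), (x1, 1.0, None)]
w = semi_gradient_td(transitions, n_features=2)
```

With one-hot features this reduces exactly to tabular TD(0); the same code handles arbitrary feature vectors, which is the point of function approximation.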
7. 策略梯度法 Policy Gradient |
- 有限差分政策梯度 Finite Difference Policy Gradient - 蒙特卡洛策略梯度 Monte-Carlo Policy Gradient - AC策略梯度 Actor-Critic Policy Gradient * Proximal Policy Optimization (PPO) - the default reinforcement learning algorithm at OpenAI * On-Policy v.s. Off-policy: Importance Sampling - Issue of Importance Sampling - On-Policy -> Off-policy - Add Constraint * PPO / TRPO * Q-Learning - Critic - Target Network - Replay Buffer - Tips of Q-Learning - Double DQN - Dueling DQN - Prioritized Reply - Noisy Net - Distributed Q-function - Rainbow - Q-Learning for Continuous Actions * Actor-Critic - A3C - Advantage Actor-Critic - Path-wise Derivative Policy Gradient * Imitation Learning - Behavior Cloning * Inverse Reinforcement Learning (IRL) - Framework of IRL - IRL and GAN * Sparse Reward - Curiosity - Curriculum Learning - Hierarchical Reinforcement Learning |
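REINFORCE, the Monte-Carlo policy gradient method that opens this module, can be sketched on a one-step problem small enough to verify by hand. The two-action bandit, reward means, and hyperparameters below are illustrative assumptions, not from the course:

```python
import math
import random

def reinforce_bandit(reward_means=(0.2, 0.8), episodes=2000, alpha=0.1, seed=0):
    """REINFORCE (Monte-Carlo policy gradient) on a one-step, two-action problem.

    Policy is a softmax over preferences theta. For the chosen action a,
    d/d theta[a] log pi(a) = 1 - pi(a); for the other action it is -pi(other).
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(episodes):
        # softmax action probabilities
        z = [math.exp(t) for t in theta]
        pi = [p / sum(z) for p in z]
        a = 0 if rng.random() < pi[0] else 1
        g = reward_means[a] + rng.gauss(0.0, 0.1)  # noisy sampled return
        # policy-gradient ascent: theta += alpha * G * grad log pi(a)
        for i in range(2):
            grad_log = (1.0 - pi[i]) if i == a else -pi[i]
            theta[i] += alpha * g * grad_log
    z = [math.exp(t) for t in theta]
    return [p / sum(z) for p in z]

pi = reinforce_bandit()
```

After training, the policy concentrates on the higher-reward action. Actor-critic methods and PPO, covered later in the module, replace the raw sampled return G with an estimated advantage and constrain the update respectively.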
8. Integrating Learning and Planning
- Model-Based Reinforcement Learning
- Integrated Architectures
- Simulation-Based Search
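Dyna-Q is a compact example of an integrated architecture: each real step does direct Q-learning, records the transition in a learned model, and then replays simulated transitions from that model as planning. The corridor environment and hyperparameters below are illustrative assumptions for the demo:

```python
import random

def dyna_q(n_states=4, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Dyna-Q on an illustrative corridor: action 1 moves right, action 0 left;
    the right end is terminal with reward 1."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    model = {}  # (s, a) -> (r, s_next): learned deterministic model

    def step(s, a):
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return (1.0 if s_next == n_states - 1 else 0.0), s_next

    def update(s, a, r, s_next):
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = rng.randrange(2) if rng.random() < epsilon else max((0, 1), key=lambda a_: Q[s][a_])
            r, s_next = step(s, a)
            update(s, a, r, s_next)          # direct reinforcement learning
            model[(s, a)] = (r, s_next)      # model learning
            for _ in range(planning_steps):  # planning from simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps_next = model[(ps, pa)]
                update(ps, pa, pr, ps_next)
            s = s_next
    return Q

Q = dyna_q()
```

The planning loop reuses the same update rule as direct learning, which is what makes the architecture "integrated": real and simulated experience are interchangeable.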
9. Exploration and Exploitation
- Multi-Armed Bandits
- Contextual Bandits
- MDPs
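The multi-armed bandit topic is concrete enough for a short sketch of UCB1, which adds an optimism bonus to each arm's estimated mean; the arm means and horizon below are illustrative assumptions:

```python
import math
import random

def ucb1(means=(0.1, 0.5, 0.8), steps=5000, seed=0):
    """UCB1 on a Bernoulli multi-armed bandit: pick the arm maximising
    (empirical mean) + sqrt(2 ln t / n), trading off exploration and exploitation."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    totals = [0.0] * k
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1  # initialise: play each arm once
        else:
            a = max(range(k), key=lambda i: totals[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        # Bernoulli reward drawn with the arm's true mean
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        totals[a] += r
    return counts

counts = ucb1()
```

Over time the best arm dominates the pull counts, while the logarithmic bonus guarantees every arm is still tried occasionally; contextual bandits and full MDPs generalise the same dilemma.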
10. Reinforcement Learning in Games
- Game Theory Overview
- Minimax Search
- Self-Play Reinforcement Learning
- Combining Reinforcement Learning with Minimax Search
- RL in Imperfect-Information Games
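Minimax search, the starting point of this module, reduces to a short recursion once the game tree is given. The nested-list tree below is an illustrative toy example:

```python
def minimax(state, maximizing):
    """Plain minimax over a game tree given as nested lists;
    leaves are numeric payoffs for the maximizing player."""
    if not isinstance(state, list):
        return state  # leaf: return its payoff
    values = [minimax(child, not maximizing) for child in state]
    return max(values) if maximizing else min(values)

# Depth-2 toy tree: the maximizer picks a branch, then the minimizer picks a leaf.
# Branch values are min(3, 12) = 3 and min(8, 2) = 2, so the maximizer gets 3.
tree = [[3, 12], [8, 2]]
best = minimax(tree, maximizing=True)
```

Self-play reinforcement learning, covered later in the module, can be seen as learning the leaf evaluations that exhaustive minimax would otherwise need a full tree for.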