Course Fee

7,800.00 / person

Course Duration

3


Course Overview

Deep Reinforcement Learning: Principles, Algorithms, and Applications

Learning Objectives

- Slide-based explanations of the algorithms, combined with code analysis
- In-depth coverage of the design, characteristics, and differences of the various reinforcement learning algorithms
- Worked examples from real applications, combined with analysis of industry trends
- Analysis of demonstration implementations of reinforcement learning algorithms

Target Audience

1. Engineers interested in the principles and applications of reinforcement learning algorithms, with basic programming skills (Python) and a mathematics background (linear algebra, probability theory).
2. Some familiarity with deep learning models is preferred.

Course Content

Environment requirements:
- Python 3.5 or later
- GPU: Nvidia GTX 960 or better
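A quick way to sanity-check the software side of these requirements before the course. This is a minimal sketch: the GPU check merely looks for the `nvidia-smi` tool on the PATH as a rough proxy for an installed NVIDIA driver, which is an illustrative assumption rather than a full hardware test.

```python
import shutil
import sys

def check_environment(min_version=(3, 5)):
    """Return a dict describing whether this machine meets the course requirements."""
    return {
        # Course requires Python 3.5 or later
        "python_ok": sys.version_info[:2] >= min_version,
        # nvidia-smi on PATH is a rough proxy for an NVIDIA GPU driver
        "nvidia_driver_found": shutil.which("nvidia-smi") is not None,
    }

if __name__ == "__main__":
    print(check_environment())
```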

Course Outline

1. Introduction to Reinforcement Learning
   - Characteristics of Reinforcement Learning
   - Reinforcement Learning case studies
   - Components of Reinforcement Learning
     - Rewards
     - Environment
     - History and State
     - Observation
     - Agent: Policy, Value, Model
   - Case study: maze learning
   - Categories of Reinforcement Learning
     - Value Based
     - Policy Based
     - Actor-Critic
     - Model Free vs. Model Based
   - Sequential decision making in Reinforcement Learning
     - Learning and Planning
     - Case study: Atari video games
     - Exploration and Exploitation
     - Prediction and Control
2. Markov Decision Processes (MDP)
   - Markov Processes
   - Markov Reward Processes
   - Markov Decision Processes
   - Extensions to MDPs
3. Planning by Dynamic Programming
   - Policy Evaluation
   - Policy Iteration
   - Value Iteration
   - Extensions to DP
   - Contraction Mapping
4. Model-Free Prediction
   - Monte-Carlo Learning
   - Temporal-Difference Learning
   - TD(λ) Learning
5. Model-Free Control
   - On-Policy Monte-Carlo Control
   - On-Policy Temporal-Difference Learning
   - Off-Policy Learning
6. Value Function Approximation
   - Incremental Methods
   - Batch Methods
7. Policy Gradient
   - Finite Difference Policy Gradient
   - Monte-Carlo Policy Gradient
   - Actor-Critic Policy Gradient

   * Proximal Policy Optimization (PPO)
     - the default reinforcement learning algorithm at OpenAI
   * On-Policy vs. Off-Policy: Importance Sampling
     - Issues with Importance Sampling
     - From On-Policy to Off-Policy
     - Adding a Constraint
   * PPO / TRPO
   * Q-Learning
     - Critic
     - Target Network
     - Replay Buffer
     - Tips for Q-Learning
       - Double DQN
       - Dueling DQN
       - Prioritized Replay
       - Noisy Net
       - Distributional Q-function
       - Rainbow
     - Q-Learning for Continuous Actions
   * Actor-Critic
     - A3C
     - Advantage Actor-Critic
     - Pathwise Derivative Policy Gradient
   * Imitation Learning
     - Behavior Cloning
   * Inverse Reinforcement Learning (IRL)
     - Framework of IRL
     - IRL and GAN
   * Sparse Reward
     - Curiosity
     - Curriculum Learning
     - Hierarchical Reinforcement Learning
8. Integrating Learning and Planning
   - Model-Based Reinforcement Learning
   - Integrated Architectures
   - Simulation-Based Search
9. Exploration and Exploitation
   - Multi-Armed Bandits
   - Contextual Bandits
   - MDPs
10. Reinforcement Learning in Games
    - Overview of game theory
    - Minimax Search
    - Self-Play Reinforcement Learning
    - Combining Reinforcement Learning with Minimax Search
    - RL in Imperfect-Information Games
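As a small taste of the tabular Q-Learning material in the outline, here is a minimal, self-contained sketch on a hypothetical five-state corridor (move left or right, reward only at the rightmost terminal state). The environment and hyperparameters are illustrative assumptions, not course material.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain: start at state 0, actions are
    left/right (clamped at the edges), reward +1 on reaching the terminal
    rightmost state. Returns the learned Q-table Q[state][action]."""
    rng = random.Random(seed)
    actions = (-1, +1)  # 0 = left, 1 = right
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda i: Q[s][i])
            s2 = min(max(s + actions[a], 0), n_states - 1)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # bootstrap target is 0 at the terminal state
            target = r + gamma * (0.0 if s2 == n_states - 1 else max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

if __name__ == "__main__":
    Q = q_learning_chain()
    for s, row in enumerate(Q):
        print(s, ["left", "right"][max((0, 1), key=lambda i: row[i])], row)
```

After training, the greedy action in every non-terminal state should be "right", with Q-values decaying by gamma per step of distance from the goal.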
