Course Overview
Deep Reinforcement Learning: Principles, Algorithms, and Applications
Course Objectives
- Slide-based explanations of the algorithms, combined with code analysis
- In-depth coverage of the design, characteristics, similarities, and differences of reinforcement learning algorithms
- Real-world application examples and analysis of industry trends
- Walkthroughs of reinforcement learning demo code
Target Audience
1. Engineers interested in the principles and applications of reinforcement learning, with a working knowledge of programming (Python) and mathematics (linear algebra, probability theory).
2. Some familiarity with deep learning models is preferred.
Course Content
Environment requirements:
- Python 3.5 or later
- GPU: Nvidia GTX 960 or better
Course Outline
1. Introduction to Reinforcement Learning
- Characteristics of Reinforcement Learning
- Reinforcement Learning Examples
- Components of Reinforcement Learning
  * Rewards
  * Environment
  * History and State
  * Observation
  * Agent: Policy, Value, Model
- Case study: learning a maze
- A Taxonomy of Reinforcement Learning
  * Value Based
  * Policy Based
  * Actor-Critic
  * Model Free vs. Model Based
- Sequential Decision Making in Reinforcement Learning
- Learning and Planning
- Case study: Atari video games
- Exploration and Exploitation
- Prediction and Control
2. Markov Decision Processes (MDP)
- Markov Processes
- Markov Reward Processes
- Markov Decision Processes
- Extensions to MDPs
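The central quantity in a Markov reward process is the discounted return. A minimal sketch of how it is computed (the reward sequence and discount factor below are illustrative, not from the course materials):

```python
def discounted_return(rewards, gamma):
    """Compute G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for a reward sequence."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: a three-step episode with discount factor 0.9
# 1 + 0.9*0 + 0.81*2 = 2.62
g = discounted_return([1.0, 0.0, 2.0], gamma=0.9)
```

The backwards recursion is the same Bellman-style decomposition the module builds on: the return from a state is the immediate reward plus the discounted return from the successor.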
3. Planning by Dynamic Programming
- Policy Evaluation
- Policy Iteration
- Value Iteration
- Extensions to DP
- Contraction Mapping
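Value iteration, one of the module's core algorithms, can be sketched in a few lines. The two-state MDP below is an illustrative toy example (not from the course materials); the stopping rule relies on the contraction-mapping property covered in this module:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration on a small finite MDP.

    P: transition probabilities, shape (A, S, S)
    R: expected immediate rewards, shape (A, S)
    Returns the optimal state values V and a greedy policy.
    """
    V = np.zeros(P.shape[1])
    while True:
        # Bellman optimality backup: Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] V[s']
        Q = R + gamma * P @ V
        V_new = Q.max(axis=0)
        # The backup is a gamma-contraction, so this loop converges
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# Toy 2-state MDP: action 1 in state 0 moves to the rewarding state 1
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay put
              [[0.0, 1.0], [0.0, 1.0]]])  # action 1: go to state 1
R = np.array([[0.0, 1.0],                 # rewards for action 0 in states 0, 1
              [0.0, 1.0]])                # rewards for action 1 in states 0, 1
V, policy = value_iteration(P, R)
```

Here state 1 pays reward 1 forever, so V(1) = 1/(1-gamma) = 10 and V(0) = gamma * V(1) = 9, with the greedy policy choosing action 1 in state 0.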
4. Model-Free Prediction
- Monte-Carlo Learning
- Temporal-Difference Learning
- TD(λ) Learning
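The TD(0) update at the heart of temporal-difference learning fits in a few lines. A minimal tabular sketch on an illustrative three-state chain (environment and hyperparameters are assumptions for the demo, not from the course):

```python
def td0_evaluate(episodes, alpha=0.1, gamma=1.0, n_states=3):
    """Tabular TD(0) policy evaluation.

    episodes: list of episodes, each a list of (state, reward, next_state_or_None)
    transitions. Applies V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
    """
    V = [0.0] * n_states
    for episode in episodes:
        for s, r, s_next in episode:
            target = r + (gamma * V[s_next] if s_next is not None else 0.0)
            V[s] += alpha * (target - V[s])  # TD error times step size
    return V

# Deterministic chain 0 -> 1 -> 2 (terminal), reward 1 on the final step
episode = [(0, 0.0, 1), (1, 0.0, 2), (2, 1.0, None)]
V = td0_evaluate([episode] * 200)
```

Unlike Monte-Carlo learning, which waits for the full return, TD(0) bootstraps from the current estimate of the next state; after enough sweeps all three values approach 1.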
5. Model-Free Control
- On-Policy Monte-Carlo Control
- On-Policy Temporal-Difference Learning
- Off-Policy Learning
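Q-learning is the canonical off-policy control method from this module: the behaviour policy is epsilon-greedy, while the update bootstraps from the greedy (max) action. A minimal tabular sketch on an illustrative corridor environment (states, rewards, and hyperparameters are assumptions for the demo):

```python
import random

def q_learning(n_states=4, n_actions=2, episodes=500,
               alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a corridor: action 1 moves right, action 0 moves left.
    Reaching the right end (state n_states - 1) yields reward 1 and ends the episode."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy behaviour policy
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[s][a_])
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # off-policy target: max over next-state actions (Q-learning)
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(3)]
```

Replacing the max in the target with the action actually taken next would give SARSA, the on-policy TD control method; comparing the two updates is a useful exercise for this module.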
6. Value Function Approximation
- Incremental Methods
- Batch Methods
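The simplest incremental method is semi-gradient TD(0) with a linear value function v(s) = w·x(s). A minimal sketch with illustrative one-hot features (the chain and step sizes are assumptions for the demo):

```python
import numpy as np

def semi_gradient_td(transitions, n_features, alpha=0.05, gamma=1.0, sweeps=500):
    """Semi-gradient TD(0) with a linear value function v(s) = w . x(s).

    transitions: list of (features, reward, next_features_or_None).
    Update: w <- w + alpha * (r + gamma * w.x' - w.x) * x,
    since the gradient of v with respect to w is just the feature vector x.
    """
    w = np.zeros(n_features)
    for _ in range(sweeps):
        for x, r, x_next in transitions:
            v_next = w @ x_next if x_next is not None else 0.0
            td_error = r + gamma * v_next - w @ x
            w += alpha * td_error * x  # semi-gradient: target treated as constant
    return w

# Two-state chain with one-hot features; reward 1 on the terminal transition
x0, x1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
transitions = [(x0, 0.0, x1), (x1, 1.0, None)]
w = semi_gradient_td(transitions, n_features=2)
```

With one-hot features this reduces exactly to tabular TD(0); the same code handles arbitrary feature vectors, which is the point of function approximation.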
7. 策略梯度法 Policy Gradient |
- 有限差分政策梯度 Finite Difference Policy Gradient - 蒙特卡洛策略梯度 Monte-Carlo Policy Gradient - AC策略梯度 Actor-Critic Policy Gradient * Proximal Policy Optimization (PPO) - the default reinforcement learning algorithm at OpenAI * On-Policy v.s. Off-policy: Importance Sampling - Issue of Importance Sampling - On-Policy -> Off-policy - Add Constraint * PPO / TRPO * Q-Learning - Critic - Target Network - Replay Buffer - Tips of Q-Learning - Double DQN - Dueling DQN - Prioritized Reply - Noisy Net - Distributed Q-function - Rainbow - Q-Learning for Continuous Actions * Actor-Critic - A3C - Advantage Actor-Critic - Path-wise Derivative Policy Gradient * Imitation Learning - Behavior Cloning * Inverse Reinforcement Learning (IRL) - Framework of IRL - IRL and GAN * Sparse Reward - Curiosity - Curriculum Learning - Hierarchical Reinforcement Learning |
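REINFORCE, the Monte-Carlo policy gradient method that opens this module, can be sketched on a one-step problem small enough to verify by hand. The two-action bandit, reward means, and hyperparameters below are illustrative assumptions, not from the course:

```python
import math
import random

def reinforce_bandit(reward_means=(0.2, 0.8), episodes=2000, alpha=0.1, seed=0):
    """REINFORCE (Monte-Carlo policy gradient) on a one-step, two-action problem.

    Policy is a softmax over preferences theta. For the chosen action a,
    d/d theta[a] log pi(a) = 1 - pi(a); for the other action it is -pi(other).
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(episodes):
        # softmax action probabilities
        z = [math.exp(t) for t in theta]
        pi = [p / sum(z) for p in z]
        a = 0 if rng.random() < pi[0] else 1
        g = reward_means[a] + rng.gauss(0.0, 0.1)  # noisy sampled return
        # policy-gradient ascent: theta += alpha * G * grad log pi(a)
        for i in range(2):
            grad_log = (1.0 - pi[i]) if i == a else -pi[i]
            theta[i] += alpha * g * grad_log
    z = [math.exp(t) for t in theta]
    return [p / sum(z) for p in z]

pi = reinforce_bandit()
```

After training, the policy concentrates on the higher-reward action. Actor-critic methods and PPO, covered later in the module, replace the raw sampled return G with an estimated advantage and constrain the update respectively.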
8. Integrating Learning and Planning
- Model-Based Reinforcement Learning
- Integrated Architectures
- Simulation-Based Search
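Dyna-Q is a compact example of an integrated architecture: each real step does direct Q-learning, records the transition in a learned model, and then replays simulated transitions from that model as planning. The corridor environment and hyperparameters below are illustrative assumptions for the demo:

```python
import random

def dyna_q(n_states=4, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Dyna-Q on an illustrative corridor: action 1 moves right, action 0 left;
    the right end is terminal with reward 1."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    model = {}  # (s, a) -> (r, s_next): learned deterministic model

    def step(s, a):
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return (1.0 if s_next == n_states - 1 else 0.0), s_next

    def update(s, a, r, s_next):
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = rng.randrange(2) if rng.random() < epsilon else max((0, 1), key=lambda a_: Q[s][a_])
            r, s_next = step(s, a)
            update(s, a, r, s_next)          # direct reinforcement learning
            model[(s, a)] = (r, s_next)      # model learning
            for _ in range(planning_steps):  # planning from simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps_next = model[(ps, pa)]
                update(ps, pa, pr, ps_next)
            s = s_next
    return Q

Q = dyna_q()
```

The planning loop reuses the same update rule as direct learning, which is what makes the architecture "integrated": real and simulated experience are interchangeable.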
9. Exploration and Exploitation
- Multi-Armed Bandits
- Contextual Bandits
- MDPs
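The multi-armed bandit topic is concrete enough for a short sketch of UCB1, which adds an optimism bonus to each arm's estimated mean; the arm means and horizon below are illustrative assumptions:

```python
import math
import random

def ucb1(means=(0.1, 0.5, 0.8), steps=5000, seed=0):
    """UCB1 on a Bernoulli multi-armed bandit: pick the arm maximising
    (empirical mean) + sqrt(2 ln t / n), trading off exploration and exploitation."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    totals = [0.0] * k
    for t in range(1, steps + 1):
        if t <= k:
            a = t - 1  # initialise: play each arm once
        else:
            a = max(range(k), key=lambda i: totals[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        # Bernoulli reward drawn with the arm's true mean
        r = 1.0 if rng.random() < means[a] else 0.0
        counts[a] += 1
        totals[a] += r
    return counts

counts = ucb1()
```

Over time the best arm dominates the pull counts, while the logarithmic bonus guarantees every arm is still tried occasionally; contextual bandits and full MDPs generalise the same dilemma.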
10. Reinforcement Learning in Games
- Game Theory Overview
- Minimax Search
- Self-Play Reinforcement Learning
- Combining Reinforcement Learning with Minimax Search
- RL in Imperfect-Information Games
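Minimax search, the starting point of this module, reduces to a short recursion once the game tree is given. The nested-list tree below is an illustrative toy example:

```python
def minimax(state, maximizing):
    """Plain minimax over a game tree given as nested lists;
    leaves are numeric payoffs for the maximizing player."""
    if not isinstance(state, list):
        return state  # leaf: return its payoff
    values = [minimax(child, not maximizing) for child in state]
    return max(values) if maximizing else min(values)

# Depth-2 toy tree: the maximizer picks a branch, then the minimizer picks a leaf.
# Branch values are min(3, 12) = 3 and min(8, 2) = 2, so the maximizer gets 3.
tree = [[3, 12], [8, 2]]
best = minimax(tree, maximizing=True)
```

Self-play reinforcement learning, covered later in the module, can be seen as learning the leaf evaluations that exhaustive minimax would otherwise need a full tree for.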