PN-28: Sub-policy Adaptation for HRL (arXiv 1906)

Title: Sub-policy Adaptation for Hierarchical Reinforcement Learning (NIPS 2019, under review)

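Core idea:

HRL is one of the key approaches to sparse-reward, long-horizon problems. In most HRL methods, skill acquisition and high-level controller training are decoupled, and when low-level skills are reused in a new task they are usually kept fixed. Training a new high-level controller on top of fixed skills leaves those skills sub-optimal for the new task.

This paper derives an approximation to the policy gradient theorem for the hierarchical setting, which yields a new way to train hierarchical policies that avoids relying on prior knowledge or hand-crafted design and adapts easily to new tasks.

Building on PPO, the paper proposes HiPPO. It also proposes baselines tailored to the hierarchical policy gradient to reduce variance, plus two further mechanisms, varying time-commitment and an information bottleneck on the low-level observation, intended to improve the quality, diversity, and transferability of the learned skills.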

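Main content:

1) Some background

a. HRL work is mostly built on temporal abstraction. The usual architecture has a high-level and a low-level controller; the high level makes decisions on a coarser time scale and represents a more abstract, macro-level policy.

b. A large share of these methods take the "goal reacher" form, e.g., hDQN, HIRO, HAC, Feudal Network, where low-level policies learn to reach goals from an intrinsic reward.

This form of hierarchical policy suits state-reaching tasks (such as Montezuma's Revenge). However, the intrinsic reward function is not part of the original MDP, which to some extent limits the use of the low-level policy in non-state-reaching tasks. In addition, on a new task the intrinsic reward function will likely have to be redesigned to adjust the low-level policy.

This paper instead approximates the hierarchical policy gradient directly and learns the high-level policy and the skills simultaneously from the extrinsic reward, which removes the dependence on an intrinsic reward function.

c. Another family of HRL methods builds on the option framework, e.g., OC, A2OC. The most fragile part of these methods is the termination function, which determines when an option should terminate. They usually need a regularizer to avoid skill collapse, for example ending up with only one useful option among many (this also seems to be a common criticism of OC).

This paper uses random-length skills, which avoids training a termination function and also yields more stable skills.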

2) Preliminaries

A typical two-level hierarchy, as illustrated in the paper's figure: the high-level policy outputs a latent code z, and the low-level policy takes z and outputs primitive actions. Each high-level action lasts p steps, where p is a random variable.
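The two-level sampling loop described above can be sketched as follows. This is only an illustration under stated assumptions: `rollout`, `sample_p`, and the toy policies and dynamics are hypothetical names and stand-ins, not code from the paper.

```python
import numpy as np

def rollout(high_policy, low_policy, env_step, sample_p, s0, horizon, rng):
    """Collect one trajectory from a two-level policy (hypothetical sketch)."""
    s = s0
    states, latents, actions, new_segment = [], [], [], []
    while len(actions) < horizon:
        z = high_policy(s, rng)   # high level picks a latent code z
        p = sample_p(rng)         # commitment length for this segment (must be >= 1)
        for i in range(p):
            if len(actions) >= horizon:
                break
            a = low_policy(s, z, rng)   # low level acts conditioned on z
            states.append(s)
            latents.append(z)
            actions.append(a)
            new_segment.append(i == 0)  # True where the high level resampled z
            s = env_step(s, a)
    return states, latents, actions, new_segment

# Toy usage: scalar state, 4 skills, p drawn uniformly from {2, ..., 5} (all assumed).
rng = np.random.default_rng(0)
n_skills, t_min, t_max = 4, 2, 5
states, latents, actions, new_segment = rollout(
    high_policy=lambda s, rng: int(rng.integers(n_skills)),
    low_policy=lambda s, z, rng: float(z) + 0.1 * rng.standard_normal(),
    env_step=lambda s, a: s + a,
    sample_p=lambda rng: int(rng.integers(t_min, t_max + 1)),
    s0=0.0, horizon=20, rng=rng,
)
```

The `new_segment` flags mark the time steps kp at which the high-level policy is queried, which is where the high-level terms of the gradient are evaluated.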

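3) Approximate Hierarchical PG

a. The general form of the PG is given in Eq. 1, where τ is a state-action trajectory.

Under the HRL model above, the probability of a trajectory can be written as in Eq. 2.

The first product runs over the k high-level segments, the second product is the probability of the action trajectory within each segment, and the last term is the dynamics.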
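The equation images did not survive in these notes, so the following is a reconstruction from the description above and not a verbatim copy of the paper. The symbols θ_h, θ_l (high- and low-level parameters), m (number of latent codes), H (horizon), and the exact index ranges are assumed notation, and p is held fixed for readability.

```latex
% Eq. 1: policy gradient over state-action trajectories \tau
\[
\nabla_\theta \eta(\pi_\theta)
  = \mathbb{E}_{\tau}\!\left[ \nabla_\theta \log P(\tau)\, R(\tau) \right]
\]

% Eq. 2: trajectory probability under the two-level policy
\[
P(\tau) =
  \left( \prod_{k=0}^{H/p-1} \left[ \sum_{j=1}^{m} \pi_{\theta_h}(z_j \mid s_{kp})
      \prod_{t=kp}^{(k+1)p-1} \pi_{\theta_l}(a_t \mid s_t, z_j) \right] \right)
  \left[ P(s_0) \prod_{t=0}^{H-1} P(s_{t+1} \mid s_t, a_t) \right]
\]
```

The sum over j appears because the latent codes are not part of the state-action trajectory and have to be marginalized out.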

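b. Eq. 2 contains a sum over the high-level policy's choices, so the log operator cannot turn Eq. 2 into a sum of logs, and the PG becomes hard to compute in a stable way.

Consider differentiating Eq. 2 directly.

The difficult part is the sum in Eq. 2. Looking at each term of that sum, it is the gradient of the sub-trajectory probability under a particular z_j.

Here the paper assumes that skills are easy to distinguish, in other words diverse: for each action, only one sub-policy assigns it a relatively high probability.

Concretely, if z_j is not the latent selected at time kp, the probability that it picks the current action is below ε, so the probability of the whole sub-trajectory under that z_j is upper-bounded by ε^p.

Replacing the sum accordingly gives Lemma 1.

What Lemma 1 gives is exactly the gradient of the log-probability of the state-latent-action trajectory.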
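Written out, the assumption and the resulting approximation look roughly as follows. This is a hedged reconstruction from the text: the notation (θ_h, θ_l, H, fixed p) is assumed, and the approximation sign stands in for an error term that vanishes as ε goes to 0, whose exact order is given in the paper's appendix.

```latex
% Diversity assumption: for every step t in segment k and every z_j \neq z_{kp}
\[
\pi_{\theta_l}(a_t \mid s_t, z_j) < \epsilon
\;\Longrightarrow\;
\prod_{t=kp}^{(k+1)p-1} \pi_{\theta_l}(a_t \mid s_t, z_j) < \epsilon^{p}
\]

% Lemma 1 (approximate form): only the selected latent z_{kp} survives in each segment
\[
\nabla_\theta \log P(\tau) \approx
  \sum_{k=0}^{H/p-1} \nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp})
  + \sum_{t=0}^{H-1} \nabla_\theta \log \pi_{\theta_l}\!\left(a_t \mid s_t, z_{\lfloor t/p \rfloor p}\right)
\]
```

The right-hand side is the gradient of log P(s_0, z_0, a_0, s_1, ...), i.e., the trajectory with the latents treated as observed.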

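(See the appendix of the paper for the full proof.)

c. With the approximate gradient from Lemma 1, the policy parameters can be updated with a PG algorithm.

To deal with the variance of PG, the standard trick is to subtract a state-dependent baseline, which leaves the estimate unbiased. Here, for the hierarchical model, a state-latent-dependent baseline is designed for the low-level policy; since the latent does not depend on the low-level policy's action, it can be shown to be unbiased in the same way.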
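The unbiasedness argument can be checked numerically in a small case. The sketch below assumes a softmax low-level policy over discrete actions (an assumption made for illustration; `softmax` and `expected_baseline_term` are hypothetical helpers) and evaluates the expectation of the score function times a baseline b(s, z) that does not depend on the action.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

def expected_baseline_term(logits, baseline):
    """E_{a ~ pi}[ grad_logits log pi(a|s,z) * b(s,z) ] for a softmax policy."""
    probs = softmax(logits)
    n = len(probs)
    # Row a holds d log pi(a) / d logits = onehot(a) - probs.
    grad_log_pi = np.eye(n) - probs
    # Weight each row by pi(a) and sum: the exact expectation over actions.
    return baseline * (probs @ grad_log_pi)

# Arbitrary logits for one fixed (s, z) pair and an arbitrary baseline value.
term = expected_baseline_term(np.array([0.3, -1.2, 2.0]), baseline=5.0)
```

The result is the zero vector up to floating-point error, because the expectation reduces to b(s, z) times the gradient of the probabilities' sum, and that sum is always 1.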

This gives the advantage-based approximate hierarchical PG.
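A hedged reconstruction of that estimator, combining Lemma 1 with the two baselines, is given below. The advantage notation is assumed: A(s_{kp}, z_{kp}) uses a state-dependent baseline for the high level, and A(s_t, a_t, z_{kp}) uses the state-latent-dependent baseline for the low level.

```latex
\[
\hat{g} = \mathbb{E}_{\tau}\!\left[
  \sum_{k=0}^{H/p-1} \nabla_\theta \log \pi_{\theta_h}(z_{kp} \mid s_{kp})\, A(s_{kp}, z_{kp})
  + \sum_{t=0}^{H-1} \nabla_\theta \log \pi_{\theta_l}\!\left(a_t \mid s_t, z_{\lfloor t/p \rfloor p}\right)
      A\!\left(s_t, a_t, z_{\lfloor t/p \rfloor p}\right)
\right]
\]
```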

4) HiPPO

a. The PG derived above is implemented with PPO.
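One plausible reading of "implemented with PPO" is to apply the standard PPO clipped surrogate separately to the high-level and low-level terms and add them. The sketch below follows that reading; the function names, the shared `clip_eps`, and the plain sum over samples are assumptions, not details confirmed by these notes.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.minimum(ratio * advantage, clipped * advantage)

def hippo_clip_objective(ratio_high, adv_high, ratio_low, adv_low, clip_eps=0.2):
    """Sum of clipped surrogates over high-level steps (t = kp) and all low-level steps."""
    high_term = clipped_surrogate(ratio_high, adv_high, clip_eps).sum()
    low_term = clipped_surrogate(ratio_low, adv_low, clip_eps).sum()
    return high_term + low_term

# Toy numbers: one high-level decision and two low-level steps.
objective = hippo_clip_objective(
    ratio_high=[1.5], adv_high=[1.0],
    ratio_low=[0.5, 1.0], adv_low=[-1.0, 2.0],
)
```

Here `ratio_high` would be the probability ratio of the new to the old high-level policy at the resampling steps, and `ratio_low` the corresponding ratio for the low-level policy at every step.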

b. varying time-commitment

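Most previous methods either use a fixed time-commitment, i.e., a high-level action is held for a fixed number of steps, or rely on the more complex option framework, e.g., with a termination function.

The paper instead proposes varying time-commitment, sampling the commitment length from the interval [Tmin, Tmax], which to some extent avoids skill collapse and improves robustness.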

c. information bottleneck low-level observation

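Mask the environment state before it is passed to the low-level policy.

By masking out task-related information, the skills act on proprioceptive information only, and all task-related information comes from the latent code.

This is effectively a lossy compression, which likewise helps the quality of the learned skills and their transferability to new tasks.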
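As a minimal sketch of the masking step, assume the observation is a flat vector whose proprioceptive entries are known by index. The index set, the one-hot encoding of the latent, and both function names are hypothetical choices made for illustration.

```python
import numpy as np

def mask_observation(obs, proprio_idx):
    """Keep only the proprioceptive entries of the observation."""
    obs = np.asarray(obs, dtype=float)
    return obs[..., proprio_idx]

def low_level_input(obs, z, n_skills, proprio_idx):
    """Low-level policy input: masked observation concatenated with a one-hot latent."""
    one_hot = np.zeros(n_skills)
    one_hot[z] = 1.0
    return np.concatenate([mask_observation(obs, proprio_idx), one_hot])

# Toy observation: first three entries are joint readings, last two are task-related.
obs = [0.1, 0.2, 0.3, 9.0, 8.0]
x = low_level_input(obs, z=1, n_skills=3, proprio_idx=[0, 1, 2])
```

With this layout the low-level policy never sees the last two entries, so anything it needs to know about the task has to arrive through z.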

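5) Overall algorithm

Experiments:

The environments are mainly Snake Gather and Ant Gather and their variants.

1) Comparison with flat RL methods

The main comparison is with PPO. To show that the benefit comes from HRL itself and not merely from temporally extended exploration, the paper also compares against PPO with action repeat and HiPPO with p=1.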

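2) Robustness to dynamics perturbations

The policy is trained on the original task and then evaluated on perturbed tasks obtained by changing some environment settings, e.g., physical properties in MuJoCo.

The authors conclude that HiPPO with random p has the best zero-shot transfer performance (though the results in the paper's table do not seem to support this conclusion...).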

3) Adaptation of pre-trained skills

Six pre-trained skills are obtained with SNN, and an experiment is run to show HiPPO's ability to further improve pre-trained skills.

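4) Skill diversity assumption

Finally, the authors go back and check the assumption behind Lemma 1, namely that every non-selected sub-policy assigns a low probability to the action actually taken.

The results in the table indicate that the learned policies are largely consistent with the assumption.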
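A check of this kind can be sketched as follows, assuming one has already evaluated every sub-policy on the actions from collected rollouts. The array layout and the function name are hypothetical; for continuous actions the entries would be densities, not probabilities.

```python
import numpy as np

def max_off_skill_prob(action_probs, active_latent):
    """Largest pi(a_t | s_t, z_j) over all steps t and all z_j other than the active latent.

    action_probs: array of shape (T, m), entry [t, j] = pi(a_t | s_t, z_j)
    active_latent: length-T integer array, index of the latent in use at step t
    """
    action_probs = np.asarray(action_probs, dtype=float)
    active_latent = np.asarray(active_latent)
    mask = np.ones_like(action_probs, dtype=bool)
    # Exclude the active latent's own probability at each step.
    mask[np.arange(len(active_latent)), active_latent] = False
    return action_probs[mask].max()

# Toy data: two steps, three skills.
probs = [[0.90, 0.05, 0.01],
         [0.02, 0.80, 0.10]]
eps_hat = max_off_skill_prob(probs, active_latent=[0, 1])
```

A small `eps_hat` relative to the active skill's own probabilities would indicate that the skills are well separated in the sense the lemma requires.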

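(A small bug)

Summary:

1) The way the paper arrives at the approximate hierarchical PG is similar in spirit to OC: both derive the PG under a hierarchical model, in different ways. The approximate hierarchical PG does offer a new way to train hierarchical policies.

2) The skill diversity assumption underlying Lemma 1 is similar to what DIAYN expects of skills.

3) Varying time-commitment and the information bottleneck feel like modest contributions to me, and the experiments do not convincingly demonstrate the effectiveness of these two factors.

4) Comparisons with state-of-the-art HRL methods are missing.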