RL

Multi-task Learning

Ideas

Meta-learning [C]

black-box approaches [C]

optimization-based [C]

metric learning [C]

meta-overfitting

unsupervised

goal-conditioned

Hierarchical

Lifelong

AutoML

Use different agents in different scenes and switch between them, while a single agent can also live across environments

Objective

various heuristics

use task uncertainty (see the sketch below)

aim for monotonic improvement towards Pareto optimal solution

optimize for the worst-case task loss

Chen et al. GradNorm. ICML 2018

Kendall et al. CVPR 2018

Sener et al. NeurIPS 2018
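
A minimal sketch of the task-uncertainty heuristic, in the spirit of Kendall et al. (CVPR 2018): each task loss is scaled by a learned precision, with a log-variance penalty. The module name and exact constants are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Weights per-task losses by learned homoscedastic uncertainty.
    Each task i keeps a learned log-variance; its loss is scaled by the
    precision, and a log-variance penalty discourages ignoring the task."""

    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        # task_losses: iterable of scalar loss tensors, one per task
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])  # ~ 1 / sigma_i^2
            total = total + precision * loss + self.log_vars[i]
        return total

# usage sketch: optimize the weighting parameters jointly with the model
# opt = torch.optim.Adam(list(model.parameters()) + list(weighting.parameters()))
```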

Pareto optimal

From loss perspective

θ is Pareto optimal if there exists no θ′ that dominates θ

\( \theta_a \) dominates \( \theta_b \) if \( \mathcal{L}_i(\theta_a) \leq \mathcal{L}_i(\theta_b) \ \forall i\) and \(\sum_i\mathcal{L}_i(\theta_a)\neq\sum_i\mathcal{L}_i(\theta_b)\)
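
A small helper illustrating the dominance check above; the strictly-better-somewhere test used here is equivalent to the unequal-sums condition given the element-wise \(\leq\).

```python
def dominates(losses_a, losses_b):
    """theta_a dominates theta_b: no worse on every task loss and strictly
    better on at least one."""
    no_worse = all(a <= b for a, b in zip(losses_a, losses_b))
    strictly_better = any(a < b for a, b in zip(losses_a, losses_b))
    return no_worse and strictly_better

# dominates([0.1, 0.2], [0.1, 0.3]) -> True
# dominates([0.1, 0.4], [0.2, 0.3]) -> False (worse on the second task)
```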

Challenges

Negative transfer

limited representation capacity

often need much larger networks

optimization challenges

tasks may learn at different rates

caused by cross-task interference

overfitting

Architecture

Multi-head

soft-parameter sharing

Multi-gate Mixture-of-Experts (MMoE)
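
A rough MMoE forward pass, assuming per-task softmax gates over shared experts; layer sizes, gate inputs, and scalar task heads are illustrative choices, not the original architecture.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Multi-gate Mixture-of-Experts sketch: shared experts, one softmax
    gate per task that mixes expert outputs before a task-specific head."""

    def __init__(self, in_dim, expert_dim, num_experts, num_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(num_experts)])
        self.gates = nn.ModuleList(
            [nn.Linear(in_dim, num_experts) for _ in range(num_tasks)])
        self.heads = nn.ModuleList(
            [nn.Linear(expert_dim, 1) for _ in range(num_tasks)])

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        outputs = []
        for gate, head in zip(self.gates, self.heads):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)           # (B, E, 1)
            mixed = (w * expert_out).sum(dim=1)                        # (B, D)
            outputs.append(head(mixed))
        return outputs
```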

Transfer learning

Fine-tuning

common practices [C]

Challenge: outputting all neural net parameters does not seem scalable

Only output sufficient statistics

External Memory

Mishra et al. SNAIL, 17

Neural Turing Machine

Santoro et al. MANN

Munkhdalai & Yu, Meta Networks, ICML 17

Feedforward+average

Garnelo, Conditional Neural Processes, ICML 18
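
A sketch of the feedforward-plus-average idea behind CNP-style models: encode each context (x, y) pair, average into a permutation-invariant representation, and condition the decoder on it. Network sizes and the Gaussian output head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConditionalNeuralProcess(nn.Module):
    """Feedforward + average sketch: per-pair encoder, mean aggregation,
    decoder conditioned on the aggregated representation."""

    def __init__(self, x_dim=1, y_dim=1, r_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim + y_dim, r_dim), nn.ReLU(),
                                     nn.Linear(r_dim, r_dim))
        self.decoder = nn.Sequential(nn.Linear(x_dim + r_dim, r_dim), nn.ReLU(),
                                     nn.Linear(r_dim, 2 * y_dim))  # mean and log-variance

    def forward(self, x_ctx, y_ctx, x_tgt):
        r = self.encoder(torch.cat([x_ctx, y_ctx], dim=-1)).mean(dim=0)  # average over context
        r = r.expand(x_tgt.shape[0], -1)
        mean, log_var = self.decoder(torch.cat([x_tgt, r], dim=-1)).chunk(2, dim=-1)
        return mean, log_var
```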

Finn, ICML 2017 MAML: \(\min_\theta\sum_{\text{task } i} \mathcal{L}\big(\theta - \alpha\nabla_\theta \mathcal{L}(\theta, \mathcal{D}_i^{\rm tr}),\ \mathcal{D}_i^{\rm ts} \big)\)

Ravi ICLR 17, precedes MAML: \(\phi_i=\theta - \alpha f(\theta, \mathcal{D}_i^{\rm tr}, \nabla_\theta\mathcal{L})\)

Finn & Levine ICLR 18, For a sufficiently deep network, MAML function can approximate any function of \(\mathcal{D}_i^{\rm tr}, x^{\rm ts}\)
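
A minimal evaluation of the MAML meta-objective above, sketched for a linear regression model with a single inner gradient step; the model, loss, and task format are illustrative.

```python
import torch
import torch.nn.functional as F

def maml_outer_loss(theta, tasks, alpha=0.01):
    """MAML meta-objective sketch for y = x @ w + b with theta = (w, b).
    `tasks` is a list of (x_tr, y_tr, x_ts, y_ts) tensors."""
    w, b = theta
    outer = 0.0
    for x_tr, y_tr, x_ts, y_ts in tasks:
        inner_loss = F.mse_loss(x_tr @ w + b, y_tr)
        # create_graph=True keeps second-order terms for the meta-gradient
        gw, gb = torch.autograd.grad(inner_loss, (w, b), create_graph=True)
        w_i, b_i = w - alpha * gw, b - alpha * gb   # adapted parameters phi_i
        outer = outer + F.mse_loss(x_ts @ w_i + b_i, y_ts)
    return outer

# usage sketch: meta-gradient step on theta
# w = torch.zeros(3, 1, requires_grad=True); b = torch.zeros(1, requires_grad=True)
# loss = maml_outer_loss((w, b), sampled_tasks); loss.backward()  # then optimizer step
```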

challenges

Bi-level optimization can exhibit instabilities

Backpropagating through many inner gradient steps is compute- and memory-intensive

Automatically learn inner vector learning rate, tune outer learning rate

Optimize only a subset of the parameters in the inner loop

Decouple inner learning rate, BN statistics per-step

introduce context variables for increased expressive power

Antoniou et al. MAML++

Li et al. Meta-SGD

Behl et al. AlphaMAML

Finn et al. bias transformation

Zintgraf et al. CAVIA

Zhou et al. DEML

Crudely approximate \(\frac{d\phi_i}{d\theta}\) as identity [C]

Finn et al. first-order MAML 17

Nichol et al. Reptile 18
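
A sketch of a Reptile-style update in this first-order family: adapt a copy of the model with a few SGD steps on one task, then interpolate the meta-parameters toward the adapted ones. The loss, batching, and hyperparameters are placeholders.

```python
import copy
import torch
import torch.nn.functional as F

def reptile_step(model, task_batches, inner_steps=5, inner_lr=0.01, meta_lr=0.1):
    """Reptile-style update: theta <- theta + eps * (phi - theta),
    where phi comes from a few inner SGD steps on a sampled task."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for (x, y), _ in zip(task_batches, range(inner_steps)):
        opt.zero_grad()
        F.mse_loss(adapted(x), y).backward()
        opt.step()
    with torch.no_grad():
        for p, p_adapted in zip(model.parameters(), adapted.parameters()):
            p += meta_lr * (p_adapted - p)
```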

Only optimize the last layer of weights [C]

Bertinetto et al. R2-D2 19, ridge regression, logistic regression

Lee et al. MetaOptNet 19

Derive meta-gradient using the implicit function theorem [C]

Rajeswaran, Finn, Implicit MAML 19

How to choose an architecture that is effective for inner gradient steps?

Kim et al. Auto-Meta, progressive NAS+MAML [C]

Use non-parametric learner

Koch et al. ICML 15, Siamese network

Matching

Vinyals et al. Matching Networks, NIPS 16

Snell et al. Prototypical Networks, NIPS 17
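
A sketch of the Prototypical Networks classification step on precomputed embeddings; the embedding network itself is assumed to exist elsewhere.

```python
import torch

def prototypical_logits(support_emb, support_labels, query_emb, num_classes):
    """Prototypes are per-class mean support embeddings; queries are scored
    by negative squared Euclidean distance (softmax over classes for probs)."""
    prototypes = torch.stack(
        [support_emb[support_labels == c].mean(dim=0) for c in range(num_classes)])
    return -torch.cdist(query_emb, prototypes) ** 2   # (num_query, num_classes)
```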

challenges

More complex relationships between datapoints

Sung et al. Relation Net, learn non-linear relation module on embeddings

Allen et al. IMP, ICML 19, Learn infinite mixture of prototypes

Garcia, GNN, perform message passing on embeddings

Prabhu et al. NeurIPS 18 workshop, Prototypical Clustering Networks for Dermatological Image Classification [C]

Triantafillou et al. Proto-MAML 19, initialize last layer as ProtoNet during meta-training

Rusu et al. LEO 19, gradient descent on relation net embedding

Yu, Finn et al. One-shot Imitation from Observing Humans, RSS2018

Meta-Learning GNN Initializations for Low-Resource Molecular Property Prediction, 2020

Few-Shot Human Motion Prediction via Meta-learning, ECCV 2018

Transformer

Brown et al., 2020, GPT-3

Hindsight Labeling [C]

challenges

Model-based [C]

Optimize over actions using model \({\rm max}_{a_{t:t+H}} \sum_t r(s_t, a_t)\)

backpropagation

sampling

Plan & replan using model

model-predictive control (MPC) [C]

with known reward function

without knowing the reward function
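
A random-shooting MPC sketch for the known-reward case: sample candidate action sequences, score them under the learned dynamics model, execute only the first action of the best sequence, and replan. `dynamics_fn`, `reward_fn`, and the uniform action sampling are stand-ins, not a specific paper's method.

```python
import numpy as np

def random_shooting_mpc(state, dynamics_fn, reward_fn, action_dim,
                        horizon=10, num_samples=1000):
    """Sample action sequences, roll them out through the model, and return
    the first action of the best sequence (replanning happens every step)."""
    seqs = np.random.uniform(-1.0, 1.0, size=(num_samples, horizon, action_dim))
    returns = np.zeros(num_samples)
    for i, actions in enumerate(seqs):
        s = state
        for a in actions:
            returns[i] += reward_fn(s, a)   # assumes the reward function is known
            s = dynamics_fn(s, a)           # learned dynamics model
    return seqs[np.argmax(returns), 0]
```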

Nagabandi, Deep Dynamics Models for Learning Dexterous Manipulation, CoRL 19

Xie, Few-shot Goal Inference for Visuomotor Learning and Planning, CoRL 18

High-dimensional image observations

Models in latent space [C]

models directly in image space [C]

model alternative quantities [C]

Watter, Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images, NIPS 2015 [C]

Finn, Deep Spatial Autoencoders for Visuomotor Learning, ICRA 2016

also predict reward

Jaderberg 17

Shelhamer 17

MPC

Finn, CoRL 17

Villegas, NIPS 19

Finn, Deep Visual Foresight for Planning Robot Motion, ICRA 17

Pinto 16

Kahn 17

Dosovitskiy 17

RL [C]

Duan, RL², 17

Wang, Learning to Reinforcement Learn, CogSci 17

Rakelly, PEARL, Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables, ICML 19

RL [C]

MAML+PG

MAML+MBRL

Nagabandi, Learning to Adapt in Dynamic Environments through Meta-RL, ICLR 19

Exploration

Learning to Explore

End-to-end optimization [C]

Alternative exploration strategies [C]

Decoupled exploration & exploitation

Finn, Learning to learn with gradients, Phd Thesis 2018

Stadie 2018

Zintgraf 2019

Kamienny 2020

Posterior sampling (Thompson sampling)

task dynamics & reward prediction

Zhang, MetaCURE 2020

Decouple by acquiring representation of task relevant information

Liu, Explore then Execute: Adapting without Rewards via Factorized Meta-RL, 2020

Information theoretic

entropy

Mutual information

KL-divergence

\(\mathcal{H}(p(x))=-E_{x \sim p(x)}[\log p(x)]\)

\(D_{\rm KL}(q\,\|\,p) = E_q\big[\log\tfrac{q(x)}{p(x)}\big] = -E_q[\log p(x)]-\mathcal{H}(q(x))\)

How broad \(p(x)\) is

Distance between two distributions

\(\mathcal{I}(x;y)=D_{\rm KL}(p(x,y) || p(x)p(y))=\mathcal{H}(p(x))-\mathcal{H}(p(x|y))\)

if x and y are independent, the mutual information is zero

\(\mathcal{I}(s_{t+1};a_t)=\mathcal{H}(s_{t+1})-\mathcal{H}(s_{t+1}|a_t)\)
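
Small numeric checks of the definitions above for discrete distributions, including the independence case where the mutual information is exactly zero.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

def kl(q, p):
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return np.sum(q * np.log((q + 1e-12) / (p + 1e-12)))

def mutual_information(p_xy):
    """I(x; y) = KL(p(x, y) || p(x) p(y)) for a discrete joint distribution."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    return kl(p_xy.ravel(), np.outer(p_x, p_y).ravel())

print(mutual_information(np.outer([0.5, 0.5], [0.3, 0.7])))  # independent -> ~0
print(mutual_information([[0.45, 0.05], [0.05, 0.45]]))       # correlated  -> > 0
```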

skill

Eysenbach, Diversity is All You Need [C]

Sharma Gu, DADS, 2019 [C]
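
A sketch of a DIAYN-style intrinsic reward for skill discovery, r = log q(z|s) − log p(z) with a uniform skill prior; the discriminator interface is an assumption.

```python
import torch
import torch.nn.functional as F

def skill_reward(discriminator, state, skill_id, num_skills):
    """DIAYN-style intrinsic reward. `discriminator` is assumed to map a
    state to logits over skills; p(z) is taken to be uniform."""
    log_q_z = F.log_softmax(discriminator(state), dim=-1)[skill_id]  # log q(z | s)
    log_p_z = -torch.log(torch.tensor(float(num_skills)))            # log of uniform prior
    return log_q_z - log_p_z
```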

design choices

Nachum, Why Does Hierarchy (Sometimes) Work? 2019 [C]

off-policy

Nachum, HIRO, 2018

self-terminating

Bacon, The Option-Critic Architecture, 2016

pretrain

Heess, Learning Locomotor Controllers, 2016

goal-condition

Gupta, Relay Policy Learning, 2019

T. Yu, S. Kumar, PCGrad: Gradient Surgery for Multi-Task Learning, NeurIPS 20
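
A sketch of the PCGrad projection: when two task gradients conflict (negative inner product), project the gradient being updated onto the normal plane of the other, then sum. Gradient flattening, the epsilon term, and the fixed (non-random) task ordering are simplifications for illustration.

```python
import torch

def pcgrad_combine(task_grads, eps=1e-12):
    """Gradient surgery sketch. `task_grads` are flattened 1-D gradient tensors,
    one per task; returns a single combined gradient."""
    projected = [g.clone() for g in task_grads]
    for i, g_i in enumerate(projected):
        for j, g_j in enumerate(task_grads):
            if i == j:
                continue
            dot = torch.dot(g_i, g_j)
            if dot < 0:                                  # conflicting gradients
                g_i -= dot / (g_j.norm() ** 2 + eps) * g_j
    return torch.stack(projected).sum(dim=0)
```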