Importance Sampling in Reinforcement Learning


In Monte Carlo off-policy learning, we use one policy to generate data and use that data to optimize another policy. How can we realize this seemingly fantastic idea? Is it even possible?

In this post, I will show you how to do it with minimal math and a little Python knowledge.

How to understand importance sampling?

To optimize a policy from the data generated by another policy, we need to use importance sampling.

Problem definition

We have a target policy $\pi$ and a known soft behavior policy $b$ (e.g., a Gaussian distribution). We want to collect data by following policy $b$ and use that data to perform GPI to optimize $\pi$.

Math proof

In the language of mathematics, $\pi$ and $b$ are distributions, and sampling from a distribution produces a result, in this case an episode. Let $f$ be a function of the sampled value (here, the return of an episode). Its expected value under a density $p(z)$ is

$$\mathbb{E}[f]=\int f(z)\, p(z)\, dz \tag{1}$$

In what follows we use $p(z)$ to denote $\pi$ and $q(z)$ to denote $b$, since we are working in probabilistic notation, and we define the importance weight $w(z) = \frac{p(z)}{q(z)}$.

General importance sampling

Assuming $q(z)>0$ whenever $p(z)>0$, for all $z\in Z$,

$$\begin{aligned} \mathbb{E}[f] &=\int f(z)\, p(z)\, dz \\ &=\int f(z) \frac{p(z)}{q(z)} q(z)\, dz \\ & \approx \frac{1}{N} \sum_{n} \frac{p(z^{n})}{q(z^{n})} f(z^{n}), \quad z^{n} \sim q(z) \end{aligned}$$

where $w^{n}=p(z^{n})/q(z^{n})$ are the importance weights. We can use this formula to estimate the expected value. However, this estimator can have large variance, which makes it unstable, so we turn to another method.
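As a minimal sketch of this estimator, assume a hypothetical standard-normal target $p$ and a wider Gaussian proposal $q = \mathcal{N}(1, 2^2)$, with $f(z) = z^2$, so the true value $\mathbb{E}_p[f] = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: target p = N(0, 1), proposal q = N(1, 2^2).
def p(z):  # target density (normalized)
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def q(z):  # proposal density (normalized)
    return np.exp(-(z - 1)**2 / 8) / np.sqrt(8 * np.pi)

f = lambda z: z**2  # E_p[f] = 1 for a standard normal

z = rng.normal(1, 2, size=200_000)  # z^n ~ q
w = p(z) / q(z)                     # importance weights w^n
estimate = np.mean(w * f(z))        # (1/N) * sum_n w^n f(z^n)
print(estimate)                     # close to 1
```

Note that $q$ is deliberately wider than $p$, which keeps the weights bounded; a narrow proposal would blow up the variance this section warns about.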

Weighted importance sampling

When $p$ and $q$ are only known up to normalizing constants, write $p(z)=\frac{\tilde{p}(z)}{Z_{p}}$ and $q(z)=\frac{\tilde{q}(z)}{Z_{q}}$, where $\tilde{p}(z)$ is the unnormalized density and $Z_{p}=\int \tilde{p}(z)\, dz$ normalizes it (and similarly for $q$). Then

$$\begin{aligned} (1) & =\int f(z) \frac{p(z)}{q(z)} q(z)\, dz \\ &=\frac{Z_{q}}{Z_{p}} \int f(z) \frac{\tilde{p}(z)}{\tilde{q}(z)} q(z)\, dz \\ & \approx \frac{Z_{q}}{Z_{p}} \frac{1}{N} \sum_{n} \frac{\tilde{p}(z^{n})}{\tilde{q}(z^{n})} f(z^{n}) \\ &=\frac{Z_{q}}{Z_{p}} \frac{1}{N} \sum_{n} w^{n} f(z^{n}) \tag{2} \end{aligned}$$

where now $w^{n}=\tilde{p}(z^{n})/\tilde{q}(z^{n})$.

Here, what does $\frac{Z_{q}}{Z_{p}}$ mean? It is a constant, and we can estimate its inverse from the same samples:

$$\begin{aligned} \frac{Z_{p}}{Z_{q}} &=\frac{1}{Z_{q}} \int \tilde{p}(z)\, dz \\ &=\int \frac{\tilde{p}(z)}{\tilde{q}(z)} q(z)\, dz \\ &\approx \frac{1}{N} \sum_{n} \frac{\tilde{p}(z^{n})}{\tilde{q}(z^{n})} \\ &=\frac{1}{N} \sum_{n} w^{n} \tag{3} \end{aligned}$$

Also note that when $p$ and $q$ are already normalized, the weights average to one in expectation, i.e. $\frac{Z_{p}}{Z_{q}}=1$, since

$$\begin{aligned} \mathbb{E}_{q}[w] &=\int_{D} w(z)\, q(z)\, dz \\ &=\int_{D} \frac{p(z)}{q(z)} q(z)\, dz \\ &=\int_{D} p(z)\, dz = 1 \end{aligned}$$
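As a quick sanity check of this property, the sketch below (using hypothetical normalized Gaussian densities for $p$ and $q$) verifies that the sample mean of the importance weights is close to one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Normalized densities: p = N(0, 1), q = N(1, 2^2).
p = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
q = lambda z: np.exp(-(z - 1)**2 / 8) / np.sqrt(8 * np.pi)

z = rng.normal(1, 2, size=200_000)  # samples from q
w = p(z) / q(z)                     # importance weights
print(w.mean())                     # close to 1, as E_q[w] = 1
```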

Combining $(2)$ and $(3)$, we get

$$\mathbb{E}[f] \approx \sum_{n=1}^{N} \frac{w^{n}}{\sum_{m=1}^{N} w^{m}} f(z^{n}), \quad z^{n} \sim q(z) \tag{4}$$

which is also called weighted importance sampling.
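A minimal sketch of estimator $(4)$ with hypothetical unnormalized Gaussian densities, where the normalizing constants are never needed (target $\mathcal{N}(0,1)$, proposal $\mathcal{N}(1, 2^2)$, $f(z)=z$, so the true value is $0$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized densities p~ and q~ (constants dropped on purpose).
p_tilde = lambda z: np.exp(-z**2 / 2)        # N(0, 1) up to a constant
q_tilde = lambda z: np.exp(-(z - 1)**2 / 8)  # N(1, 2^2) up to a constant
f = lambda z: z                              # E_p[f] = 0

z = rng.normal(1, 2, size=200_000)       # z^n ~ q
w = p_tilde(z) / q_tilde(z)              # unnormalized weights w^n
estimate = np.sum(w * f(z)) / np.sum(w)  # eq. (4): self-normalizing
print(estimate)                          # close to 0
```

Because the weights appear in both numerator and denominator, any missing normalizing constant cancels, which is exactly why this form is so useful.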

Connections to reinforcement learning

For the probability of the rest of the episode after $S_{t}$, under policy $\pi$, we have

$$\begin{aligned} &\Pr\{A_{t}, S_{t+1}, A_{t+1}, \ldots, S_{T} \mid S_{t}, A_{t:T-1} \sim \pi\} \\ &\quad=\pi(A_{t} \mid S_{t})\, p(S_{t+1} \mid S_{t}, A_{t})\, \pi(A_{t+1} \mid S_{t+1}) \cdots p(S_{T} \mid S_{T-1}, A_{T-1}) \\ &\quad=\prod_{k=t}^{T-1} \pi(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k}) \end{aligned}$$

Thus, the relative probability of the episode under the target and behavior policies (the importance-sampling ratio) is

$$\rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k})}{\prod_{k=t}^{T-1} b(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k})}=\prod_{k=t}^{T-1} \frac{\pi(A_{k} \mid S_{k})}{b(A_{k} \mid S_{k})}$$

Although the trajectory probabilities depend on the MDP’s transition probabilities, which are generally unknown, they appear identically in both the numerator and denominator, and thus cancel. The importance sampling ratio ends up depending only on the two policies and the sequence, not on the MDP!
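As a sketch of how this ratio is computed in practice, assume hypothetical tabular policies `pi` and `b` over a two-action problem; the ratio is just a product of per-step probability ratios and never touches the transition probabilities:

```python
import numpy as np

# Hypothetical tabular policies over 2 actions: pi[s] and b[s] are
# action-probability vectors. The ratio uses only these, never p(s'|s,a).
pi = {0: np.array([0.9, 0.1]), 1: np.array([0.2, 0.8])}
b  = {0: np.array([0.5, 0.5]), 1: np.array([0.5, 0.5])}

def importance_ratio(states, actions):
    """rho_{t:T-1} = prod_k pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for s, a in zip(states, actions):
        rho *= pi[s][a] / b[s][a]
    return rho

print(importance_ratio([0, 1, 0], [0, 1, 1]))
# (0.9/0.5) * (0.8/0.5) * (0.1/0.5) = 0.576
```

A return generated under $b$ that the target policy would rarely produce (like the last step above, where $\pi$ picks that action only 10% of the time) gets down-weighted accordingly.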

What a beautiful ratio! It means we only need to multiply the sampled returns $G_{t}$ by this ratio to obtain an expected value as if sampled from our target policy $\pi$. That is,

$$\mathbb{E}\left[\rho_{t:T-1} G_{t} \mid S_{t}=s\right]=v_{\pi}(s)$$

Then we can simply scale the returns by the ratios and average the results:

$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_{t}}{|\mathcal{T}(s)|}$$

$\mathcal{T}(s)$ denotes the set of all time steps at which state $s$ is visited, so $|\mathcal{T}(s)|$ is the number of such visits.

An important alternative is weighted importance sampling (following $(4)$), which uses a weighted average, defined as

$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_{t}}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$

If we define $W_{i}=\rho_{t_{i}:T(t_{i})-1}$, then

$$V_{n} \doteq \frac{\sum_{k=1}^{n-1} W_{k} G_{k}}{\sum_{k=1}^{n-1} W_{k}} \tag{5}$$

We can use $(5)$ to update $V$!
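Equation $(5)$ can be computed incrementally by keeping a cumulative weight $C$ alongside $V$, in the standard form $V \leftarrow V + \frac{W}{C}(G - V)$. A minimal sketch with hypothetical $(G, W)$ pairs:

```python
def weighted_is_update(V, C, G, W):
    """One weighted-importance-sampling update for a single state."""
    C += W                    # accumulate the importance weight
    V += (W / C) * (G - V)    # move V toward G, proportional to W's share
    return V, C

V, C = 0.0, 0.0
# Hypothetical (return, ratio) pairs from episodes generated by b:
for G, W in [(1.0, 0.5), (0.0, 2.0), (1.0, 1.0)]:
    V, C = weighted_is_update(V, C, G, W)
print(V)  # (0.5*1 + 2*0 + 1*1) / (0.5 + 2 + 1) = 1.5/3.5
```

After the loop, $V$ equals exactly the batch estimate $(5)$, so nothing is lost by updating online.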


From the above, we can see that an importance-sampling ratio lets the data generated by policy $b$ be used by policy $\pi$. But how can we understand it intuitively?

In fact, $w_{i}$ captures how differently $\pi$ and $b$ weight the same outcomes. For example, suppose yyxsb and a robot are playing a game. yyxsb plays poorly, but he collects a lot of data for the robot to learn from. The returns $G$ from episodes generated by yyxsb follow a different distribution than those the robot's policy would produce, but $w_{i}$ bridges them. That's it.

Talk is cheap, show me the code

I will show through Python code that importance sampling really works. Let

$$P(x)=3 e^{-\frac{x^{2}}{2}}+e^{-\frac{(x-4)^{2}}{2}}$$

be the (unnormalized) distribution we want to sample from; over the interval $[-4, 8]$ used below, its normalizing constant is $Z \approx 10.0261955464$. Let the functions we want to integrate against $P$ be $f(x) = x$ and $g(x) = \sin(x)$. The code uses weighted importance sampling (eq. $(4)$) with a uniform proposal, then rescales by $Z$ to recover the unnormalized integrals $\int f(x)P(x)\,dx$ and $\int g(x)P(x)\,dx$.

import numpy as np

# Unnormalized target density P(x) and its normalizing constant over [-4, 8]
P = lambda x: 3 * np.exp(-x*x/2) + np.exp(-(x - 4)**2/2)
Z = 10.0261955464

f_x = lambda x: x
g_x = lambda x: np.sin(x)
# Reference values of the integrals of f(x)P(x) and g(x)P(x) over [-4, 8]
true_expected_fx = 10.02686647165
true_expected_gx = -1.15088010640

# Uniform proposal q(x) on [a, b]
a, b = -4, 8
uniform_prob = 1. / (b - a)

expected_f_x = 0.
expected_g_x = 0.
n_samples = 100000
den = 0.  # running sum of importance weights
for i in range(n_samples):
    sample = np.random.uniform(a, b)
    importance = P(sample) / uniform_prob  # unnormalized weight P(x)/q(x)
    den += importance
    expected_f_x += importance * f_x(sample)
    expected_g_x += importance * g_x(sample)
# Self-normalize (eq. (4)), then rescale by Z to get the unnormalized integrals
expected_f_x /= den
expected_g_x /= den
expected_f_x *= Z
expected_g_x *= Z
print('E[f(x)] = %.5f, Error = %.5f' % (expected_f_x, abs(expected_f_x - true_expected_fx)))
print('E[g(x)] = %.5f, Error = %.5f' % (expected_g_x, abs(expected_g_x - true_expected_gx)))

A sample run prints E[f(x)] = 10.02532, Error = 0.00155 and E[g(x)] = -1.17328, Error = 0.02240.

The sampling works!


In this post, I've talked about why and how importance sampling works. I hope it is clearer now.

