# Importance Sampling In Reinforcement learning

## Introduction

In Monte Carlo off-policy learning, we need to use one policy to generate data, and use the data to optimize another policy. How to realize this fantacy dream? Is this possible?

In this post, I will show you how to realize it with minimal math and some python konwlege.

## How to understand importance sampling?

To optimize a policy from the data generated by another policy, we need to use importance sampling.

### Problem definition

We have target policy $\pi$, and known soft policy $b$(like gaussian distribution), we want to collect data through policy $b$, and use the data to perform GPI to optimize $\pi$.

### Math proof

In math world, $\pi$ and $b$ are distributions, and with distribution, we get result, in this case, episode. In math language, with a function that generate episode, whose distribution is $\pi$. In formula, that is: $\mathbb{E}[f]=\int f(z) p(z) d \tag{1}$ We use $p(z)$ to present $\pi$, and $q(z)$ to presen $b$ in the context, since we are describing math. Here $w(z) = \frac{p(z)}{q(z)}$.

#### General importance sampling

When $q(z)>0$ if $p(z)>0$ for $\forall z\in Z$,
\begin{aligned} \mathbb{E}[f] &=\int f(z) p(z) d z \\ &=\int f(z) \frac{p(z)}{q(z)} q(z) d z \\ & \approx \frac{1}{N} \sum_{n} \frac{p\left(z^{n}\right)}{q\left(z^{n}\right)} f\left(z^{n}\right), z^{n} \sim q(z) \end{aligned} And $w^{n}=p\left(z^{n}\right) / q\left(z^{n}\right)$, which is importance weights. We can use this formula to calculate the expect value. However, this formula has big variance, which is not stable. Then we have another method.

#### Weighted importance sampling

\begin{aligned} (1) & =\int f(z) \frac{p(z)}{q(z)} q(z) d z \\ &=\frac{Z_{q}}{Z_{p}} \int f(z) \frac{\tilde{p}(z)}{\tilde{q}(z)} q(z) d z \\ & \approx \frac{\mathcal{Z}_{q}}{\mathcal{Z}_{p}} \frac{1}{N} \sum_{n} \frac{\tilde{p}\left(z^{n}\right)}{\tilde{q}\left(z^{n}\right)} f\left(z^{n}\right) \\ &=\frac{\mathcal{Z}_{q}}{\mathcal{Z}_{p}} \frac{1}{N} \sum_{n} w^{n} f\left(z^{n}\right) \tag{2} \end{aligned} Here, $p(x)=\frac{\hat{p}(x)}{Z}$, $\hat{p}(x)$ means unnormalized $p(x)$, and $Z$ is used to normalize $p(z)$, which equals to sum of $\hat{p}(x)$ over $x$.

Here, what does $\frac{\mathcal{Z}_{q}}{\mathcal{p}}$ mean? It’s a constant, and \begin{aligned} \frac{\mathcal{Z}_{p}}{\mathcal{Z}_{q}} &=\frac{1}{\mathcal{Z}_{q}} \int \tilde{p}(z) d z \\ &=\int \frac{\tilde{p}(z)}{\tilde{q}(z)} q(z) d z \\ &\approx \frac{1}{N} \sum_{n} \frac{\tilde{p}\left(z^{n}\right)}{\tilde{q}^{n}(z)} \\ &=\frac{1}{N} \sum_{n} w^{n} \tag{3} \end{aligned}

Also, we have to state that $\mathbb{E}[\frac{\mathcal{Z}_{p}}{\mathcal{Z}_{q}}] = 1$, since \begin{aligned} E_{q}[w] &=\int_{D} w(x) q(x) d x \\ &=\int_{D} \frac{p(x)}{q(x)} q(x) d x \\ &=\int_{D} p(x) d x=1 \end{aligned}

From $(2),(3)$, we can know that $\mathbb{E}[f] \approx \sum_{n=1}^{N} \frac{w^{n}}{\sum_{m=1}^{N} w^{m}} f\left(z^{n}\right), \quad z^{n} \sim q(z) \tag{4}$

which is also called weighted importance sampling.

#### Connections to reinforcement learning

For probability of the rest of the episode, after $S_{t}$, under policy $\pi$, we have $\begin{array}{l}{\operatorname{Pr}\left\{A_{t}, S_{t+1}, A_{t+1}, \ldots, S_{T} | S_{t}, A_{t : T-1} \sim \pi\right\}} \\ {\quad=\pi\left(A_{t} | S_{t}\right) p\left(S_{t+1} | S_{t}, A_{t}\right) \pi\left(A_{t+1} | S_{t+1}\right) \cdots p\left(S_{T} | S_{T-1}, A_{T-1}\right)} \\ {\quad=\prod_{k=t}^{T-1} \pi\left(A_{k} | S_{k}\right) p\left(S_{k+1} | S_{k}, A_{k}\right)}\end{array}$

Thus, the relative probability of the episode under the target and behavior policies (the importance-sampling ratio) is $\rho_{t : T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi\left(A_{k} | S_{k}\right) p\left(S_{k+1} | S_{k}, A_{k}\right)}{\prod_{k=t}^{T-1} b\left(A_{k} | S_{k}\right) p\left(S_{k+1} | S_{k}, A_{k}\right)}=\prod_{k=t}^{T-1} \frac{\pi\left(A_{k} | S_{k}\right)}{b\left(A_{k} | S_{k}\right)}$

Although the trajectory probabilities depend on the MDP’s transition probabilities, which are generally unknown, they appear identically in both the numerator and denominator, and thus cancel. The importance sampling ratio ends up depending only on the two policies and the sequence, not on the MDP!

What a beautiful ratio! That means we only need to multiple a ratio for the sampled returns $G_{t}$ to get a expectd value which is sampled from our target policy $\pi$. That is $\mathbb{E}\left[\rho_{t : T-1} G_{t} | S_{t}=s\right]=v_{\pi}(s)$

Then we can simply scale the returns by the ratios and average the results: $V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t : T(t)-1} G_{t}}{|\mathcal{J}(s)|}$

$\mathcal{J}(s)$ denotes the set of all time steps in which state $s$ is visited, or appear numbers.

An important alternative is weighted importance sampling (from $(4)$), which uses a weighted average, defined as $V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t : T(t)-1} G_{t}}{\sum_{t \in \mathcal{T}(s)} \rho_{t : T(t)-1}}$

If we define $W_{i}=\rho_{t_{i}} : T\left(t_{i}\right)-1$, then $V_{n} \doteq \frac{\sum_{k=1}^{n-1} W_{k} G_{k}}{\sum_{k=1}^{n-1} W_{k}} \tag{5}$ We can use $(5)$ to update $V$!

### Intuition

From above, we can see that we can use a importance sampling ratio to make the data generated by policy $b$ usable by policy $\pi$. But how to understand it intuitively?

In fact, $w_{i}$ can represent the relationships between $\pi$ and $b$ for responding reward. For example, yyxsb and a robot is playing a game, yyxsb is stupid and he collected many data for robot to learn, but the $G$ from episode generated by yyxsb is different from robot policy, but we have $w_{i}$ to brigde them. That’s it.

### Talk is cheap, show me the code

I will show you importance sampling really works through python code.
Let $P(x)=3 e^{-\frac{x^{2}}{2}}+e^{-\frac{(x-4)^{2}}{2}}$ be the distribution we want to sample from. The normalizing constant $Z \approx 10.0261955464$. Let the functions that we want to approximate be $f(x) = x$ and $g(x) = sin(x)$. I will use general importance sampling here.

import numpy as np
import matplotlib.pyplot as plt

P = lambda x: 3 * np.exp(-x*x/2) + np.exp(-(x - 4)**2/2)
Z = 10.0261955464

f_x = lambda x: x
g_x = lambda x: np.sin(x)
true_expected_fx = 10.02686647165
true_expected_gx = -1.15088010640

a, b = -4, 8
uniform_prob = 1./(b - a)

expected_f_x = 0.
expected_g_x = 0.
n_samples = 100000
den = 0.
for i in range(n_samples):
sample = np.random.uniform(a, b)
importance = P(sample) / uniform_prob
den += importance
expected_f_x += importance * f_x(sample)
expected_g_x += importance * g_x(sample)
expected_f_x /= den
expected_g_x /= den
expected_f_x *= Z
expected_g_x *= Z
print('E[f(x)] = %.5f, Error = %.5f' % (expected_f_x, abs(expected_f_x - true_expected_fx)))
print('E[g(x)] = %.5f, Error = %.5f' % (expected_g_x, abs(expected_g_x - true_expected_gx)))


Result:
E[f(x)] = 10.02532, Error = 0.00155 and E[g(x)] = -1.17328, Error = 0.02240

The sampling works!

## Summary

In this post, I’ve talked about why and how importance sampling works, I hope you can understand it.

Welcome to share or comment on this post: