Importance Sampling in Reinforcement Learning
Introduction
In Monte Carlo off-policy learning, we need to use one policy to generate data and then use that data to optimize another policy. How can we realize this fantastic dream? Is it even possible?
In this post, I will show you how to do it with minimal math and a little Python knowledge.
How to understand importance sampling?
To optimize a policy from the data generated by another policy, we need to use importance sampling.
Problem definition
We have a target policy $\pi$ and a known soft behavior policy $b$ (for example, a Gaussian policy). We want to collect data by following policy $b$, and then use that data to perform GPI (generalized policy iteration) to optimize $\pi$.
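As a concrete, purely illustrative example of this setup, the sketch below pairs a greedy target policy $\pi$ with an $\varepsilon$-soft behavior policy $b$ over a small discrete action space; the 5-action space and the names `target_policy` / `behavior_policy` are assumptions made for this example only:

```python
import numpy as np

n_actions = 5                      # assumed small discrete action space
Q = np.random.rand(n_actions)      # some current action-value estimates

def target_policy(a):
    """pi: greedy w.r.t. Q, probability 1 on the argmax action, 0 elsewhere."""
    return 1.0 if a == np.argmax(Q) else 0.0

def behavior_policy(a, eps=0.3):
    """b: epsilon-soft policy, so every action has probability > 0
    and any action pi might take can also be taken by b."""
    greedy = 1.0 if a == np.argmax(Q) else 0.0
    return (1 - eps) * greedy + eps / n_actions

# b is used to generate episodes; pi is the policy we want to evaluate and improve.
```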
Math proof
In the math world, $\pi$ and $b$ are distributions, and sampling from a distribution gives us a result, in this case an episode. In math language, we have a function $f(x)$ that gives the return of an episode $x$, where $x$ is drawn from a distribution $p(x)$. In formulas, that is

$$\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\,p(x)\,dx.$$

We use $p$ to represent $\pi$ and $q$ to represent $b$ in this section, since we are describing the math: $p$ is the target distribution we care about, and $q$ is the distribution we can actually sample from.
General importance sampling
When $q(x) > 0$ wherever $p(x) > 0$, for any $f$ we have

$$\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\,f(x)\right] \approx \frac{1}{N}\sum_{i=1}^{N} \frac{p(x_i)}{q(x_i)}\,f(x_i), \qquad x_i \sim q.$$

The ratios $w(x) = \frac{p(x)}{q(x)}$ are the importance weights. We can use this formula to estimate the expected value under $p$ from samples drawn from $q$. However, this estimator can have very large variance, which makes it unstable, so we have another method.
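As a quick sanity check of this formula (outside RL), here is a minimal sketch, assuming a Gaussian target $p = \mathcal{N}(2, 1)$, a Gaussian proposal $q = \mathcal{N}(0, 2^2)$, and $f(x) = x^2$; these choices are not from the post, they are just convenient because the true answer $\mathbb{E}_p[x^2] = 2^2 + 1^2 = 5$ is known:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

p = lambda x: normal_pdf(x, 2.0, 1.0)   # target density p(x)
q = lambda x: normal_pdf(x, 0.0, 2.0)   # proposal density q(x); q > 0 wherever p > 0
f = lambda x: x**2                      # function whose expectation under p we want

x = rng.normal(0.0, 2.0, size=100000)   # sample from q, not from p
w = p(x) / q(x)                         # importance weights w(x) = p(x) / q(x)
estimate = np.mean(w * f(x))            # E_p[f] ~= (1/N) * sum_i w(x_i) f(x_i)

print(estimate)   # should be close to E_p[x^2] = 2^2 + 1^2 = 5
```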
Weighted importance sampling
Here, $\tilde{p}(x)$ means the unnormalized version of $p(x)$, and the constant $Z_p$ is used to normalize $\tilde{p}(x)$; it equals the sum (integral) of $\tilde{p}(x)$ over $x$:

$$p(x) = \frac{\tilde{p}(x)}{Z_p}, \qquad Z_p = \int \tilde{p}(x)\,dx,$$

and similarly $q(x) = \tilde{q}(x)/Z_q$. Here, what does $Z_p$ mean? It is just a constant, and

$$\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\,\frac{\tilde{p}(x)}{Z_p}\,dx = \frac{Z_q}{Z_p}\,\mathbb{E}_{x \sim q}\!\left[\frac{\tilde{p}(x)}{\tilde{q}(x)}\,f(x)\right].$$

Also, we have to note that $\frac{Z_p}{Z_q} = \mathbb{E}_{x \sim q}\!\left[\frac{\tilde{p}(x)}{\tilde{q}(x)}\right]$, since

$$\mathbb{E}_{x \sim q}\!\left[\frac{\tilde{p}(x)}{\tilde{q}(x)}\right] = \int \frac{\tilde{p}(x)}{\tilde{q}(x)}\,q(x)\,dx = \frac{1}{Z_q}\int \tilde{p}(x)\,dx = \frac{Z_p}{Z_q}.$$

From the two equations above, we can conclude that, with $\tilde{w}(x) = \tilde{p}(x)/\tilde{q}(x)$ and samples $x_i \sim q$,

$$\mathbb{E}_{x \sim p}[f(x)] = \frac{\mathbb{E}_{x \sim q}[\tilde{w}(x)\,f(x)]}{\mathbb{E}_{x \sim q}[\tilde{w}(x)]} \approx \frac{\sum_{i=1}^{N} \tilde{w}(x_i)\,f(x_i)}{\sum_{i=1}^{N} \tilde{w}(x_i)},$$

which is also called weighted importance sampling.
Connections to reinforcement learning
For the probability of the rest of the episode after $S_t$, that is, of the trajectory $A_t, S_{t+1}, A_{t+1}, \ldots, S_T$, under policy $\pi$, we have

$$\Pr\{A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t,\; A_{t:T-1} \sim \pi\} = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\,p(S_{k+1} \mid S_k, A_k),$$

where $p(s' \mid s, a)$ is the MDP's state-transition probability. Thus, the relative probability of the episode under the target and behavior policies (the importance-sampling ratio) is

$$\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\,p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\,p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
Although the trajectory probabilities depend on the MDP’s transition probabilities, which are generally unknown, they appear identically in both the numerator and denominator, and thus cancel. The importance sampling ratio ends up depending only on the two policies and the sequence, not on the MDP!
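In code, the ratio is just a product of per-step probability ratios along the recorded episode. A minimal sketch, assuming the episode is stored as (state, action, reward) tuples and that `pi_prob` and `b_prob` are hypothetical functions returning each policy's action probabilities:

```python
import numpy as np

def importance_ratios(episode, pi_prob, b_prob):
    """Return rho_{t:T-1} for every time step t of a recorded episode.

    episode : list of (state, action, reward) tuples generated by b
    """
    per_step = np.array([pi_prob(a, s) / b_prob(a, s) for (s, a, r) in episode])
    # rho_{t:T-1} is the product of per-step ratios from t to the end,
    # i.e. a reversed cumulative product.
    return np.cumprod(per_step[::-1])[::-1]
```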
What a beautiful ratio! It means we only need to multiply the sampled returns by this ratio to turn their expectation into an expectation under our target policy $\pi$. That is,

$$\mathbb{E}\!\left[\rho_{t:T-1} G_t \mid S_t = s\right] = v_\pi(s).$$

Then we can simply scale the returns by the ratios and average the results:

$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\,G_t}{|\mathcal{T}(s)|},$$

where $\mathcal{T}(s)$ denotes the set of all time steps at which state $s$ is visited, $|\mathcal{T}(s)|$ is the number of such visits, and $T(t)$ is the termination time of the episode containing step $t$. This is ordinary importance sampling.
An important alternative is weighted importance sampling, which uses a weighted average, defined as

$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\,G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$

(or zero when the denominator is zero). If we define $W_t = \rho_{t:T(t)-1}$, then we can use $W_t$ to weight each sampled return and update our value estimates toward the target policy $\pi$!
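To make this update concrete, here is a minimal sketch of every-visit off-policy MC prediction with weighted importance sampling, written in the standard incremental style with a running weight W and a cumulative weight sum C. The episode format and the action-probability functions `pi_prob` and `b_prob` are assumptions of this example, not part of the original post.

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, pi_prob, b_prob, gamma=1.0):
    """Estimate V_pi(s) from episodes generated by the behavior policy b.

    episodes : iterable of episodes, each a list of (state, action, reward)
    pi_prob(a, s), b_prob(a, s) : action probabilities under pi and b (assumed given)
    """
    V = defaultdict(float)  # value estimates
    C = defaultdict(float)  # cumulative sum of importance weights per state

    for episode in episodes:
        G = 0.0  # return, built backwards from the end of the episode
        W = 1.0  # importance weight rho_{t:T-1}, also built backwards
        for (s, a, r) in reversed(episode):
            G = gamma * G + r
            W *= pi_prob(a, s) / b_prob(a, s)
            if W == 0.0:                     # pi would never take this action,
                break                        # so earlier steps get zero weight too
            C[s] += W
            V[s] += (W / C[s]) * (G - V[s])  # incremental weighted average
    return V
```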
Intuition
From the above, we can see that the importance-sampling ratio lets us make data generated by policy $b$ usable for learning about policy $\pi$. But how can we understand this intuitively?
In fact, the ratio $\rho$ captures how $\pi$ and $b$ relate with respect to the rewards they would collect. For example, suppose yyxsb and a robot are playing a game. yyxsb plays poorly, but he has collected a lot of data for the robot to learn from. The returns $G_t$ in the episodes generated by yyxsb reflect yyxsb's policy, not the robot's policy, so we have to bridge the two, and the importance-sampling ratio is exactly that bridge. That's it.
Talk is cheap, show me the code
I will show you that importance sampling really works through Python code.
Let $\tilde{p}(x) = 3e^{-x^2/2} + e^{-(x-4)^2/2}$ be the (unnormalized) distribution we want to sample from, with normalizing constant $Z \approx 10.026$ (the integral of $\tilde{p}$ over the interval used below). Let the functions whose expectations we want to approximate be $f(x) = x$ and $g(x) = \sin(x)$; the "true" targets below are the integrals $\int_{-4}^{8} f(x)\,\tilde{p}(x)\,dx$ and $\int_{-4}^{8} g(x)\,\tilde{p}(x)\,dx$, i.e. $Z$ times the expectations under the normalized $p$. The proposal $q$ is a uniform distribution on $[-4, 8]$. Since $\tilde{p}$ is unnormalized, the code normalizes by the sum of the importance weights (the weighted form derived above) and then multiplies by the known $Z$.
import numpy as np

# Unnormalized target density: p~(x) = 3*exp(-x^2/2) + exp(-(x-4)^2/2)
P = lambda x: 3 * np.exp(-x * x / 2) + np.exp(-(x - 4)**2 / 2)
Z = 10.0261955464  # normalizing constant: integral of P(x) over [-4, 8]

# Functions whose expectations (times Z) we want to estimate
f_x = lambda x: x
g_x = lambda x: np.sin(x)

# Ground truth: integrals of f(x)*P(x) and g(x)*P(x) over [-4, 8]
true_expected_fx = 10.02686647165
true_expected_gx = -1.15088010640

# Proposal q: uniform distribution on [a, b]
a, b = -4, 8
uniform_prob = 1. / (b - a)

expected_f_x = 0.
expected_g_x = 0.
n_samples = 100000
den = 0.  # running sum of importance weights (the normalizer)

for i in range(n_samples):
    sample = np.random.uniform(a, b)       # draw x ~ q
    importance = P(sample) / uniform_prob  # weight w~(x) = p~(x) / q(x)
    den += importance
    expected_f_x += importance * f_x(sample)
    expected_g_x += importance * g_x(sample)

# Self-normalized (weighted) estimates of E_p[f] and E_p[g], rescaled by Z
expected_f_x /= den
expected_g_x /= den
expected_f_x *= Z
expected_g_x *= Z

print('E[f(x)] = %.5f, Error = %.5f' % (expected_f_x, abs(expected_f_x - true_expected_fx)))
print('E[g(x)] = %.5f, Error = %.5f' % (expected_g_x, abs(expected_g_x - true_expected_gx)))
Result:
E[f(x)] = 10.02532, Error = 0.00155
E[g(x)] = -1.17328, Error = 0.02240
The sampling works!
Summary
In this post, I've talked about why and how importance sampling works. I hope it helps you understand it.
References
- Weighted Importance Sampling Techniques for Monte Carlo Radiosity
- MC learning
- Deep Reinforcement Learning and Control
- Code: The basic importance sampling algorithm