Bellman Equation V(s) Proof

Why we need to understand it?

Bellman equation is a key point  for understanding reinforcement learning, however, I didn’t find any materials that write the proof for it. In this post, I will show you how to prove it easily.

Simple Proof

For all sS:s \in \mathcal{S}:

vπ(s)Eπ[GtSt=s] \begin{aligned}v_{\pi}(s)\doteq\mathbb{E}_{\pi}\left[G_{t}|S_{t}=s\right] \end{aligned}

=Eπ[Rt+1+γGt+1St=s](1) \begin{aligned} \tag{1} =\mathbb{E}_{\pi}\left[R_{t+1}+\gamma G_{t+1} | S_{t}=s\right] \end{aligned}

equals to aπ(as)srp(s,rs,a)[r+γEπ[Gt+1St+1=s]](2)\begin{aligned} \sum_{a} \pi(a | s) \sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r | s, a\right)\left[r+\gamma \mathbb{E}{\pi}\left[G_{t+1} | S_{t+1}=s^{\prime}\right]\right] \tag{2} \end{aligned}

 =aπ(as)s,rp(s,rs,a)[r+γvπ(s)]   \begin{aligned}  =\sum_{a} \pi(a | s) \sum_{s^{\prime}, r} p\left(s^{\prime}, r | s, a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right]   \end{aligned}

That is: vπ(s)=aAπ(as)(Rsa+γsSPssavπ(s)) v_{\pi}(s)=\sum_{a \in \mathcal{A}} \pi(a | s)\left(\mathcal{R}_{s}^{a}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}}^{a} v_{\pi}\left(s^{\prime}\right)\right)

We need to prove (1)(1) to (2)(2)

In this example, I just improve this aligned:

Eπ[Rt+1Sk=s]\begin{aligned} E_{\pi}\left[R_{t+1} | S_{k}=s\right] \end{aligned}


For general situation, E=xXxp(x)\begin{aligned} E=\sum_{x \in X} x \cdot p(x)\end{aligned}

For this question, the target is to get Eπ[Rt+1St=s]\begin{aligned} \mathbb{E}_{\pi}\left[R_{t+1} | S_{t}=s\right]\end{aligned}, thus p(x)p(x) means p(rs)p(r|s), and then:

Eπ=rRrp(rs) \begin{aligned} E_{\pi}=\sum_{r \in R} r \cdot p(r | s) \end{aligned}

which equals to 

aπ(as)rRp(rs,a)r(3) \begin{aligned} \sum_{a} \pi(a | s) \cdot \sum_{r \in R} p(r | s, a) \cdot r \tag{3} \end{aligned}

Why? p(r,a)=p(ra)p(a)p(r,a)=p(r|a)\cdot p(a), thus p(r,as)=p(ra,s)p(as)p(r,a|s) = p(r|a,s)\cdot p(a|s). And p(rs)=ap(r,as)=ap(as)p(ra,s)p(r|s) = \sum_{a}p(r,a|s) = \sum_{a}p(a|s)\cdot p(r|a,s), in bellman function, p(as)=π(as)p(a|s) = \pi(a|s)


p(rs,a)=sSp(s,rs,a) \begin{aligned} p(r | s, a)=\sum_{s^{\prime} \in S} p\left(s^{\prime}, r | s, a\right) \end{aligned}

Thus we have (3)=aπ(as)rRsSp(s,rs,a)r \begin{aligned} (3)=\sum_{a} \pi(a | s) \cdot \sum_{r \in R} \cdot \sum_{s^{\prime} \in S} p\left(s^{\prime}, r | s, a\right) \cdot r \end{aligned}

Welcome to share or comment on this post:

Table of Contents