
Commit 6629a45

Update slides

1 parent 271a560 commit 6629a45

5 files changed: +236 -87 lines changed

book/14_policy-gradient.qmd

Lines changed: 2 additions & 4 deletions
@@ -376,8 +376,6 @@ $$

REINFORCE with baseline remains a Monte Carlo method: it requires full-episode returns and performs updates only after the episode ends. It still provides unbiased estimates of the policy gradient. The improvement is purely variance reduction, which can significantly accelerate learning. Empirically, adding a learned baseline commonly leads to much faster convergence, especially when episode returns vary widely.

-These ideas capture the essence of REINFORCE with baseline: the gradient direction is preserved, variance is reduced, and learning becomes more efficient.
-
**Key formulas**

$$
@@ -485,7 +483,7 @@ $$

To construct a Monte Carlo-style policy gradient, one replaces $q_\pi$ with sampled differential returns. The resulting gradient estimate is:
$$
-\nabla r(\pi) \approx \mathbb{E}\left[(G_t)\,\nabla \ln \pi(A_t|S_t,\theta)\right].
+\nabla r(\pi) \approx \mathbb{E}\left[G_t\,\nabla \ln \pi(A_t|S_t,\theta)\right].
$$
As in the episodic setting, baselines can be used to reduce variance. The baseline must not depend on the action, and a natural choice is the differential value function $v_\pi(s)$. Subtracting it leads to:
$$
@@ -518,7 +516,7 @@ $$
$$
so that gradients concerning the parameters can be calculated and employed in policy gradient updates.

-A common parameterisation is the univariate Gaussian or Normal distribution:
+A common parametrisation is the univariate Gaussian or Normal distribution:
$$
\pi(a \mid s, \theta) = \frac{1}{\sqrt{2\pi\sigma^2(s, \theta)}} \exp\left( -\frac{(a - \mu(s, \theta))^2}{2\sigma^2(s, \theta)} \right),
$$
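
The hunk above writes the policy as a Gaussian density. As a quick illustration (not part of the commit), the closed form can be checked against R's built-in `dnorm()` for hypothetical values of $\mu(s,\theta)$ and $\sigma(s,\theta)$:

```{r}
# Sketch (not part of the commit): check the Gaussian policy density above
# against R's dnorm() for hypothetical values of mu(s, theta) and sigma(s, theta).
mu    <- 0.3   # assumed mu(s, theta)
sigma <- 0.8   # assumed sigma(s, theta)
a     <- 0.7   # an action
pi_a  <- 1 / sqrt(2 * pi * sigma^2) * exp(-(a - mu)^2 / (2 * sigma^2))
all.equal(pi_a, dnorm(a, mean = mu, sd = sigma))  # TRUE
```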

slides/14_policy-gradient-slides.Rmd

Lines changed: 119 additions & 40 deletions
@@ -150,69 +150,148 @@ Note an discount rate have been added here (we didn't include it in the policy g

---

-## REINFORCE with Baseline
+## REINFORCE with Baseline (1)

-The original REINFORCE algorithm updates the policy parameters using the full Monte Carlo return:
-$$
-\theta_{t+1} = \theta_t + \alpha\,G_t\,\nabla \ln \pi(A_t|S_t,\theta_t).
-$$
-This update is unbiased but typically has very high variance. To reduce this variance, a baseline function $b(s)$ can be subtracted from the return. This does not change the expected value of the gradient but can greatly improve learning stability.
-
-The key idea is to replace the return $G_t$ with the advantage-like term $G_t - b(S_t)$. The new update rule becomes:
+- The REINFORCE algorithm uses the full MC return $G_t$ (often very high variance).
+- To reduce variance and improve stability, a baseline $b(s)$ can be subtracted from the return.
+- Replace the return $G_t$ with $G_t - b(S_t)$. The new update rule becomes:
$$
\theta_{t+1}
= \theta_t + \alpha\,(G_t - b(S_t))\,\nabla \ln \pi(A_t|S_t,\theta_t).
$$
-The baseline may depend on the state but must not depend on the action. If it did depend on the action, it would bias the estimate of the gradient. The reason it does not introduce bias is:
+- The baseline may depend on the state but must not depend on the action. Hence
$$
\sum_a b(s)\,\nabla \pi(a|s,\theta) = b(s)\,\nabla \sum_a \pi(a|s,\theta) = b(s)\,\nabla 1 = 0.
$$
-Thus, subtracting $b(s)$ alters only variance, not the expectation.
+- Subtracting $b(s)$ alters only variance, not the expectation (checked numerically in the sketch after this slide).
+- An effective choice is the approximate state value $b(s) = \hat v(s, w)$, with update
+$$w \leftarrow w + \alpha_w\,(G_t - \hat v(S_t,w))\,\nabla \hat v(S_t,w).$$

-A natural and effective choice for the baseline is the state-value function:
-$$
-b(s) = \hat v(s, w),
-$$
-where the parameter vector $w$ is learned from data. The value-function parameters are updated by a Monte Carlo regression method:
-$$
-w \leftarrow w + \alpha_w\,(G_t - \hat v(S_t,w))\,\nabla \hat v(S_t,w).
-$$
-This produces a *critic* that approximates how good each state is on average. The policy update (the *actor*) then adjusts the probabilities in proportion to how much better or worse the return was compared to what is expected for the state:
-$$
-\theta \leftarrow \theta + \alpha_\theta\,(G_t - \hat v(S_t,w))\,\nabla \ln \pi(A_t|S_t,\theta).
-$$
-
-REINFORCE with baseline remains a Monte Carlo method: it requires full-episode returns and performs updates only after the episode ends. It still provides unbiased estimates of the policy gradient. The improvement is purely variance reduction, which can significantly accelerate learning. Empirically, adding a learned baseline commonly leads to much faster convergence, especially when episode returns vary widely.
+---
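
As a side note (not part of the commit), the zero-bias identity on the slide above can be checked numerically. The sketch below assumes a softmax policy over three actions with one preference parameter per action and approximates $\nabla \pi$ by central differences:

```{r}
# Illustration only (not part of the commit): for a softmax policy over three
# actions with one preference parameter per action, check numerically that
# sum_a b(s) * d pi(a|s, theta) / d theta = 0 for any baseline value b(s).
theta <- c(0.5, -0.2, 1.0)                    # hypothetical action preferences
pi_fn <- function(th) exp(th) / sum(exp(th))  # softmax policy pi(.|s, theta)
b     <- 4.2                                  # an arbitrary baseline value

grad_pi <- function(th, eps = 1e-6) {         # numerical Jacobian d pi / d theta
  sapply(seq_along(th), function(j) {
    e <- replace(numeric(length(th)), j, eps)
    (pi_fn(th + e) - pi_fn(th - e)) / (2 * eps)
  })
}
colSums(b * grad_pi(theta))  # approx. c(0, 0, 0): the baseline adds no bias
```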

-These ideas capture the essence of REINFORCE with baseline: the gradient direction is preserved, variance is reduced, and learning becomes more efficient.
+## REINFORCE with Baseline (2)

-**Key formulas**

-$$
-\nabla J(\theta) \propto \mathbb{E}_\pi[(G_t - b(S_t))\,\nabla \ln \pi(A_t|S_t,\theta)],
-$$
-and the baseline restriction:
-$$
-b(s)\text{ must not depend on }a.
-$$
-Using a value-function baseline:
-$$
-b(s) = \hat v(s,w),
-$$
-leads to a combined learning rule for actor and critic:
+- This produces a *critic* that approximates how good each state is on average.
+- The policy update (the *actor*) then adjusts the probabilities in proportion to how much better or worse the return was compared to what is expected for the state.
+- Still a Monte Carlo method.
+- Still provides unbiased estimates of the policy gradient.
+- The improvement is purely variance reduction, which accelerates learning.
+- Empirically, this leads to much faster convergence.
+- We now have both learning rules for actor and critic (sketched in code after this slide):
$$
\begin{aligned}
w &\leftarrow w + \alpha_w\,(G_t - \hat v(S_t,w))\,\nabla \hat v(S_t,w), \\
\theta &\leftarrow \theta + \alpha_\theta\,(G_t - \hat v(S_t,w))\,\nabla \ln \pi(A_t|S_t,\theta).
\end{aligned}
$$
+---
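
The two learning rules above can be written out directly. A minimal R sketch (not part of the commit), assuming a linear critic $\hat v(s,w) = w^\top x(s)$, a linear softmax policy over a small discrete action set, and no discounting:

```{r}
# Minimal sketch (not part of the commit): one REINFORCE-with-baseline update
# for a single time step, assuming a linear critic v_hat(s, w) = w' x(s) and a
# linear softmax policy with one weight row per action (gamma = 1).
reinforce_baseline_step <- function(theta, w, x_s, a, G, alpha_theta, alpha_w) {
  prefs <- as.vector(theta %*% x_s)                            # action preferences
  pi_s  <- exp(prefs - max(prefs)); pi_s <- pi_s / sum(pi_s)   # softmax pi(.|s)
  delta <- G - sum(w * x_s)                                    # G_t - v_hat(S_t, w)
  w     <- w + alpha_w * delta * x_s                           # critic update
  # grad ln pi(a|s): row b gets (1{a = b} - pi(b|s)) * x(s)
  grad_ln_pi      <- -outer(pi_s, x_s)
  grad_ln_pi[a, ] <- grad_ln_pi[a, ] + x_s
  theta <- theta + alpha_theta * delta * grad_ln_pi            # actor update
  list(theta = theta, w = w)
}
```

Looping this over the steps of a stored episode, computing each $G_t$ from the recorded rewards, gives the procedure shown in the algorithm figure on the next slide.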

-Pseudo code for REINFORCE with baseline algorithm is given in @fig-reinforce-baseline-alg.
+## REINFORCE with Baseline algorithm

-```{r fig-reinforce-baseline-alg, echo=FALSE, fig.cap="REINFORCE with baseline: Monte Carlo Policy Gradient Control (episodic) [@Sutton18].", out.width="90%"}
+```{r, echo=FALSE}
knitr::include_graphics("img/1304_REINFORCE_With_Baseline.png")
```

+---
+
+## Actor-Critic Methods
+
+- Actor-critic methods replace the full MC return with a bootstrapped estimate.
+- The policy is the *actor* and the value function is the *critic*. The critic evaluates the state value, and the actor adjusts the policy parameters.
+- Now let the critic use TD updates (faster updates and less variance). The TD error is:
+$$\delta_t = R_{t+1} + \gamma \hat v(S_{t+1}, w) - \hat v(S_t, w).$$
+- The critic update becomes:
+$$w \leftarrow w + \alpha_w \,\delta_t\, \nabla \hat v(S_t, w).$$
+- The actor update becomes (biased but with lower variance):
+$$\theta \leftarrow \theta + \alpha_\theta\,\delta_t\,\nabla \ln \pi(A_t|S_t, \theta).$$
+- Actor-critic methods can be seen as the policy-gradient analogue of SARSA (a one-step update sketch follows this slide).
+
+---
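
A one-step actor-critic update can be sketched the same way (not part of the commit). Here `grad_ln_pi()` stands in for whatever eligibility-vector function the chosen policy parameterisation provides, for example the softmax gradient from the earlier sketch:

```{r}
# Minimal sketch (not part of the commit): one-step actor-critic update with a
# linear critic; grad_ln_pi() is supplied by the policy parameterisation
# (e.g. the softmax gradient computed in the previous sketch).
actor_critic_step <- function(theta, w, x_s, x_s_next, a, reward, done,
                              gamma, alpha_theta, alpha_w, grad_ln_pi) {
  v_next <- if (done) 0 else sum(w * x_s_next)
  delta  <- reward + gamma * v_next - sum(w * x_s)   # TD error delta_t
  w      <- w + alpha_w * delta * x_s                # critic: semi-gradient TD(0)
  theta  <- theta + alpha_theta * delta * grad_ln_pi(theta, x_s, a)  # actor
  list(theta = theta, w = w, delta = delta)
}
```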
+
+## Actor-Critic algorithm
+
+```{r, echo=FALSE, out.width="90%"}
+knitr::include_graphics("img/1305a_One_Step_Actor_Critic.png")
+```
+
+---
+
+## Policy Gradient for Continuing Problems (1)
+
+- New objective *average reward*: $$J(\theta) = r(\pi) = \sum_s \mu(s)\sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)\, r.$$
+- The policy gradient theorem still holds (now with an equals sign): $$\nabla r(\pi) = \sum_s \mu(s) \sum_a q_\pi(s,a)\,\nabla \pi(a|s,\theta).$$
+- Value functions are as before except that the return is defined as the *differential return*: $$G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + \cdots.$$
+
+---
+
+## Policy Gradient for Continuing Problems (2)
+
+- The gradient with a baseline then becomes $$\nabla r(\pi) \approx \mathbb{E}\left[(G_t-b(S_t))\,\nabla \ln \pi(A_t|S_t,\theta)\right].$$
+- If we use TD and let the baseline be the state value, then $$G_t - b(S_t) \approx \delta_t = (R_{t+1} - \hat r + \hat v(S_{t+1})) - \hat v(S_t).$$
+- The average reward $r(\pi)$ must now also be estimated during learning (as $\hat r$).
+- Policy gradient methods extend naturally to the continuing case, but the formulation shifts from episodic returns to average reward and differential values (see the sketch after this slide).
+
+---
---
237+
238+
## Policy Gradient algorithm (continuing case)
239+
240+
```{r, echo=FALSE, out.width="80%"}
241+
knitr::include_graphics("img/1306_actor-critic-cont.png")
242+
```
243+
244+
---
245+
246+
## Policy Parameterisation for Continuous Actions (1)
247+
248+
- Consider *continuous action spaces*, meaning actions are real-valued (or vector-valued).
249+
- Policies are *parameterised probability density functions* over continuous actions $$\pi(a \mid s, \theta) = \text{a differentiable density over } a$$
250+
- A common parametrisation is the univariate Gaussian or Normal distribution:
251+
$$
252+
\pi(a \mid s, \theta) = \frac{1}{\sqrt{2\pi\sigma^2(s, \theta)}} \exp\left( -\frac{(a - \mu(s, \theta))^2}{2\sigma^2(s, \theta)} \right),
253+
$$
254+
where both the mean $\mu(s)$ and standard deviation $\sigma(s)$ may depend on the state and are parametrised by separate sets of weights $\theta = (\theta_\mu, \theta_\sigma)$.
255+
- The mean and variance can be $$\mu(s, \theta) = {\theta_\mu}^\top \textbf x_\mu(s), \qquad \sigma^2(s, \theta) = \exp({\theta_\sigma}^\top \textbf x_\sigma(s)).$$
256+
257+
---
258+
259+
## Policy Parameterisation for Continuous Actions (2)
260+
261+
- The eligibility vector $\nabla \ln \pi(A_t|S_t, \theta_t)$ becomes:
262+
$$\nabla \ln \pi(a|s, \theta)
263+
= \frac{a-\mu(s, \theta_\mu)}{\sigma(s, \theta_\sigma)^2}\, \textbf x(s, \theta_\mu) +
264+
\left(\frac{(a-\mu(s, \theta_\mu))^2}{\sigma(s, \theta_\sigma)^2} - 1\right)
265+
\textbf x(s, \theta_\sigma).$$
266+
- The choice of parametrization has important effects.
267+
- If the variance is too small, exploration collapses; if too large, gradient estimates become noisy.
268+
- Learning both mean and variance enables adaptive exploration: the variance shrinks in well-understood regions and grows where uncertainty is higher.
269+
- Once a differentiable density is available, all previous machinery for policy gradients applies unchanged.
270+
- The policy gradient theorem still holds, as it does not depend on action space cardinality.
271+
- Actor-critic methods remain preferable because they reduce variance.
272+
273+
---
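
The closed-form eligibility components above can be verified against a numerical gradient. A small R sketch (not part of the commit), assuming the linear mean and exponential standard deviation with hypothetical feature vectors `x_mu` and `x_sigma`:

```{r}
# Sketch (not part of the commit): check the closed-form eligibility vector for
# the linear-Gaussian policy against a numerical gradient, using hypothetical
# features x_mu, x_sigma and weights theta_mu, theta_sigma.
x_mu <- c(1, 0.5); x_sigma <- c(1, -0.25)
theta_mu <- c(0.3, -0.2); theta_sigma <- c(0.1, 0.4)
a <- 0.9

log_pi <- function(th_mu, th_sig) {           # ln pi(a | s, theta)
  mu    <- sum(th_mu * x_mu)                  # mu(s, theta)    = theta_mu' x_mu(s)
  sigma <- exp(sum(th_sig * x_sigma))         # sigma(s, theta) = exp(theta_sigma' x_sigma(s))
  dnorm(a, mean = mu, sd = sigma, log = TRUE)
}
num_grad <- function(f, x, eps = 1e-6) {      # central-difference gradient
  sapply(seq_along(x), function(j) {
    e <- replace(numeric(length(x)), j, eps)
    (f(x + e) - f(x - e)) / (2 * eps)
  })
}

mu <- sum(theta_mu * x_mu); sigma <- exp(sum(theta_sigma * x_sigma))
elig_mu    <- (a - mu) / sigma^2 * x_mu               # closed form, mean part
elig_sigma <- ((a - mu)^2 / sigma^2 - 1) * x_sigma    # closed form, std-dev part

all.equal(elig_mu, num_grad(function(t) log_pi(t, theta_sigma), theta_mu),
          tolerance = 1e-4)   # TRUE
all.equal(elig_sigma, num_grad(function(t) log_pi(theta_mu, t), theta_sigma),
          tolerance = 1e-4)   # TRUE
```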
+
+## Mixed Action Spaces
+
+- The action includes both continuous and discrete components: $a = (a^{\text{disc}},\, a^{\text{cont}})$.
+- The policy must represent a joint distribution over this mixed action space.
+- Policy gradient methods handle this naturally as long as the policy is differentiable.
+- A standard and convenient factorisation is:
+$$
+\pi(a \mid s) = \pi(a^{\text{disc}} \mid s)\,\pi(a^{\text{cont}} \mid s, a^{\text{disc}}).
+$$
+- First choose the discrete action component, then choose the continuous component conditioned on the discrete choice (see the sketch after this slide).
+- The log-policy splits naturally: $$\ln \pi(a \mid s) = \ln \pi(a^{\text{disc}} \mid s) + \ln \pi(a^{\text{cont}} \mid s, a^{\text{disc}}),$$ so the gradient splits too:
+$$\nabla_\theta \ln \pi(a \mid s) = \nabla_\theta \ln \pi(a^{\text{disc}} \mid s) + \nabla_\theta \ln \pi(a^{\text{cont}} \mid s, a^{\text{disc}}).$$
+
+---
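
A tiny sketch (not part of the commit) of sampling from such a factorised policy and evaluating its log-probability, assuming a hypothetical two-mode discrete head and a Gaussian continuous head whose mean depends on the chosen mode:

```{r}
# Tiny sketch (not part of the commit): sample from a factorised mixed-action
# policy and evaluate its log-probability. The two-mode discrete head and the
# mode-dependent Gaussian continuous head are hypothetical.
probs_disc <- c(0.7, 0.3)          # pi(a_disc | s), assumed already computed
mu_by_mode <- c(-1, 2)             # mean of the continuous part, per discrete mode
sigma      <- 0.5

a_disc <- sample(1:2, size = 1, prob = probs_disc)          # discrete component
a_cont <- rnorm(1, mean = mu_by_mode[a_disc], sd = sigma)   # continuous component

# ln pi(a|s) = ln pi(a_disc|s) + ln pi(a_cont|s, a_disc)
log_prob <- log(probs_disc[a_disc]) +
  dnorm(a_cont, mean = mu_by_mode[a_disc], sd = sigma, log = TRUE)
log_prob
```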

## Colab
