book/14_policy-gradient.qmd
Lines changed: 2 additions & 4 deletions
@@ -376,8 +376,6 @@ $$
REINFORCE with baseline remains a Monte Carlo method: it requires full-episode returns and performs updates only after the episode ends. It still provides unbiased estimates of the policy gradient. The improvement is purely variance reduction, which can significantly accelerate learning. Empirically, adding a learned baseline commonly leads to much faster convergence, especially when episode returns vary widely.
-
These ideas capture the essence of REINFORCE with baseline: the gradient direction is preserved, variance is reduced, and learning becomes more efficient.
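A tiny R illustration (not from the book) of the variance-reduction effect described above: when the expected return varies strongly with the state, returns with a state-dependent baseline subtracted have a much smaller spread than the raw returns.

```r
# Simulated returns whose expected value depends on the state.
set.seed(42)
n <- 10000
state_value <- runif(n, min = 0, max = 10)   # v(S_t): expected return for the sampled state
G <- state_value + rnorm(n, sd = 1)          # sampled return around v(S_t)

c(var_raw = var(G), var_baselined = var(G - state_value))
# The centred term G - v(S_t) has far smaller variance; as the text explains,
# subtracting a state-dependent baseline leaves the policy-gradient expectation unchanged.
```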
-
**Key formulas**
$$
@@ -485,7 +483,7 @@ $$
To construct a Monte Carlo-style policy gradient, one replaces $q_\pi$ with sampled differential returns. The resulting gradient estimate is:
As in the episodic setting, baselines can be used to reduce variance. The baseline must not depend on the action, and a natural choice is the differential value function $v_\pi(s)$. Subtracting it leads to:
$$
@@ -518,7 +516,7 @@ $$
$$
so that gradients with respect to the parameters can be calculated and used in policy gradient updates.
-A common parameterisation is the univariate Gaussian or Normal distribution:
+A common parametrisation is the univariate Gaussian or Normal distribution:
-This update is unbiased but typically has very high variance. To reduce this variance, a baseline function $b(s)$ can be subtracted from the return. This does not change the expected value of the gradient but can greatly improve learning stability.
-The key idea is to replace the return $G_t$ with the advantage-like term $G_t - b(S_t)$. The new update rule becomes:
+- The REINFORCE algorithm uses the full MC return, which often has very high variance.
+- To reduce this variance and improve stability, a baseline $b(s)$ can be subtracted from the return.
+- Replace the return $G_t$ with $G_t - b(S_t)$. The new update rule becomes:
-The baseline may depend on the state but must not depend on the action. If it did depend on the action, it would bias the estimate of the gradient. The reason it does not introduce bias is:
+- The baseline may depend on the state but must not depend on the action; hence it does not bias the gradient estimate (see the short derivation after this list).
-Thus, subtracting $b(s)$ alters only variance, not the expectation.
+- Subtracting $b(s)$ alters only variance, not the expectation.
+- An effective choice is the approximate state value $b(s) = \hat v(s, w)$, with updates
+$$w \leftarrow w + \alpha_w\,(G_t - \hat v(S_t,w))\,\nabla \hat v(S_t,w).$$
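One standard way to see why a state-dependent baseline does not bias the gradient (with $\mu(s)$ the on-policy state distribution, as in the policy gradient theorem):

$$\mathbb{E}\big[b(S_t)\,\nabla \ln \pi(A_t \mid S_t,\theta)\big] = \sum_s \mu(s)\, b(s) \sum_a \pi(a \mid s,\theta)\,\nabla \ln \pi(a \mid s,\theta) = \sum_s \mu(s)\, b(s)\,\nabla \sum_a \pi(a \mid s,\theta) = \sum_s \mu(s)\, b(s)\,\nabla 1 = 0.$$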
-A natural and effective choice for the baseline is the state-value function:
-$$
-b(s) = \hat v(s, w),
-$$
-where the parameter vector $w$ is learned from data. The value-function parameters are updated by a Monte Carlo regression method:
-$$
-w \leftarrow w + \alpha_w\,(G_t - \hat v(S_t,w))\,\nabla \hat v(S_t,w).
-$$
-This produces a *critic* that approximates how good each state is on average. The policy update (the *actor*) then adjusts the probabilities in proportion to how much better or worse the return was compared to what is expected for the state:
-REINFORCE with baseline remains a Monte Carlo method: it requires full-episode returns and performs updates only after the episode ends. It still provides unbiased estimates of the policy gradient. The improvement is purely variance reduction, which can significantly accelerate learning. Empirically, adding a learned baseline commonly leads to much faster convergence, especially when episode returns vary widely.
+---
-These ideas capture the essence of REINFORCE with baseline: the gradient direction is preserved, variance is reduced, and learning becomes more efficient.
-leads to a combined learning rule for actor and critic:
+- This produces a *critic* that approximates how good each state is on average.
+- The policy update (the *actor*) then adjusts the probabilities in proportion to how much better or worse the return was compared to what is expected for the state.
+- Still a Monte Carlo method.
+- Still provides unbiased estimates of the policy gradient.
+- The improvement is purely variance reduction, which accelerates learning.
+- Empirically, this leads to much faster convergence.
+- We now have both learning rules for actor and critic:
$$
\begin{aligned}
w &\leftarrow w + \alpha_w\,(G_t - \hat v(S_t,w))\,\nabla \hat v(S_t,w), \\
\theta &\leftarrow \theta + \alpha_\theta\,(G_t - \hat v(S_t,w))\,\nabla \ln \pi(A_t|S_t,\theta).
\end{aligned}
$$
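As a companion to the pseudo code referenced just below, a minimal R sketch of these two updates on a toy one-step problem; the environment, one-hot features, and softmax (linear-preference) policy are illustrative assumptions, not the book's code.

```r
set.seed(1)
n_states <- 2; n_actions <- 2
x <- function(s) { v <- numeric(n_states); v[s] <- 1; v }   # one-hot state features

theta <- matrix(0, nrow = n_actions, ncol = n_states)       # actor parameters
w     <- numeric(n_states)                                  # critic (baseline) weights
alpha_theta <- 0.05; alpha_w <- 0.1

pi_probs <- function(s) { h <- theta %*% x(s); p <- exp(h - max(h)); as.vector(p / sum(p)) }

# Illustrative dynamics: every episode is a single step; action 1 pays off more often.
reward <- function(a) if (a == 1) rbinom(1, 1, 0.8) else rbinom(1, 1, 0.2)

for (episode in 1:2000) {
  s <- sample(1:n_states, 1)
  p <- pi_probs(s)
  a <- sample(1:n_actions, 1, prob = p)
  G <- reward(a)                                            # Monte Carlo return (one step here)

  delta <- G - sum(w * x(s))                                # G_t - v_hat(S_t, w)
  w <- w + alpha_w * delta * x(s)                           # critic update
  grad_ln_pi <- -outer(p, x(s)); grad_ln_pi[a, ] <- grad_ln_pi[a, ] + x(s)  # softmax score
  theta <- theta + alpha_theta * delta * grad_ln_pi         # actor update
}
round(pi_probs(1), 2)                                       # action 1 should dominate
```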
Pseudo code for the REINFORCE with baseline algorithm is given in @fig-reinforce-baseline-alg.
+## REINFORCE with Baseline algorithm
-
```{r fig-reinforce-baseline-alg, echo=FALSE, fig.cap="REINFORCE with baseline: Monte Carlo Policy Gradient Control (episodic) [@Sutton18].", out.width="90%"}
+- The policy gradient theorem still holds (now with an equality sign): $$\nabla r(\pi) = \sum_s \mu(s) \sum_a q_\pi(s,a)\,\nabla \pi(a|s,\theta).$$
+- Value functions are as before, except that the return is defined as the *differential return*: $$G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + \cdots.$$
+
+---
+
+## Policy Gradient for Continuing Problems (2)
+
+- The gradient with a baseline then becomes $$\nabla r(\pi) \approx \mathbb{E}\left[(G_t-b(s))\,\nabla \ln \pi(A_t|S_t,\theta)\right].$$
+- If we use TD and let the baseline be the state value, then $$G_t - b(s) \approx \delta_t = (R_{t+1} - \hat r + \hat v(S_{t+1})) - \hat v(S_t).$$
+- The average reward $r(\pi)$ must now also be estimated during learning (as $\hat r$); see the sketch after this list.
+- Policy gradient methods extend naturally to the continuing case, but the formulation shifts from episodic returns to average reward and differential values.
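A hedged R sketch of one continuing-case actor-critic step built around this TD error; `feats` (state features) and `grad_ln_pi` (the policy's score function) are placeholder helpers, not part of the book's code.

```r
# One update step for the continuing (average-reward) actor-critic:
# delta_t = (R_{t+1} - r_hat + v_hat(S_{t+1})) - v_hat(S_t), with a linear v_hat.
continuing_ac_update <- function(s, a, r, s_next, params, feats, grad_ln_pi) {
  x_s <- feats(s); x_next <- feats(s_next)
  v_s <- sum(params$w * x_s); v_next <- sum(params$w * x_next)
  delta <- r - params$r_hat + v_next - v_s                       # TD error with average reward
  params$r_hat <- params$r_hat + params$alpha_r * delta          # estimate of r(pi)
  params$w     <- params$w     + params$alpha_w * delta * x_s    # critic: differential value
  params$theta <- params$theta + params$alpha_t * delta * grad_ln_pi(s, a, params$theta)  # actor
  params
}
```

Each call returns the updated parameter list; in a full agent this would be applied at every time step, since the continuing setting has no episode boundaries.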
## Policy Parameterisation for Continuous Actions (1)
+
+- Consider *continuous action spaces*, meaning actions are real-valued (or vector-valued).
+- Policies are *parameterised probability density functions* over continuous actions: $$\pi(a \mid s, \theta) = \text{a differentiable density over } a.$$
+- A common parametrisation is the univariate Gaussian or Normal distribution:
where both the mean $\mu(s)$ and standard deviation $\sigma(s)$ may depend on the state and are parametrised by separate sets of weights $\theta = (\theta_\mu, \theta_\sigma)$.
+- The mean and variance can be parametrised as $$\mu(s, \theta) = {\theta_\mu}^\top \textbf x_\mu(s), \qquad \sigma^2(s, \theta) = \exp({\theta_\sigma}^\top \textbf x_\sigma(s)).$$
+
+---
+
+## Policy Parameterisation for Continuous Actions (2)
+
+- The eligibility vector $\nabla \ln \pi(A_t|S_t, \theta_t)$ becomes:
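A small R sketch (not the book's code) that computes this eligibility vector under the variance parametrisation given above, $\sigma^2(s,\theta) = \exp({\theta_\sigma}^\top \textbf x_\sigma(s))$, which is why a factor $\tfrac{1}{2}$ appears in the $\theta_\sigma$ component; the $\theta_\mu$ component is checked against a numerical gradient, and the helper names are illustrative.

```r
# log pi(a | s, theta) for a Gaussian policy with linear mean and log-linear variance.
log_pi <- function(a, x_mu, x_sigma, theta_mu, theta_sigma) {
  mu     <- sum(theta_mu * x_mu)
  sigma2 <- exp(sum(theta_sigma * x_sigma))
  dnorm(a, mean = mu, sd = sqrt(sigma2), log = TRUE)
}

# Analytic eligibility vector (gradient of log pi) for both parameter blocks.
eligibility <- function(a, x_mu, x_sigma, theta_mu, theta_sigma) {
  mu     <- sum(theta_mu * x_mu)
  sigma2 <- exp(sum(theta_sigma * x_sigma))
  list(
    d_theta_mu    = (a - mu) / sigma2 * x_mu,                    # gradient wrt theta_mu
    d_theta_sigma = 0.5 * ((a - mu)^2 / sigma2 - 1) * x_sigma    # gradient wrt theta_sigma
  )
}

# Numerical check of the theta_mu component at a random point.
set.seed(2)
x_mu <- rnorm(3); x_sigma <- rnorm(2)
theta_mu <- rnorm(3); theta_sigma <- rnorm(2); a <- 0.7
num_grad <- sapply(seq_along(theta_mu), function(i) {
  h <- 1e-6; e <- replace(numeric(3), i, h)
  (log_pi(a, x_mu, x_sigma, theta_mu + e, theta_sigma) -
   log_pi(a, x_mu, x_sigma, theta_mu - e, theta_sigma)) / (2 * h)
})
all.equal(num_grad, eligibility(a, x_mu, x_sigma, theta_mu, theta_sigma)$d_theta_mu,
          tolerance = 1e-4)
```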