Fix NormalizeReward extra return_rms updates during vector env autoreset #1526
Conversation
pseudo-rnd-thoughts
left a comment
Thanks for the PR, sorry it's taken me a couple of days to get to this.
Does this mean that the vector and standard normalize reward wrappers are equivalent?
If so, can we add explicit testing for this?
for _ in range(n_steps):
    vec_env.step(vec_env.action_space.sample())

assert vec_env.return_rms.count == per_env.envs[0].return_rms.count
Shouldn't the vec_env and per_env return_rms be equivalent now?
Updated it, now it asserts `.count`, `.mean`, and `.var`.
Yes, they're equivalent now when comparing a single sub-env. The test now checks all three fields on `return_rms`: `count`, `mean`, and `var`. There's a tiny float difference (~1e-9) since the vector wrapper accumulates in float32 vs float64 in the per-env wrapper. Worth noting that with multiple sub-envs the vector wrapper shares one RunningMeanStd across all of them, so the stats will differ from independent per-env wrappers. That's intentional, though, not a bug.
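For illustration, the extended check could look like the following, continuing from the snippet quoted above; the `rtol` value is an assumption chosen to absorb that float32/float64 gap, not necessarily what the test uses.

```python
import numpy as np

single_rms = per_env.envs[0].return_rms  # per-env statistics, as in the snippet above
assert vec_env.return_rms.count == single_rms.count
# Loose tolerance: the vector wrapper accumulates in float32, the per-env wrapper in float64.
np.testing.assert_allclose(vec_env.return_rms.mean, single_rms.mean, rtol=1e-5)
np.testing.assert_allclose(vec_env.return_rms.var, single_rms.var, rtol=1e-5)
```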
Perfect, thanks for making the update @JonahFSD
Fixes #1451
The vector `NormalizeReward` wrapper was counting phantom autoreset steps from `SyncVectorEnv` as real steps, causing `return_rms` to receive extra updates that diverged from the per-env `NormalizeReward` behavior.

When `SyncVectorEnv` autoresets a sub-env (`AutoresetMode.NEXT_STEP`), it returns `reward=0`, `terminated=False` for that sub-env on the following step. The vector `NormalizeReward` processed this as a normal step, decaying `accumulated_reward` by `gamma` and calling `return_rms.update()` with the stale value.

The fix tracks which sub-envs terminated on the previous step and skips accumulation and RMS updates for those sub-envs on the autoreset step. It also adds a `reset()` override to clear the tracking state.
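As an illustration of that idea, here is a minimal sketch of the skip logic. It is not the PR's actual diff: the class name, the `prev_done` tracking field, the simplified normalization, and the `RunningMeanStd` import path are all assumptions.

```python
import numpy as np
from gymnasium.vector import VectorWrapper
from gymnasium.wrappers.utils import RunningMeanStd  # assumed import path; may vary by version


class NormalizeRewardSketch(VectorWrapper):
    """Sketch only: skip return_rms updates on phantom autoreset steps."""

    def __init__(self, env, gamma: float = 0.99, epsilon: float = 1e-8):
        super().__init__(env)
        self.gamma = gamma
        self.epsilon = epsilon
        self.return_rms = RunningMeanStd(shape=())
        self.accumulated_reward = np.zeros(self.env.num_envs)
        # Hypothetical tracking state: which sub-envs ended an episode on the previous step.
        self.prev_done = np.zeros(self.env.num_envs, dtype=bool)

    def reset(self, **kwargs):
        # Clear the tracking state so a fresh rollout starts without stale flags.
        self.accumulated_reward = np.zeros(self.env.num_envs)
        self.prev_done = np.zeros(self.env.num_envs, dtype=bool)
        return self.env.reset(**kwargs)

    def step(self, actions):
        obs, reward, terminated, truncated, info = self.env.step(actions)
        # With AutoresetMode.NEXT_STEP, sub-envs flagged in prev_done return a
        # placeholder transition (reward=0, terminated=False) on this step.
        real = ~self.prev_done
        self.accumulated_reward = np.where(
            real,
            self.accumulated_reward * self.gamma * (1 - terminated) + reward,
            0.0,  # the new episode's return starts from zero
        )
        if real.any():
            # Only real transitions feed the running statistics.
            self.return_rms.update(self.accumulated_reward[real])
        self.prev_done = np.logical_or(terminated, truncated)
        normalized = reward / np.sqrt(self.return_rms.var + self.epsilon)
        return obs, normalized, terminated, truncated, info
```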
Type of change
Test changes
- Added `test_count_against_wrapper` verifying `return_rms.count` matches between vector and per-env NormalizeReward (the exact discrepancy from [Bug Report] NormalizeReward count error during autoreset in environments vectors #1451); a rough sketch of this check follows the list
- `test_functionality` expected std updated (fewer RMS updates change the normalization statistics)
- `test_against_wrapper` rtol widened from 0.01 to 0.1; the old tolerance only held because the extra updates coincidentally reduced the vector/single variance divergence
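A rough sketch of what that count comparison could look like (illustrative only, not the exact test added in the PR; the environment id, step count, and seed are assumptions):

```python
import gymnasium as gym
from gymnasium.wrappers import NormalizeReward
from gymnasium.wrappers.vector import NormalizeReward as VectorNormalizeReward


def test_count_against_wrapper_sketch(n_steps: int = 200, seed: int = 42):
    # Vector-level wrapper around a plain sub-env...
    vec_env = VectorNormalizeReward(
        gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1")])
    )
    # ...versus the per-env wrapper applied to the sub-env before vectorization.
    per_env = gym.vector.SyncVectorEnv(
        [lambda: NormalizeReward(gym.make("CartPole-v1"))]
    )

    vec_env.reset(seed=seed)
    per_env.reset(seed=seed)
    vec_env.action_space.seed(seed)

    for _ in range(n_steps):
        # Identical actions keep episode boundaries, and therefore autoresets, aligned.
        actions = vec_env.action_space.sample()
        vec_env.step(actions)
        per_env.step(actions)

    # Without the phantom autoreset updates, both counts should agree.
    assert vec_env.return_rms.count == per_env.envs[0].return_rms.count
```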
Checklist:
- `pre-commit` checks with `pre-commit run --all-files` (see `CONTRIBUTING.md` instructions to set it up)