
Fix NormalizeReward extra return_rms updates during vector env autoreset #1526

Merged
pseudo-rnd-thoughts merged 2 commits into Farama-Foundation:main from JonahFSD:fix/issue-1451 on Feb 19, 2026

Conversation

@JonahFSD (Contributor) commented on Feb 8, 2026

Fixes #1451

The vector NormalizeReward wrapper was counting phantom autoreset steps from SyncVectorEnv as real steps, causing return_rms to receive extra updates that diverged from per-env NormalizeReward behavior.

When SyncVectorEnv autoresets a sub-env (AutoresetMode.NEXT_STEP), it returns reward=0, terminated=False for that sub-env on the following step. The vector NormalizeReward processed this as a normal step — decaying accumulated_reward by gamma and calling return_rms.update() with the stale value.

The fix tracks which sub-envs terminated on the previous step and skips accumulation and RMS updates for those sub-envs on the autoreset step. Also adds a reset() override to clear tracking state.
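
For orientation, the skip logic described above can be sketched roughly as follows. This is an illustrative sketch, not the merged code: the attribute names (prev_done, accumulated_reward), the minimal RunningMeanStd, and the exact treatment of truncation are assumptions.

```python
import numpy as np


class RunningMeanStd:
    """Minimal running mean/variance tracker (parallel batch update)."""

    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, batch):
        b_mean, b_var, b_count = np.mean(batch), np.var(batch), batch.size
        delta = b_mean - self.mean
        total = self.count + b_count
        self.mean += delta * b_count / total
        m_a, m_b = self.var * self.count, b_var * b_count
        self.var = (m_a + m_b + delta**2 * self.count * b_count / total) / total
        self.count = total


class NormalizeRewardSketch:
    """Sketch of a vector reward-normalization wrapper that skips each
    sub-env's NEXT_STEP autoreset step (reward == 0, terminated == False)."""

    def __init__(self, env, gamma=0.99, epsilon=1e-8):
        self.env = env
        self.gamma, self.epsilon = gamma, epsilon
        self.return_rms = RunningMeanStd()  # one RMS shared by all sub-envs
        self.accumulated_reward = np.zeros(env.num_envs, dtype=np.float32)
        self.prev_done = np.zeros(env.num_envs, dtype=bool)  # assumed name

    def step(self, actions):
        obs, reward, terminated, truncated, info = self.env.step(actions)

        # Sub-envs that ended on the previous step are being autoreset now;
        # exclude them from return accumulation and from the RMS update.
        real_step = ~self.prev_done
        self.accumulated_reward[real_step] = (
            self.accumulated_reward[real_step] * self.gamma + reward[real_step]
        )
        if real_step.any():
            self.return_rms.update(self.accumulated_reward[real_step])

        # Returns restart after termination; remember who gets skipped next step.
        self.accumulated_reward[terminated] = 0.0
        self.prev_done = np.logical_or(terminated, truncated)

        normalized = reward / np.sqrt(self.return_rms.var + self.epsilon)
        return obs, normalized, terminated, truncated, info

    def reset(self, **kwargs):
        # Clear tracking state so a fresh rollout starts with no skips.
        self.prev_done[:] = False
        self.accumulated_reward[:] = 0.0
        return self.env.reset(**kwargs)
```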

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Test changes

  • Added test_count_against_wrapper, verifying that return_rms.count matches between the vector and per-env NormalizeReward wrappers (the exact discrepancy reported in #1451); a rough sketch of the check follows this list
  • Updated the expected std in test_functionality (fewer RMS updates change the normalization statistics)
  • Widened test_against_wrapper rtol from 0.01 to 0.1; the old tolerance only held because the extra updates coincidentally reduced the variance divergence between the vector and single-env wrappers
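
Roughly, the count check compares the vector wrapper against per-env wrappers driven through the same vector loop. The environment, seed, step count, and variable names below are illustrative assumptions; the merged test's setup may differ.

```python
import gymnasium as gym
from gymnasium.vector import SyncVectorEnv
from gymnasium.wrappers import NormalizeReward
from gymnasium.wrappers.vector import NormalizeReward as VectorNormalizeReward

n_steps = 200

# Vector wrapper applied on top of the vector env ...
vec_env = VectorNormalizeReward(SyncVectorEnv([lambda: gym.make("CartPole-v1")]))
# ... versus the per-env wrapper applied inside the vector env.
per_env = SyncVectorEnv([lambda: NormalizeReward(gym.make("CartPole-v1"))])

vec_env.reset(seed=123)
per_env.reset(seed=123)

for _ in range(n_steps):
    # Identical seeds and identical actions keep the episode boundaries aligned.
    actions = vec_env.action_space.sample()
    vec_env.step(actions)
    per_env.step(actions)

# With the fix, the vector wrapper no longer updates return_rms on the
# autoreset steps, so its update count matches the per-env wrapper's.
assert vec_env.return_rms.count == per_env.envs[0].return_rms.count
```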

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@pseudo-rnd-thoughts (Member) left a comment

Thanks for the PR, sorry it's taken me a couple of days to get to this.
Does this mean that the vector and standard normalize reward wrappers are equivalent?
If so, can we add explicit testing for this?

for _ in range(n_steps):
    vec_env.step(vec_env.action_space.sample())

assert vec_env.return_rms.count == per_env.envs[0].return_rms.count


Shouldn't the vec_env and per_env return_rms be equivalent now?

@JonahFSD (Contributor, Author) replied:

Updated it; the test now asserts .count, .mean, and .var.

@JonahFSD (Contributor, Author) commented, quoting the review above:

> Thanks for the PR, sorry it's taken me a couple of days to get to this. Does this mean that the vector and standard normalize reward wrappers are equivalent? If so, can we add explicit testing for this?

Yes, they're equivalent now when comparing a single sub-env. The test now checks all three fields on return_rms: count, mean, and var. There's a tiny float difference (~1e-9), since the vector wrapper accumulates in float32 versus float64 in the per-env wrapper. Worth noting that with multiple sub-envs the vector wrapper shares one RunningMeanStd across all of them, so the stats will differ from independent per-env wrappers; that's intentional, not a bug.
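
For illustration, continuing the hypothetical comparison setup sketched in the description above (the tolerance value here is an assumption, not the merged test's):

```python
import numpy as np

# count should match exactly; mean/var only up to the float32-vs-float64
# accumulation difference mentioned above.
assert vec_env.return_rms.count == per_env.envs[0].return_rms.count
np.testing.assert_allclose(vec_env.return_rms.mean, per_env.envs[0].return_rms.mean, rtol=1e-6)
np.testing.assert_allclose(vec_env.return_rms.var, per_env.envs[0].return_rms.var, rtol=1e-6)
```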

@pseudo-rnd-thoughts (Member) commented:

Perfect, thanks for making the update @JonahFSD.
All agreed about the limitations of float32 vs float64 and of testing equivalence with only one environment.

@pseudo-rnd-thoughts (Member) left a comment

LGTM

@pseudo-rnd-thoughts merged commit 08b7cb7 into Farama-Foundation:main on Feb 19, 2026
12 checks passed

Development

Successfully merging this pull request may close these issues:

[Bug Report] NormalizeReward count error during autoreset in environments vectors (#1451)