
Add MetricsManager for custom metric logging #596

Open
kevinzakka wants to merge 1 commit into main from feat/metrics-manager

Conversation

@kevinzakka (Collaborator) commented Feb 7, 2026

Summary

Adds a MetricsManager so users can log custom per-step metrics during training without hacking reward functions or adding zero-weight reward terms. Closes #584.

  • New: MetricsManager, MetricsTermCfg, NullMetricsManager in managers/metrics_manager.py
  • Integration: wired into ManagerBasedRlEnv config, step(), and _reset_idx()
  • Terms use the same callable signature as rewards (env, **params) → Tensor[num_envs]
  • No weight, no dt scaling — metrics are observational, not reward signals
  • Episode values are true per-step averages (sum / step_count), so a metric in [0,1] stays in [0,1] in wandb
  • Empty config means zero overhead (NullMetricsManager)

Example usage

  from dataclasses import dataclass

  from mjlab.managers import MetricsTermCfg
  # SceneEntityCfg and ManagerBasedRlEnvCfg also come from mjlab; imports elided here.

  def joint_velocity_magnitude(env, asset_cfg):
    """L1 norm of joint velocities."""
    return env.scene[asset_cfg.name].data.joint_vel.abs().sum(dim=-1)  # (num_envs,)

  @dataclass(kw_only=True)
  class MyEnvCfg(ManagerBasedRlEnvCfg):
    metrics = {
      "joint_vel_mag": MetricsTermCfg(
        func=joint_velocity_magnitude,
        params={"asset_cfg": SceneEntityCfg(name="robot")},
      ),
    }

On episode reset, Episode_Metrics/joint_vel_mag appears in extras["log"] and flows to wandb/tensorboard automatically.
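To make the accumulation concrete, here is a minimal sketch (not the actual implementation; the class and attribute names are made up) of the behavior described above — raw per-step sums divided by step count at reset, then logged under Episode_Metrics/{term_name}:

  import torch

  class MetricsAccumulatorSketch:
    """Per-env running sums, averaged and cleared on reset (illustrative only)."""

    def __init__(self, term_names, num_envs, device):
      self._sums = {name: torch.zeros(num_envs, device=device) for name in term_names}
      self._steps = torch.zeros(num_envs, device=device)

    def step(self, values):
      # values: dict mapping term name -> (num_envs,) tensor; no weight, no dt scaling.
      for name, value in values.items():
        self._sums[name] += value
      self._steps += 1

    def reset(self, env_ids, extras_log):
      # True per-step average (sum / step_count), so a metric in [0, 1] stays in [0, 1].
      steps = self._steps[env_ids].clamp(min=1)
      for name, sums in self._sums.items():
        extras_log[f"Episode_Metrics/{name}"] = (sums[env_ids] / steps).mean().item()
        sums[env_ids] = 0.0
      self._steps[env_ids] = 0.0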

Test plan

  • uv run pytest tests/test_metrics_manager.py — 6 targeted tests
  • uv run pytest tests/test_rewards.py — no regression
  • uv run ty check / uv run pyright — clean
  • uv run ruff check && uv run ruff format — clean

🤖 Generated with Claude Code

@brentyi (Collaborator) left a comment


+100, seems super useful!

I'm wondering if any of these things are possible in a simple/not-overengineered way:

Does it make sense to allow customization of how metrics are "reduced" between steps and before logging? In the extreme case it'd be nice, for example, to be able to specify logging for std of metrics, stds of per-episode means, histograms, histograms of per-episode stds, etc.

Can we implement any of the existing things that are logged (eg rewards) as default terms in the metrics manager?

@kevinzakka (Collaborator, Author)

Thanks for the review @brentyi!

Does it make sense to allow customization of how metrics are "reduced" between steps and before logging?

The rsl_rl logger only supports scalars. Everything in extras["log"] goes through torch.mean() then add_scalar() (logger.py:171). So histograms/distributions aren't possible without changing the logger. We could try submitting a PR to rsl_rl. In the meantime, for std or other reductions, users can just add a second metric term that computes it directly (e.g. joint_vel_std alongside joint_vel_mean). Both log as scalars and it works today with no extra machinery.
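For concreteness, a hypothetical pair of terms along those lines (the function names and asset_cfg wiring are illustrative, not from this PR):

  def joint_vel_mean(env, asset_cfg):
    return env.scene[asset_cfg.name].data.joint_vel.abs().mean(dim=-1)  # (num_envs,)

  def joint_vel_std(env, asset_cfg):
    return env.scene[asset_cfg.name].data.joint_vel.std(dim=-1)  # (num_envs,)

  metrics = {
    "joint_vel_mean": MetricsTermCfg(func=joint_vel_mean, params={"asset_cfg": SceneEntityCfg(name="robot")}),
    "joint_vel_std": MetricsTermCfg(func=joint_vel_std, params={"asset_cfg": SceneEntityCfg(name="robot")}),
  }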

Can we implement any of the existing things that are logged (eg rewards) as default terms in the metrics manager?

In principle, yes. Note, however, that rewards use dt-scaled sums divided by max_episode_length_s, while metrics use raw sums divided by step_count. We'd have to add machinery to the metrics manager to support this.
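Roughly, the two episode reductions side by side (variable names are illustrative, not the actual code):

  def reward_episode_value(weighted_dt_scaled_sum, max_episode_length_s):
    # Rewards (existing): each step accumulates term * weight * dt, then the sum is
    # normalized by the configured episode length in seconds.
    return weighted_dt_scaled_sum / max_episode_length_s

  def metric_episode_value(raw_sum, step_count):
    # Metrics (this PR): raw per-step sum divided by the steps actually taken.
    return raw_sum / step_count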

@brentyi (Collaborator) commented Feb 7, 2026

In the meantime, for std or other reductions, users can just add a second metric term that computes it directly (e.g. joint_vel_std alongside joint_vel_mean). Both log as scalars and it works today with no extra machinery.

Makes sense!

To check my understanding: we wouldn't be able to reuse intermediates, right? And we could compute std across episodes within a single timestep, but not across timesteps within an episode? These are a bit annoying, but fine.

Note however that rewards use dt-scaled sums divided by max_episode_length_s while metrics use raw sums divided by step_count.

Makes sense. It seems kind of nice to consolidate logic but I don't feel strongly about this.

kevinzakka marked this pull request as ready for review on February 7, 2026 19:31
Adds a MetricsManager so users can log custom per-step metrics without
hacking reward functions or adding zero-weight reward terms. Metrics
terms use the same callable signature as rewards (env, **params) but
have no weight, no dt scaling, and no normalization by episode length.
Episode values are true per-step averages (sum / step_count) logged
under "Episode_Metrics/{term_name}".

Closes #584

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kevinzakka force-pushed the feat/metrics-manager branch from b158504 to ac34d2c on February 7, 2026 23:34


Development

Successfully merging this pull request may close these issues.

Approach to inject and inspect new metrics during training

3 participants