
Causal Inverse Reinforcement Learning for Robust Reward Recovery (GridWorld)

This repository contains the code and experiments for the MSc dissertation:

“Causal Inverse Reinforcement Learning for Robust Reward Recovery”
MSc Data Science, University of Edinburgh (2024–25)

🔗 Interactive Demo: Causal-AIRL Streamlit App

📄 Dissertation PDF: DISSERTATION.pdf

The project studies how to recover rewards that generalize across latent expert styles (unobserved confounders) using causal variants of inverse reinforcement learning in a discrete GridWorld setting.


High‑Level Idea

Inverse Reinforcement Learning (IRL) tries to infer a reward function that explains expert behaviour. Standard IRL assumes a single, consistent policy, but real experts differ in:

  • Risk tolerance and preferences
  • Skill level and habits
  • Contextual knowledge about the environment

These unobserved factors act as latent confounders. If ignored, an IRL algorithm can mistake style for intent, learning a reward that overfits to one expert style and fails to generalize.

Causal-AIRL extends Adversarial IRL by explicitly modelling these confounders and enforcing reward invariance: the learned reward should be stable even when expert style changes.


Methods in This Repository

The code implements four main IRL methods on GridWorld:

  • Ng–Russell IRL – classical feature-based IRL
  • MaxEnt IRL – maximum entropy IRL for stochastic experts
  • AIRL – adversarial IRL with potential‑based shaping
  • Causal-AIRL (novel) – AIRL extended with:
    • A CausalEncoder that infers a latent style variable Z from state–action pairs
    • A CausalDiscriminator whose reward head is conditioned on Z
    • An invariance regulariser that penalises the variance of the reward across samples of Z

All methods are evaluated on a family of GridWorld tasks, with both clean and confounded variants. A minimal sketch of the Causal-AIRL components is shown below.
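
To make the Causal-AIRL components concrete, here is a minimal PyTorch sketch of an encoder that infers Z, a discriminator whose reward head is conditioned on Z, and a variance-based invariance penalty. Only the class names come from this repository; the architectures, the Gaussian reparameterisation, and the exact form of the penalty are illustrative assumptions, not the actual implementation in irl/.

```python
# Minimal sketch of the Causal-AIRL pieces described above.
# Class names mirror the repository; layer sizes and the exact penalty are assumptions.
import torch
import torch.nn as nn

class CausalEncoder(nn.Module):
    """Infers a latent style variable Z from a (state, action) pair (assumed Gaussian)."""
    def __init__(self, obs_dim, act_dim, z_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),  # mean and log-variance of q(Z | s, a)
        )

    def forward(self, s, a):
        mu, log_var = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterised sample
        return z, mu, log_var

class CausalDiscriminator(nn.Module):
    """AIRL-style discriminator whose reward head is conditioned on Z."""
    def __init__(self, obs_dim, act_dim, z_dim=4, hidden=64):
        super().__init__()
        self.reward_head = nn.Sequential(
            nn.Linear(obs_dim + act_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def reward(self, s, a, z):
        return self.reward_head(torch.cat([s, a, z], dim=-1)).squeeze(-1)

def invariance_penalty(disc, encoder, s, a, n_samples=8):
    """Penalise the variance of the reward across resampled Z for the same (s, a)."""
    rewards = torch.stack([disc.reward(s, a, encoder(s, a)[0]) for _ in range(n_samples)])
    return rewards.var(dim=0).mean()
```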


Experimental Design

Experiments are organized around several questions:

  • Baselines: Do classical IRL methods behave as expected on clean GridWorld tasks?
  • Hyperparameters: How sensitive are AIRL and Causal-AIRL to learning rate, entropy regularization, and KL terms?
  • Scenario sweeps: How do discount factor, number of demonstrations, slip probability, and reward shaping affect reward recovery?
  • Confounded setting: What happens when expert demonstrations are generated from mixtures of styles (different values of Z)?
  • Generalization: If trained on one style (e.g. Z = 0), how well does the learned reward support policies for another style (e.g. Z = 1)?
  • Scaling: How do methods trade off wall‑clock time vs. quality as the grid size grows?

Results are logged under results/ (metrics, trajectories, learned rewards) and visualised via scripts in visualisation/.
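
For the confounded and generalization questions, the central metric is cross-style (cross-Z) policy agreement. The repository computes its metrics in experiments/; the snippet below is only an illustrative stand-in for the basic idea, namely the fraction of states on which the greedy policy induced by a reward learned under one style matches the expert policy of the other style.

```python
# Illustrative cross-style policy agreement metric (not the exact code in experiments/).
import numpy as np

def policy_agreement(policy_a: np.ndarray, policy_b: np.ndarray) -> float:
    """Both arguments are arrays of greedy actions indexed by state."""
    return float(np.mean(policy_a == policy_b))

# Example: train on style Z=0, evaluate agreement against the style Z=1 expert.
rng = np.random.default_rng(0)
learned_policy_z0 = rng.integers(0, 4, size=100)   # placeholder greedy actions
expert_policy_z1 = rng.integers(0, 4, size=100)    # placeholder expert actions
print(f"cross-Z agreement: {policy_agreement(learned_policy_z0, expert_policy_z1):.2%}")
```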


Key Findings

Across the experiments, the dissertation reports that:

  • On clean (non‑confounded) tasks, Causal-AIRL matches AIRL, preserving performance when causal structure is not needed.
  • In confounded scenarios, Causal-AIRL achieves substantially higher cross‑style (cross‑Z) policy agreement than AIRL (an improvement of roughly 20–23 percentage points in the main GridWorld setting).
  • The learned reward under Causal-AIRL exhibits lower variance across latent styles (approx. 5× lower reward variance than AIRL on key benchmarks), indicating a more stable, style‑invariant reward signal.
  • These gains are obtained with competitive compute cost, giving a better quality‑vs‑wall‑time tradeoff than baseline AIRL in the tested configurations.

For an interactive view of trajectories, reward heatmaps, and cross‑Z metrics, see the Streamlit demo linked above.


Repository Structure

envs/              # GridWorld environments, including ConfoundedGridWorld
irl/               # Implementations of Ng–Russell, MaxEnt IRL, AIRL, Causal-AIRL
experiments/       # Training and evaluation drivers, metrics computation
visualisation/     # Figure-generation and plotting utilities
configs/           # YAML configs for scenarios, hyperparameters, and confounded setups
scripts/           # Shell scripts orchestrating the main experiment suites
results/           # (Optional) Saved runs: metrics, learned rewards, policies

All data used in the experiments are synthetic, generated by the environments in envs/ and experiment runners in experiments/. No external datasets are required.
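
As a rough picture of how the confounded demonstrations are generated (the real logic lives in envs/ and experiments/; the policies and transitions below are placeholder assumptions for illustration only): a latent style Z is sampled once per episode and shifts the expert's behaviour, while the underlying task reward stays the same.

```python
# Conceptual sketch of confounded demonstration generation: a latent style Z is drawn
# per episode and selects between expert behaviours. Placeholder policies/transitions.
import numpy as np

def generate_confounded_demos(n_episodes: int, horizon: int, n_states: int = 25,
                              n_actions: int = 4, p_style: float = 0.5, seed: int = 0):
    rng = np.random.default_rng(seed)
    # One placeholder deterministic expert policy per style (e.g. cautious vs. direct).
    style_policies = [rng.integers(0, n_actions, size=n_states) for _ in range(2)]
    demos = []
    for _ in range(n_episodes):
        z = int(rng.random() < p_style)      # latent style, unobserved by the learner
        s = int(rng.integers(0, n_states))   # random start state
        traj = []
        for _ in range(horizon):
            a = int(style_policies[z][s])
            traj.append((s, a))
            s = int(rng.integers(0, n_states))  # placeholder transition dynamics
        demos.append({"z": z, "trajectory": traj})
    return demos
```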


Citation & License

If you build on this work, please cite:

Govind Arun Nampoothiri, Causal Inverse Reinforcement Learning for Robust Reward Recovery, MSc Dissertation, University of Edinburgh, 2024–25.

License: All rights reserved — The University of Edinburgh.
