This repository contains the code and experiments for the MSc dissertation:
“Causal Inverse Reinforcement Learning for Robust Reward Recovery”
MSc Data Science, University of Edinburgh (2024–25)
🔗 Interactive Demo: Causal-AIRL Streamlit App
📄 Dissertation PDF: DISSERTATION.pdf
The project studies how to recover rewards that generalize across latent expert styles (unobserved confounders) using causal variants of inverse reinforcement learning in a discrete GridWorld setting.
Inverse Reinforcement Learning (IRL) aims to infer a reward function that explains expert behaviour. Standard IRL assumes that all demonstrations come from a single, consistent expert policy, but real experts differ in:
- Risk tolerance and preferences
- Skill level and habits
- Contextual knowledge about the environment
These unobserved factors act as latent confounders. If ignored, an IRL algorithm can mistake style for intent, learning a reward that overfits to one expert style and fails to generalize.
Causal-AIRL extends Adversarial IRL by explicitly modelling these confounders and enforcing reward invariance: the learned reward should be stable even when expert style changes.
The code implements four main IRL methods on GridWorld:
- Ng–Russell IRL – classical feature-based IRL
- MaxEnt IRL – maximum entropy IRL for stochastic experts
- AIRL – adversarial IRL with potential‑based shaping
- Causal-AIRL (novel) – AIRL extended with:
  - A CausalEncoder that infers a latent style variable Z from state–action pairs
  - A CausalDiscriminator whose reward head is conditioned on Z
  - An invariance regulariser that penalises the variance of the reward across samples of Z
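
The snippet below is a minimal PyTorch sketch of these three components, intended only as orientation: the module names mirror the text, but the Gaussian latent, the layer sizes, and the exact form of the variance penalty are illustrative assumptions rather than the implementation in irl/.

```python
# Minimal sketch of the Causal-AIRL components (illustrative, not the repo code).
import torch
import torch.nn as nn


class CausalEncoder(nn.Module):
    """Infers a Gaussian latent style Z from a (state, action) pair."""

    def __init__(self, state_dim, action_dim, z_dim=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, z_dim)
        self.log_std = nn.Linear(64, z_dim)

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        mu, log_std = self.mu(h), self.log_std(h).clamp(-5.0, 2.0)
        z = mu + log_std.exp() * torch.randn_like(mu)  # reparameterised sample
        return z, mu, log_std


class CausalDiscriminator(nn.Module):
    """AIRL-style discriminator whose reward head is conditioned on Z."""

    def __init__(self, state_dim, action_dim, z_dim=4):
        super().__init__()
        self.reward_head = nn.Sequential(
            nn.Linear(state_dim + action_dim + z_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def reward(self, state, action, z):
        return self.reward_head(torch.cat([state, action, z], dim=-1))


def invariance_penalty(disc, state, action, mu, log_std, n_samples=8):
    """Penalise the variance of the reward across resampled latents Z."""
    rewards = []
    for _ in range(n_samples):
        z = mu + log_std.exp() * torch.randn_like(mu)
        rewards.append(disc.reward(state, action, z))
    rewards = torch.stack(rewards, dim=0)  # (n_samples, batch, 1)
    return rewards.var(dim=0).mean()       # average per-pair reward variance
```

In the full adversarial objective, a penalty of this kind would be added to the discriminator loss with a weighting coefficient, alongside the usual AIRL terms.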
All methods are evaluated on a family of GridWorld tasks, with both clean and confounded variants.
Experiments are organized around several questions:
- Baselines: Do classical IRL methods behave as expected on clean GridWorld tasks?
- Hyperparameters: How sensitive are AIRL and Causal-AIRL to learning rate, entropy regularization, and KL terms?
- Scenario sweeps: How do discount factor, number of demonstrations, slip probability, and reward shaping affect reward recovery?
- Confounded setting: What happens when expert demonstrations are generated from mixtures of styles (different values of Z)?
- Generalization: If trained on one style (e.g. Z = 0), how well does the learned reward support policies for another style (e.g. Z = 1)?
- Scaling: How do methods trade off wall‑clock time vs. quality as the grid size grows?
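
As a concrete illustration of the generalization question, the sketch below shows one plausible way to score cross-Z policy agreement on a tabular GridWorld: recover a greedy policy from the reward learned under one style and measure how often it matches the optimal policy of the other style. The value-iteration helper and the state-only reward representation are assumptions for illustration, not the exact metric code in experiments/.

```python
# Illustrative cross-style (cross-Z) policy agreement metric for tabular tasks.
import numpy as np


def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P: (S, A, S) transition tensor, R: (S,) state reward. Returns greedy policy."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R[:, None] + gamma * (P @ V)   # Q(s, a) = R(s) + gamma * E[V(s')]
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            break
        V = V_new
    return Q.argmax(axis=1)                # greedy action per state


def cross_z_agreement(P, learned_reward_z0, true_reward_z1, gamma=0.95):
    """Fraction of states where the policy induced by the reward learned under
    style Z=0 matches the optimal policy under style Z=1's true reward."""
    pi_learned = value_iteration(P, learned_reward_z0, gamma)
    pi_true = value_iteration(P, true_reward_z1, gamma)
    return float((pi_learned == pi_true).mean())
```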
Results are logged under results/ (metrics, trajectories, learned rewards) and visualised via scripts in visualisation/.
Across the experiments, the dissertation reports that:
- On clean (non‑confounded) tasks, Causal-AIRL matches AIRL, preserving performance when causal structure is not needed.
- In confounded scenarios, Causal-AIRL achieves substantially higher cross-style (cross-Z) policy agreement than AIRL, an improvement of roughly 20–23 percentage points in the main GridWorld setting.
- The reward learned by Causal-AIRL varies less across latent styles (approximately 5× lower reward variance than AIRL on the key benchmarks), indicating a more stable, style-invariant reward signal.
- These gains are obtained with competitive compute cost, giving a better quality‑vs‑wall‑time tradeoff than baseline AIRL in the tested configurations.
For an interactive view of trajectories, reward heatmaps, and cross‑Z metrics, see the Streamlit demo linked above.
- envs/ – GridWorld environments, including ConfoundedGridWorld
- irl/ – Implementations of Ng–Russell IRL, MaxEnt IRL, AIRL, and Causal-AIRL
- experiments/ – Training and evaluation drivers, metrics computation
- visualisation/ – Figure-generation and plotting utilities
- configs/ – YAML configs for scenarios, hyperparameters, and confounded setups
- scripts/ – Shell scripts orchestrating the main experiment suites
- results/ – (Optional) Saved runs: metrics, learned rewards, policies
All data used in the experiments are synthetic, generated by the environments in envs/ and experiment runners in experiments/. No external datasets are required.
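
For orientation, here is a hedged sketch of how a mixture-of-styles demonstration set could be produced: a latent style Z is drawn per episode and selects which expert policy generates the trajectory, while Z itself is withheld from the learner. The reset()/step() interface and the function name are assumptions for illustration and do not describe the actual API in envs/ or experiments/.

```python
# Illustrative generation of confounded demonstrations (assumed interfaces).
import numpy as np


def collect_confounded_demos(env, expert_policies, n_episodes=100,
                             p_z=(0.5, 0.5), max_steps=50, seed=0):
    """Roll out a mixture of expert styles; the style Z stays latent."""
    rng = np.random.default_rng(seed)
    demos = []
    for _ in range(n_episodes):
        # Draw the latent style for this episode and pick the matching expert.
        z = rng.choice(len(expert_policies), p=p_z)
        policy = expert_policies[z]
        state = env.reset()
        trajectory = []
        for _ in range(max_steps):
            action = policy(state)
            next_state, _, done, _ = env.step(action)
            trajectory.append((state, action))
            state = next_state
            if done:
                break
        demos.append(trajectory)  # Z itself is deliberately not recorded
    return demos
```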
If you build on this work, please cite:
Govind Arun Nampoothiri, Causal Inverse Reinforcement Learning for Robust Reward Recovery, MSc Dissertation, University of Edinburgh, 2024–25.
License: All rights reserved — The University of Edinburgh.