
Causal Inverse Reinforcement Learning for Robust Reward Recovery (GridWorld)

This repository contains the code and experiments for the MSc dissertation:

“Causal Inverse Reinforcement Learning for Robust Reward Recovery”
MSc Data Science, University of Edinburgh (2024–25)

🔗 Interactive Demo: Causal-AIRL Streamlit App

📄 Dissertation PDF: DISSERTATION.pdf

The project studies how to recover rewards that generalize across latent expert styles (unobserved confounders) using causal variants of inverse reinforcement learning in a discrete GridWorld setting.


High‑Level Idea

Inverse Reinforcement Learning (IRL) tries to infer a reward function that explains expert behaviour. Standard IRL assumes a single, consistent policy, but real experts differ in:

  • Risk tolerance and preferences
  • Skill level and habits
  • Contextual knowledge about the environment

These unobserved factors act as latent confounders. If ignored, an IRL algorithm can mistake style for intent, learning a reward that overfits to one expert style and fails to generalize.

Causal-AIRL extends Adversarial IRL by explicitly modelling these confounders and enforcing reward invariance: the learned reward should be stable even when expert style changes.


Methods in This Repository

The code implements four main IRL methods on GridWorld:

  • Ng–Russell IRL – classical feature-based IRL
  • MaxEnt IRL – maximum entropy IRL for stochastic experts
  • AIRL – adversarial IRL with potential‑based shaping
  • Causal-AIRL (novel) – AIRL extended with:
    • A CausalEncoder that infers a latent style variable Z from state–action pairs
    • A CausalDiscriminator whose reward head is conditioned on Z
    • An invariance regulariser that penalises the variance of the reward across samples of Z

All methods are evaluated on a family of GridWorld tasks, with both clean and confounded variants. A minimal sketch of the Causal-AIRL components is shown below.
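
To make the Causal-AIRL components concrete, here is a minimal PyTorch sketch of an encoder that infers Z, a discriminator whose reward head is conditioned on Z, and a variance-based invariance penalty. Only the class names come from this repository; the architectures, the Gaussian reparameterisation, and the exact form of the penalty are illustrative assumptions, not the actual implementation in irl/.

```python
# Minimal sketch of the Causal-AIRL pieces described above.
# Class names mirror the repository; layer sizes and the exact penalty are assumptions.
import torch
import torch.nn as nn

class CausalEncoder(nn.Module):
    """Infers a latent style variable Z from a (state, action) pair (assumed Gaussian)."""
    def __init__(self, obs_dim, act_dim, z_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),  # mean and log-variance of q(Z | s, a)
        )

    def forward(self, s, a):
        mu, log_var = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterised sample
        return z, mu, log_var

class CausalDiscriminator(nn.Module):
    """AIRL-style discriminator whose reward head is conditioned on Z."""
    def __init__(self, obs_dim, act_dim, z_dim=4, hidden=64):
        super().__init__()
        self.reward_head = nn.Sequential(
            nn.Linear(obs_dim + act_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def reward(self, s, a, z):
        return self.reward_head(torch.cat([s, a, z], dim=-1)).squeeze(-1)

def invariance_penalty(disc, encoder, s, a, n_samples=8):
    """Penalise the variance of the reward across resampled Z for the same (s, a)."""
    rewards = torch.stack([disc.reward(s, a, encoder(s, a)[0]) for _ in range(n_samples)])
    return rewards.var(dim=0).mean()
```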


Experimental Design

Experiments are organized around several questions:

  • Baselines: Do classical IRL methods behave as expected on clean GridWorld tasks?
  • Hyperparameters: How sensitive are AIRL and Causal-AIRL to learning rate, entropy regularization, and KL terms?
  • Scenario sweeps: How do discount factor, number of demonstrations, slip probability, and reward shaping affect reward recovery?
  • Confounded setting: What happens when expert demonstrations are generated from mixtures of styles (different values of Z)?
  • Generalization: If trained on one style (e.g. Z = 0), how well does the learned reward support policies for another style (e.g. Z = 1)?
  • Scaling: How do methods trade off wall‑clock time vs. quality as the grid size grows?

Results are logged under results/ (metrics, trajectories, learned rewards) and visualised via scripts in visualisation/.
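
For the confounded and generalization questions, the central metric is cross-style (cross-Z) policy agreement. The repository computes its metrics in experiments/; the snippet below is only an illustrative stand-in for the basic idea, namely the fraction of states on which the greedy policy induced by a reward learned under one style matches the expert policy of the other style.

```python
# Illustrative cross-style policy agreement metric (not the exact code in experiments/).
import numpy as np

def policy_agreement(policy_a: np.ndarray, policy_b: np.ndarray) -> float:
    """Both arguments are arrays of greedy actions indexed by state."""
    return float(np.mean(policy_a == policy_b))

# Example: train on style Z=0, evaluate agreement against the style Z=1 expert.
rng = np.random.default_rng(0)
learned_policy_z0 = rng.integers(0, 4, size=100)   # placeholder greedy actions
expert_policy_z1 = rng.integers(0, 4, size=100)    # placeholder expert actions
print(f"cross-Z agreement: {policy_agreement(learned_policy_z0, expert_policy_z1):.2%}")
```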


Key Findings

Across the experiments, the dissertation reports that:

  • On clean (non‑confounded) tasks, Causal-AIRL matches AIRL, preserving performance when causal structure is not needed.
  • In confounded scenarios, Causal-AIRL achieves substantially higher cross‑style (cross‑Z) policy agreement than AIRL (an improvement of roughly 20–23 percentage points in the main GridWorld setting).
  • The learned reward under Causal-AIRL exhibits lower variance across latent styles (approx. 5× lower reward variance than AIRL on key benchmarks), indicating a more stable, style‑invariant reward signal.
  • These gains are obtained with competitive compute cost, giving a better quality‑vs‑wall‑time tradeoff than baseline AIRL in the tested configurations.

For an interactive view of trajectories, reward heatmaps, and cross‑Z metrics, see the Streamlit demo linked above.


Repository Structure

envs/              # GridWorld environments, including ConfoundedGridWorld
irl/               # Implementations of Ng–Russell, MaxEnt IRL, AIRL, Causal-AIRL
experiments/       # Training and evaluation drivers, metrics computation
visualisation/     # Figure-generation and plotting utilities
configs/           # YAML configs for scenarios, hyperparameters, and confounded setups
scripts/           # Shell scripts orchestrating the main experiment suites
results/           # (Optional) Saved runs: metrics, learned rewards, policies

All data used in the experiments are synthetic, generated by the environments in envs/ and experiment runners in experiments/. No external datasets are required.
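
As a rough picture of how the confounded demonstrations are generated (the real logic lives in envs/ and experiments/; the policies and transitions below are placeholder assumptions for illustration only): a latent style Z is sampled once per episode and shifts the expert's behaviour, while the underlying task reward stays the same.

```python
# Conceptual sketch of confounded demonstration generation: a latent style Z is drawn
# per episode and selects between expert behaviours. Placeholder policies/transitions.
import numpy as np

def generate_confounded_demos(n_episodes: int, horizon: int, n_states: int = 25,
                              n_actions: int = 4, p_style: float = 0.5, seed: int = 0):
    rng = np.random.default_rng(seed)
    # One placeholder deterministic expert policy per style (e.g. cautious vs. direct).
    style_policies = [rng.integers(0, n_actions, size=n_states) for _ in range(2)]
    demos = []
    for _ in range(n_episodes):
        z = int(rng.random() < p_style)      # latent style, unobserved by the learner
        s = int(rng.integers(0, n_states))   # random start state
        traj = []
        for _ in range(horizon):
            a = int(style_policies[z][s])
            traj.append((s, a))
            s = int(rng.integers(0, n_states))  # placeholder transition dynamics
        demos.append({"z": z, "trajectory": traj})
    return demos
```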


Citation & License

If you build on this work, please cite:

Govind Arun Nampoothiri, Causal Inverse Reinforcement Learning for Robust Reward Recovery, MSc Dissertation, University of Edinburgh, 2024–25.

License: All rights reserved — The University of Edinburgh.
