Synthetic personal-lines insurance portfolio built as a governed digital twin,
with dataset freezing, validation gates, and actuarial realism.
This repository evolves in phases, each adding analytical depth while preserving governance, reproducibility, and auditability.
Designed to mirror how regulated insurance analytics platforms are built internally, rather than how public modelling demos are typically presented.
This project is structured as a multi-phase insurance analytics build, where each phase produces a stable, defensible artefact before moving forward.
The focus of Phase 1 is not modelling —
it is data generation, governance, validation, and auditability.
Before pricing, fraud, forecasting, or scenario analysis can be trusted,
the underlying dataset must be frozen, reproducible, and defensible.
That is what Phase 1 delivers.
In real insurance environments, analytical credibility depends on:
- reproducibility
- traceability
- controlled imperfections
- governance before modelling
Most public analytics projects skip these steps.
This project does not.
Delivered in this repository:
✔ Synthetic personal-lines insurance universe
✔ Policyholders, policies, claims, macro environment
✔ Explicit modelling assumptions (documented in config.py)
✔ Controlled anomaly injection (real-world messiness)
✔ Validation gates (actuarial sanity checks)
✔ Dataset freeze with manifest and cryptographic hashes
✔ Auditable, versioned data artefact
Explicitly not included yet:
- pricing models
- fraud models
- scenario simulators
- dashboards or UI
These are added incrementally in later phases.
This repository treats synthetic data as a governed asset, not a toy dataset.
It includes:
- deterministic generation via fixed random seeds
- hash-based dataset locking
- validation checks aligned to actuarial practice
- anomaly rates that are intentional, rare, and bounded
This mirrors how internal insurance analytics platforms are built.
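As an illustration of the freeze mechanics, the sketch below shows one way to write and verify a hash-based dataset manifest. It assumes the frozen artefacts are CSV files under data/raw/ and that the generation seed is recorded alongside the hashes; the actual logic lives in data_gen/cli.py and may differ in detail.

```python
# Minimal sketch of a hash-based dataset freeze (illustrative only).
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

RAW_DIR = Path("data/raw")                  # assumed location of frozen artefacts
MANIFEST = RAW_DIR / "dataset_manifest.json"

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large extracts hash without loading fully."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def freeze_dataset(seed: int = 42) -> dict:
    """Record one hash per artefact plus the generation seed and a UTC timestamp."""
    manifest = {
        "frozen_at_utc": datetime.now(timezone.utc).isoformat(),
        "random_seed": seed,
        "files": {
            p.name: sha256_of(p)
            for p in sorted(RAW_DIR.glob("*.csv"))   # hypothetical file pattern
        },
    }
    MANIFEST.write_text(json.dumps(manifest, indent=2))
    return manifest

def verify_dataset() -> bool:
    """Re-hash every artefact and compare against the frozen manifest."""
    manifest = json.loads(MANIFEST.read_text())
    return all(
        sha256_of(RAW_DIR / name) == expected
        for name, expected in manifest["files"].items()
    )
```

With this pattern, every later phase can open by re-hashing the inputs and refusing to run if the manifest no longer matches.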
Phase 2 builds pricing context on top of the frozen dataset produced in Phase 1.
No data is regenerated or modified in this phase.
The objective is to answer the questions that pricing and actuarial teams ask before loss ratio modelling or rate changes:
- What is the portfolio made of? (product, channel, coverage composition)
- How is premium distributed? (mean vs median, dispersion, tails, concentration)
- Where does modelling effort matter most financially?
✔ Portfolio mix diagnostics (product × channel × coverage)
✔ Premium dispersion and concentration analysis
✔ Tail contribution (top 1%, 5%, 10% of policies)
✔ Coverage → severity tail validation (P90 / P95 / P99)
✔ Explicit pricing design note (intentional weak risk differentiation)
✔ Leadership framing and portfolio steering implications
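To make the concentration and tail-contribution views concrete, here is a minimal sketch of how the top-1%/5%/10% premium shares and the product × channel mix could be computed. The column names (premium, product, channel) are illustrative assumptions, not the repository's schema.

```python
# Illustrative sketch of premium concentration and portfolio mix diagnostics.
import pandas as pd

def tail_contribution(policies: pd.DataFrame, shares=(0.01, 0.05, 0.10)) -> pd.Series:
    """Share of total written premium carried by the top x% of policies."""
    premium = policies["premium"].sort_values(ascending=False).reset_index(drop=True)
    total = premium.sum()
    out = {}
    for share in shares:
        k = max(1, int(round(share * len(premium))))
        out[f"top_{int(share * 100)}pct"] = premium.iloc[:k].sum() / total
    return pd.Series(out)

def portfolio_mix(policies: pd.DataFrame) -> pd.DataFrame:
    """Premium-weighted mix by product x channel."""
    mix = policies.groupby(["product", "channel"])["premium"].sum()
    return (mix / mix.sum()).rename("premium_share").reset_index()
```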
Phase 3 introduces actuarial loss ratio analysis on the frozen synthetic portfolio, building directly on the pricing context established in Phase 2.
The focus of this phase is not model fitting —
it is profitability diagnosis and decision prioritisation using
earned premium logic and premium-weighted views.
Loss ratios are treated as decision signals, not just summary metrics.
- Where is the portfolio making or losing money?
- Which combinations of product × channel dominate financial risk?
- Are adverse loss ratios driven by frequency, severity, or exposure mix?
- Where would pricing, underwriting, or reinsurance review have the highest impact?
✔ Earned premium–based loss ratio calculations
✔ Premium-weighted aggregation (financial materiality lens)
✔ Product × Channel loss ratio heatmap (executive view)
✔ Clear separation of diagnosis vs modelling
✔ Explicit decision framing for pricing and portfolio steering
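A minimal sketch of the earned-premium loss ratio drill-down is shown below. The pro-rata earning logic and the column names (start_date, end_date, premium, policy_id, incurred_amount) are simplifying assumptions; the notebook's actual implementation may differ.

```python
# Sketch of an earned-premium loss ratio view by product x channel.
import pandas as pd

def earned_premium(policies: pd.DataFrame, as_of: pd.Timestamp) -> pd.Series:
    """Pro-rata earned premium per policy up to the valuation date."""
    start, end = policies["start_date"], policies["end_date"]
    earned_frac = ((as_of - start).dt.days / (end - start).dt.days).clip(0, 1)
    return policies["premium"] * earned_frac

def loss_ratio_heatmap(policies: pd.DataFrame, claims: pd.DataFrame,
                       as_of: pd.Timestamp) -> pd.DataFrame:
    """Premium-weighted loss ratio (incurred / earned) by product x channel."""
    policies = policies.assign(earned=earned_premium(policies, as_of))
    incurred = claims.groupby("policy_id")["incurred_amount"].sum()
    policies = policies.assign(
        incurred=policies["policy_id"].map(incurred).fillna(0.0)
    )
    grouped = policies.groupby(["product", "channel"])[["incurred", "earned"]].sum()
    return (grouped["incurred"] / grouped["earned"]).unstack("channel")
```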
Phase 4 extends the digital twin from diagnosis into forward-looking stress testing and executive decision support.
This phase introduces a scenario engine that translates macroeconomic shocks and catastrophe events into portfolio-level paid loss impacts — both gross and net of reinsurance.
The focus is not forecasting. It is understanding exposure, sensitivity, and protection effectiveness under stress.
- How sensitive is the portfolio to inflation, repair costs, unemployment, and CAT events?
- Which scenarios produce material paid-loss uplift?
- Which products drive the stress impact?
- How much risk is absorbed by reinsurance, and how much remains net?
- Where does residual risk concentrate after QS + XL protection?
✔ Scenario engine driven by macro and CAT shocks
✔ Portfolio-level paid loss impact (gross view)
✔ Product-level attribution of scenario uplift
✔ Bootstrap uncertainty bands for key stresses
✔ Reinsurance effectiveness analysis (QS + XL)
✔ Executive-ready board packs (PPT)
✔ Interactive Streamlit scenario simulator for live decision exploration
Board packs:
- 04_board_scenario_pack.pptx
  Gross portfolio impact under macro & CAT scenarios
- 04_board_scenario_pack_with_RI.pptx
  Gross → Net view with QS + XL reinsurance protection
These decks are structured for pricing, underwriting, and reinsurance committees, not exploratory analysis.
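For orientation, the sketch below shows the core gross-to-net mechanics behind these views: macro and CAT uplifts are applied to paid losses, which are then netted down through a quota share followed by a per-claim excess-of-loss layer. The shock parameterisation, column names, and the QS-before-XL ordering are assumptions, not the repository's exact engine.

```python
# Illustrative sketch of the scenario engine's gross-to-net step.
import pandas as pd

def stress_paid_losses(claims: pd.DataFrame, inflation: float = 0.0,
                       repair_shock: float = 0.0, cat_multiplier: float = 1.0) -> pd.Series:
    """Scale paid losses by macro uplifts; CAT-flagged claims get an extra multiplier."""
    uplift = (1 + inflation) * (1 + repair_shock)
    stressed = claims["paid_amount"] * uplift          # assumed column
    return stressed.where(~claims["is_cat"], stressed * cat_multiplier)

def net_of_reinsurance(gross: pd.Series, qs_cession: float = 0.30,
                       xl_retention: float = 250_000, xl_limit: float = 1_000_000) -> pd.Series:
    """Quota share first, then per-claim excess-of-loss recovery capped at the limit."""
    retained = gross * (1 - qs_cession)
    xl_recovery = (retained - xl_retention).clip(lower=0).clip(upper=xl_limit)
    return retained - xl_recovery

def scenario_summary(claims: pd.DataFrame, **shocks) -> pd.Series:
    """Gross vs net paid loss and risk transfer share for one scenario."""
    gross = stress_paid_losses(claims, **shocks)
    net = net_of_reinsurance(gross)
    return pd.Series({
        "gross_paid": gross.sum(),
        "net_paid": net.sum(),
        "risk_transferred_share": 1 - net.sum() / gross.sum(),
    })
```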
This phase also introduces an interactive executive scenario tool.
Features:
- Live sliders for:
- Inflation shock
- Repair cost shock
- Unemployment shift
- CAT year override
- Reinsurance structure controls:
- Quota share %
- XL retention
- XL limit
- Instant visibility of:
- Gross paid loss
- Net paid loss after RI
- Risk transfer efficiency
- Product-level attribution
Location:
notebooks/ui/scenario_simulator_exec_demo.py
Purpose:
Convert static board analysis into a live decision conversation.
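A minimal sketch of what such a control surface looks like in Streamlit is shown below; the widget labels, defaults, and the run_scenario() hook are illustrative assumptions rather than the demo's actual code.

```python
# Minimal Streamlit control surface sketch for the scenario simulator.
import streamlit as st

st.title("Scenario Simulator (exec demo)")

with st.sidebar:
    inflation = st.slider("Inflation shock (%)", 0.0, 20.0, 5.0, step=0.5)
    repair = st.slider("Repair cost shock (%)", 0.0, 30.0, 10.0, step=0.5)
    unemployment = st.slider("Unemployment shift (pp)", -2.0, 5.0, 0.0, step=0.25)
    cat_year = st.checkbox("CAT year override", value=False)
    qs_share = st.slider("Quota share (%)", 0.0, 50.0, 30.0, step=5.0)
    xl_retention = st.number_input("XL retention", value=250_000, step=50_000)
    xl_limit = st.number_input("XL limit", value=1_000_000, step=100_000)

# Hypothetical engine call; the real simulator wires these inputs into the
# Phase 4 scenario engine and reports gross vs net paid loss with attribution.
# result = run_scenario(inflation, repair, unemployment, cat_year,
#                       qs_share / 100, xl_retention, xl_limit)
# st.metric("Gross paid loss", f"{result.gross:,.0f}")
# st.metric("Net paid loss (after RI)", f"{result.net:,.0f}")
```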
Phase 5 introduces a formal modelling-readiness certification layer on top of the frozen synthetic portfolio.
This phase does not fit predictive models.
Instead, it validates that the portfolio is structurally and statistically ready for controlled frequency modelling in the next phase.
In regulated insurance environments, predictive modelling does not begin until exposure integrity, anomaly bounds, and distributional assumptions have been formally validated.
Phase 5 mirrors that discipline.
- Is exposure derived correctly and free from structural inconsistencies?
- Are anomalies rare, bounded, and explainable?
- Does claim count exhibit statistically significant overdispersion?
- Is Negative Binomial GLM justified over Poisson?
- Is meaningful risk signal present across rating factors?
- Is temporal leakage prevented before model fitting?
✔ Exposure derivation from policy start and end dates
✔ Detection and controlled handling of non-positive exposure cases (~0.1%)
✔ Poisson dispersion testing (Pearson χ² / dof ≈ 88)
✔ Formal justification for Negative Binomial frequency modelling
✔ Risk signal validation across rating dimensions (e.g. vehicle_age)
✔ Fraud-like structural clustering diagnostics
✔ Temporal train/test split to prevent forward-looking bias
✔ Phase 5 modelling-readiness certification
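The overdispersion gate can be illustrated with a baseline Poisson GLM and its Pearson dispersion statistic, roughly as follows. The formula terms and column names are placeholders; the notebook's actual specification may differ.

```python
# Sketch of the Poisson dispersion check that motivates the NB2 choice.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def poisson_dispersion(df: pd.DataFrame) -> float:
    """Pearson chi-square / residual dof for a baseline Poisson GLM with exposure offset."""
    model = smf.glm(
        "claim_count ~ C(product) + C(channel) + vehicle_age",   # assumed terms
        data=df,
        family=sm.families.Poisson(),
        offset=np.log(df["exposure"]),
    ).fit()
    # Values well above 1 (Phase 5 reports roughly 88) indicate the Poisson
    # equal-mean-variance assumption is violated, supporting a Negative Binomial GLM.
    return model.pearson_chi2 / model.df_resid
```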
Before fitting a GLM, pricing teams must ensure:
- exposure is structurally consistent
- statistical assumptions are defensible
- risk differentiation exists in the data
- modelling pipelines are leakage-safe
Phase 5 ensures that the portfolio is not only analytically interesting, but statistically and procedurally ready for predictive modelling.
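The leakage-safety requirement amounts to a strictly temporal split, sketched below under the assumption that policy-year rows carry a start_date column and that a fixed cut-off date separates training from test periods.

```python
# Sketch of a leakage-safe temporal split: train on earlier policy years, test on later ones.
import pandas as pd

def temporal_split(df: pd.DataFrame, cutoff: str = "2022-01-01"):
    """Split on policy start date so no future information leaks into training."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df["start_date"] < cutoff_ts]
    test = df[df["start_date"] >= cutoff_ts]
    return train, test
```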
Phase 6 introduces the first predictive modelling layer within the governed digital twin architecture.
This phase implements a Negative Binomial (NB2) claim frequency model, transitioning the project from modelling-readiness certification (Phase 5) into structured actuarial modelling.
The objective is not simply to fit a model.
It is to recover structured risk signal in a way that is:
- statistically defensible
- leakage-safe
- commercially interpretable
- deployable as rating factors
This phase mirrors how regulated pricing teams formally introduce predictive modelling into a governed environment.
- Is Negative Binomial statistically justified over Poisson?
- Does the portfolio exhibit recoverable and stable risk differentiation?
- Can claim frequency be modelled per policy-year using exposure offsets?
- Is fraud correctly separated from the technical pricing base?
- Does the model demonstrate out-of-sample calibration and lift?
- Are rating relativities implementable and stable?
✔ Negative Binomial GLM with log(exposure) offset
✔ Formal overdispersion validation (Poisson vs NB comparison)
✔ Fraud excluded from technical frequency base
✔ Temporal train/test split to prevent leakage
✔ Decile calibration and ~4.3× lift validation
✔ Vehicle age banding with monotonicity checks
✔ Structured pricing relativities (exp(beta))
✔ Exported deployment-ready artefacts
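A condensed sketch of the NB2 fit and relativity derivation is shown below. It fixes the dispersion parameter alpha rather than estimating it, and the formula terms, column names, and output path are assumptions rather than the notebook's exact code.

```python
# Sketch of the NB2 frequency model with exposure offset and relativity export.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

FORMULA = "claim_count ~ C(product) + C(channel) + C(vehicle_age_band)"  # assumed terms

def fit_nb2(train: pd.DataFrame, alpha: float = 1.0):
    """Negative Binomial (NB2) GLM with log(exposure) offset on the training window.
    alpha is fixed here for simplicity; in practice it would be estimated."""
    return smf.glm(
        FORMULA,
        data=train,
        family=sm.families.NegativeBinomial(alpha=alpha),
        offset=np.log(train["exposure"]),
    ).fit()

def relativities(result) -> pd.Series:
    """Convert coefficients to multiplicative rating relativities, exp(beta)."""
    return np.exp(result.params).rename("relativity")

# Illustrative export mirroring outputs/phase6/ (path assumed):
# relativities(fit_nb2(train)).filter(like="product").to_csv(
#     "outputs/phase6/relativities_product.csv")
```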
The portfolio exhibits statistically significant overdispersion, formally justifying the NB2 distribution.
The model demonstrates:
- Strong out-of-sample risk separation
- Stable predicted means across train/test
- Calibration consistency across deciles
- Economically interpretable segmentation
Primary frequency drivers:
- Product type (dominant differentiator)
- Channel
- Vehicle age bands
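The decile calibration and lift checks can be sketched as follows; the predicted-count column, exposure handling, and field names are illustrative assumptions.

```python
# Sketch of decile calibration and top-vs-bottom lift on the held-out window.
import pandas as pd

def decile_lift(test: pd.DataFrame, predicted: pd.Series) -> pd.DataFrame:
    """Bucket policies into predicted-risk deciles and compare observed vs predicted frequency."""
    df = test.assign(pred=predicted)                      # predicted claim counts per row
    df["decile"] = pd.qcut(df["pred"] / df["exposure"], 10, labels=False, duplicates="drop")
    table = df.groupby("decile").apply(
        lambda g: pd.Series({
            "observed_freq": g["claim_count"].sum() / g["exposure"].sum(),
            "predicted_freq": g["pred"].sum() / g["exposure"].sum(),
        })
    )
    # Lift = observed frequency in the riskiest decile vs the safest decile.
    lift = table["observed_freq"].iloc[-1] / table["observed_freq"].iloc[0]
    return table.assign(lift_top_vs_bottom=lift)
```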
Phase 6 produces structured outputs under outputs/phase6/:
- relativities_product.csv
- relativities_channel.csv
- relativities_vehicle_age_band.csv
- phase6_exec_metrics.json
These artefacts mirror internal pricing workflows, where modelling outputs are converted into rating-engine-ready factors rather than remaining notebook-bound.
Phase 6 marks the transition from:
Modelling readiness → Certified predictive pricing layer
The digital twin now contains:
- A distributionally justified frequency model
- Governance-aligned exposure specification
- Leakage-safe validation structure
- Deployable rating relativities
All built on the frozen and validated dataset from Phase 1.
Notebooks:
- 00_data_gen_validation.ipynb — generator sanity checks & governance gates
- 01_eda_frozen_synthetic_universe.ipynb — actuarial realism validation
- 02_portfolio_mix_premium_pricing_context.ipynb — pricing context on frozen data
- 03_loss_ratio_drilldown_actuarial.ipynb — actuarial loss ratio drill-down using earned premium logic, with executive-ready visualisation and governance anchoring
- 04_macro_cat_sensitivity.ipynb — scenario engine, macro sensitivities, CAT stress, uncertainty bands, and RI impact
- 05_anomaly_audit_and_model_robustness.ipynb — exposure validation, anomaly diagnostics, dispersion testing, risk signal stability checks, and modelling-readiness certification
- 06_frequency_model_nb_glm_risk_signal_recovery.ipynb — Negative Binomial frequency model with exposure offset, calibration, lift validation, monotonicity audit, and pricing-ready relativities export
Descriptive portfolio analytics → Controlled actuarial modelling
All built on frozen, governed data from Phase 1.
insurance-digital-twin/
├── data_gen/
│   ├── config.py        # Portfolio assumptions & targets
│   ├── generators.py    # Synthetic data generation logic
│   ├── schemas.py       # Entity schemas (documentation & typing)
│   └── cli.py           # Dataset generation + freeze entry point
│
├── data/
│   └── raw/
│       └── dataset_manifest.json   # Dataset hashes + metadata
│
├── notebooks/
│   ├── 00_data_gen_validation.ipynb
│   ├── 01_eda_frozen_synthetic_universe.ipynb
│   ├── 02_portfolio_mix_premium_pricing_context.ipynb
│   ├── 03_loss_ratio_drilldown_actuarial.ipynb
│   ├── 04_macro_cat_sensitivity.ipynb
│   ├── 05_anomaly_audit_and_model_robustness.ipynb
│   ├── 06_frequency_model_nb_glm_risk_signal_recovery.ipynb
│   └── ui/
│       └── scenario_simulator_exec_demo.py
│
└── README.md
Phase 1 — Dataset Generation & Freeze
Run:
python -m data_gen.cli
This produces a frozen dataset with a versioned manifest and cryptographic hashes.
Phase 2 — Pricing context analysis
Open and run:
- notebooks/02_portfolio_mix_premium_pricing_context.ipynb
Phase 3 — Loss Ratio Drill-Down
Open and run:
- notebooks/03_loss_ratio_drilldown_actuarial.ipynb
Phase 4 — Macro & CAT Scenario Sensitivity (Board View)
Open and run:
- notebooks/04_macro_cat_sensitivity.ipynb
(Optional interactive demo)
- notebooks/ui/scenario_simulator_exec_demo.py
Both the notebook and the interactive demo consume the frozen outputs from Phase 1.
Phase 5 — Anomaly Audit & Model Robustness (Modelling Readiness Gate)
Open and run:
- notebooks/05_anomaly_audit_and_model_robustness.ipynb
Phase 6 — Technical Frequency Model (NB GLM)
Open and run:
- notebooks/06_frequency_model_nb_glm_risk_signal_recovery.ipynb
- v0.1 — Dataset Freeze & Governance
  Frozen synthetic insurance universe with validation gates and manifest.
- v0.2 — Portfolio Mix & Premium Distributions (Pricing Context)
  Pricing context analysis on frozen data: mix, dispersion, concentration, and steering insights.
- v0.3 — Loss Ratio Drill-Down (Actuarial View)
  Earned premium–based loss ratio analysis with product × channel drill-down and executive-ready visualisation.
- v0.4 — Macro & CAT Scenario Sensitivity (Board View)
  Scenario engine, board packs, reinsurance effectiveness, and executive simulator.
- v0.5 — Anomaly Audit & Model Robustness (Modelling Readiness Gate)
  Portfolio statistically validated and certified for NB GLM frequency modelling.
- v0.6 — Technical Frequency Model (NB GLM)
  Governance-aligned Negative Binomial frequency modelling with exposure offset, decile calibration, lift validation, and deployable pricing relativities.
- v0.7 — Fraud Model (Lift + Ring Detection), planned
  - Fraud propensity modelling
  - Ring detection via structural clustering
  - Fraud lift evaluation
  - Separation of pricing and fraud overlays
  - Fraud-adjusted scenario integration