Calibration + abstention + decision-safe policy UI for loan approval.
Turn model scores into actions you can defend: auto-approve / auto-reject / send-to-review.
Most ML projects stop at “here’s the AUC”. Underwriting can’t.
In lending, a prediction is only useful if it can be turned into a decision policy with:
- Calibrated probabilities (a “0.90 approve” should mean ~90% approval correctness under similar conditions)
- Abstention (a review path for uncertain cases)
- Coverage tradeoffs (how many cases you can safely auto-decide without breaking quality)
- Decision-safe UI (so the user sees confidence + what rule triggered the outcome)
This lab implements a full, end-to-end workflow:
- Train a baseline loan approval model
- Calibrate predicted probabilities
- Build a coverage frontier (threshold ↔ auto-decision rate ↔ quality)
- Recommend a defensible threshold policy
- Provide an interactive Streamlit app for triage + reporting
Disclaimer: This is a data science lab / portfolio project. It is not financial advice and not a production underwriting system.
| File | Description |
| --- | --- |
| outputs/metrics_overall.json | model metrics on the test split (accuracy, F1, ROC-AUC, ECE, Brier, …) |
| outputs/abstention_policy.json | recommended threshold + expected coverage + expected “auto” quality |
| outputs/test_predictions.csv | per-row probabilities + labels (used by the triage UI) |
| outputs/coverage_curve.csv | threshold sweep results (coverage vs performance) |
Placed in reports/figures/:
- Confusion Matrix (baseline threshold)
- Coverage vs Performance (abstention tradeoff)
- Probability Histograms (separation + confidence)
- Reliability Diagram (calibration)
Tabs:
- Report Card
- Coverage Curve
- Triage UI
- Data Quality
- Notes
This lab uses Kaggle’s Loan Approval Dataset (loanapproval.csv).
Key columns (as shown in the Data Quality tab):
- applicant_id (unique ID)
- age (numeric)
- gender (categorical)
- marital_status (categorical)
- annual_income (numeric)
- loan_amount (numeric)
- credit_score (numeric)
- num_dependents (numeric)
- existing_loans_count (numeric)
- employment_status (categorical)
- loan_approved (target, 0/1)
- Rows: 1000
- Columns: 11
- Missingness: 0 in all columns (in this dataset snapshot)
- Target balance: approvals are the majority class (typical of curated demo datasets)
Even if missingness is 0 here, the Data Quality tab is important: underwriting models are extremely sensitive to quiet schema drift (new employment types, score ranges shifting, etc).
underwriting-decision-safety-lab/
├─ app/
│ └─ app.py # Streamlit dashboard
├─ data/
│ └─ raw/
│ └─ loanapproval.csv
├─ outputs/ # generated JSON/CSV artifacts
├─ reports/
│ └─ figures/ # generated PNG charts
└─ src/
├─ pipeline.py # main pipeline entrypoint
├─ clean.py # cleaning + schema normalization
├─ train.py # model training
├─ calibrate.py # sigmoid/isotonic calibration
├─ abstention.py # threshold sweep + policy recommendation
├─ metrics.py # ECE/Brier/etc
└─ plots.py # figure generation
python -m venv .venv
# Windows:
.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate
pip install -r requirements.txt
python -m src.pipeline --input data/raw/loanapproval.csv
You should see something like:
- Done!
- Outputs: outputs/
- Figures: reports/figures/
streamlit run app/app.py
Below is a walkthrough of each dashboard screen. Each section explains:
- what the screen is answering
- how to interpret the results
- what actions you can take next
Goal: Answer “Is the model good enough to consider auto-decisions and what’s the safe default policy?”
- Top-level test metrics: Accuracy, F1, ROC-AUC, ECE, Brier
- A recommended abstention policy (threshold and expected coverage)
- A 2×2 figure grid:
  - Confusion matrix
  - Reliability diagram
  - Coverage vs performance
  - Probability histograms
Accuracy: “How often do we predict correctly overall?” Why it’s not enough: a model can be accurate but dangerously overconfident.
F1: Balances precision and recall (especially useful when class imbalance exists). Underwriting interpretation: helps you avoid “approve everything” or “reject everything” behavior.
ROC-AUC: Ranking quality: “Do approved cases generally get higher scores than rejected cases?” Underwriting interpretation: a strong AUC helps, but it doesn’t guarantee good threshold decisions.
ECE (Expected Calibration Error): Measures how far confidence deviates from reality across probability bins.
- If the model says ~0.8 approval 100 times, about 80 of them should actually be approved.
- High ECE means probabilities are not trustworthy without calibration.
Brier score: Mean squared error of probabilistic predictions.
- Lower is better.
- Penalizes confident wrong predictions heavily.
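For reference, here is a minimal sketch of how these test-split metrics could be computed with scikit-learn. The helper name expected_calibration_error and the arrays y_test / p_test are illustrative assumptions, not the repo’s actual src/metrics.py API.

```python
# Illustrative sketch, not the repo's metrics.py.
# Assumes y_test (0/1 labels) and p_test (calibrated P(approve)) as numpy arrays.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, brier_score_loss

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Weighted average over probability bins of |observed approval rate - mean predicted prob|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p_pred >= lo) & (p_pred < hi) if hi < 1.0 else (p_pred >= lo)
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - p_pred[mask].mean())
    return ece

y_hat = (p_test >= 0.5).astype(int)
metrics = {
    "accuracy": accuracy_score(y_test, y_hat),
    "f1": f1_score(y_test, y_hat),
    "roc_auc": roc_auc_score(y_test, p_test),
    "brier": brier_score_loss(y_test, p_test),
    "ece": expected_calibration_error(y_test, p_test),
}
```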
“At a baseline threshold (often 0.5), what kinds of mistakes are we making?”
- Rows = true label
- Columns = predicted label
- Diagonal = correct predictions
- Off-diagonals = errors:
  - False approvals (predict approve but actually reject) → risk/credit loss
  - False rejections (predict reject but actually approve) → opportunity cost/customer friction
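A minimal sketch of how the matrix at the 0.5 baseline could be produced (y_test / p_test as in the metrics sketch above; not necessarily what src/plots.py does):

```python
# Confusion matrix at the 0.5 baseline threshold; rows = true label, columns = predicted label.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, (p_test >= 0.5).astype(int), labels=[0, 1])
false_approvals = cm[0, 1]   # true reject, predicted approve -> credit risk
false_rejections = cm[1, 0]  # true approve, predicted reject -> opportunity cost
```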
Underwriting typically does not operate at a single fixed threshold. The point of this repo is to move from “0.5 classifier” → policy:
- conservative auto-approves
- conservative auto-rejects
- everything else → review
“Can we trust predicted probabilities as probabilities?”
- X-axis: predicted probability (binned)
- Y-axis: observed accuracy / empirical frequency
- The diagonal line = perfect calibration
  - points above the line: underconfident (reality > confidence)
  - points below the line: overconfident (confidence > reality)
Abstention rules depend on confidence thresholds like:
- auto-decide only if confidence ≥ 0.85
If probabilities are miscalibrated, “0.85” is not meaningful.
What good looks like:
- Points close to the diagonal across mid-to-high probability regions
- Especially near the decision threshold you’ll deploy
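A minimal calibration sketch, assuming an unfitted scikit-learn classifier base_model plus the usual splits; the repo’s src/calibrate.py may differ in detail:

```python
# Post-hoc probability calibration plus the points for a reliability diagram.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

# method="sigmoid" (Platt scaling) or method="isotonic"; cv=5 refits base_model internally.
calibrated = CalibratedClassifierCV(base_model, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
p_test = calibrated.predict_proba(X_test)[:, 1]

# Reliability diagram: observed approval frequency vs. mean predicted probability per bin.
frac_positive, mean_predicted = calibration_curve(y_test, p_test, n_bins=10)
```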
“Does the model separate approvals from rejections and where does uncertainty live?”
- Two overlapping histograms:
  - Approved (y=1)
  - Rejected (y=0)
- If the distributions are well-separated, the model can confidently auto-decide more cases.
- If they overlap heavily near the middle, you’ll need more abstention/review.
- A large mass near 1.0 for approvals suggests strong “safe approve” region.
- A spread-out rejection distribution suggests rejections are harder to identify or less consistent.
- The overlap zone is your review queue candidate.
“How much quality do we gain if we abstain more?”
- Coverage: fraction of cases the system auto-decides
- Auto-performance: accuracy/F1 measured only on auto-decided cases
- As the threshold increases:
  - coverage usually decreases
  - auto-quality usually increases
How to use it:
- Decide a target coverage (e.g., 70% auto-decide)
- Choose the confidence threshold that achieves it
- Verify auto-quality is acceptable
- Review queue size becomes (1 - coverage)
High auto-accuracy is easy if you abstain on everything difficult. So you must always report both coverage + quality together.
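A sketch of the threshold sweep that could back outputs/coverage_curve.csv, assuming numpy arrays y_true and p_approve; the grid and column names are illustrative:

```python
# Sweep a confidence threshold and record coverage vs. quality on the auto-decided subset.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def coverage_sweep(y_true, p_approve, thresholds=np.arange(0.50, 1.00, 0.01)):
    confidence = np.maximum(p_approve, 1 - p_approve)
    rows = []
    for t in thresholds:
        auto = confidence >= t            # cases the policy would auto-decide
        if not auto.any():
            continue
        y_hat = (p_approve[auto] >= 0.5).astype(int)
        rows.append({
            "threshold": round(float(t), 2),
            "coverage": float(auto.mean()),
            "auto_accuracy": accuracy_score(y_true[auto], y_hat),
            "auto_f1": f1_score(y_true[auto], y_hat, zero_division=0),
        })
    return pd.DataFrame(rows)
```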
Goal: Make the coverage frontier interactive and easy to inspect.
- A “knee” in the curve: a region where a small reduction in coverage buys a big jump in quality
- Stability: avoid thresholds where tiny changes cause big swings
- A defensible operating point:
  - “At threshold 0.85 we auto-decide ~70% with ~0.98 auto-accuracy”
Goal: Show how an underwriter or analyst would experience the model.
- You enter applicant features (age, income, loan amount, credit score, etc.)
- The app outputs:
  - p(approve)
  - a confidence measure (often max probability or margin)
  - a decision: AUTO-DECIDE or REVIEW
  - a bar chart of class probabilities
Instead of saying “approve” because p=0.71, the UI says:
- AUTO-DECIDE when confidence ≥ threshold
- REVIEW otherwise
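A minimal sketch of that rule; the 0.85 default mirrors the recommended policy shown later, and the function and field names are illustrative:

```python
# Single-threshold triage: auto-decide only when the model is confident enough.
def triage(p_approve: float, threshold: float = 0.85) -> dict:
    confidence = max(p_approve, 1 - p_approve)
    if confidence >= threshold:
        decision = "APPROVE" if p_approve >= 0.5 else "REJECT"
        return {"action": "AUTO-DECIDE", "decision": decision, "confidence": confidence}
    return {"action": "REVIEW", "decision": None, "confidence": confidence}
```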
This makes the system defensible:
- you can explain what confidence threshold you chose
- you can estimate workload (review volume)
- you can monitor drift (coverage changing over time)
A raw probability without a policy invites misuse:
- different teams interpret it differently
- thresholds get chosen ad hoc
- you lose traceability for “why was this decision made?”
Goal: Catch problems before you trust metrics.
Even in clean demo datasets, underwriting systems in the wild break due to:
- income missing for a new channel
- employment status missing for a partner integration
- credit scores outside expected range
- negative loan amounts
- impossible ages
- new employment_status values
- changes in marital status encoding
A policy like “auto-decide above 0.85” assumes your feature distribution is similar to training. Data quality checks are the “trust gate” before policy application.
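A sketch of what such a trust gate could look like, using the column names listed above; the range limits and the known employment categories are placeholder assumptions, not rules from the repo:

```python
# Cheap schema/range checks to run before trusting model scores or applying the policy.
import pandas as pd

def quality_report(df: pd.DataFrame, known_employment: set) -> dict:
    return {
        "missing_by_column": df.isna().sum().to_dict(),
        "negative_loan_amounts": int((df["loan_amount"] < 0).sum()),
        "impossible_ages": int((~df["age"].between(18, 100)).sum()),              # placeholder bounds
        "credit_score_out_of_range": int((~df["credit_score"].between(300, 850)).sum()),
        "unseen_employment_status": sorted(set(df["employment_status"].dropna()) - known_employment),
    }
```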
- Accuracy ≠ trust
- ECE is calibration error
- Coverage is a product metric (review queue size is not free)
A real system should add:
- fairness slice audits (calibration + error rates by gender/age/employment)
- monitoring: score drift, approval-rate drift, coverage drift
- cost-aware policy: false approvals vs false rejections vs review cost
- retraining triggers when calibration degrades
The app shows a recommended policy (example from the Report Card tab):
- Threshold (confidence): 0.85
- Expected coverage: 0.70
- Auto accuracy: 0.977
- Auto F1: 0.986
Interpretation:
- The system auto-decides ~70% of applicants.
- The remaining ~30% go to human review.
- Auto-decided cases are high-confidence, so quality is high.
- This is not “cheating”; it is a conscious design decision that turns ML into a safe workflow.
Right now, many prototypes use one confidence threshold. Underwriting often benefits from:
- auto-approve if p(approve) ≥ T_approve
- auto-reject if p(approve) ≤ T_reject
- else review
This reduces review load while controlling risk.
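A sketch of the dual-threshold rule, with placeholder threshold values:

```python
# Two thresholds: confident approvals and confident rejections are automated, the rest reviewed.
def dual_threshold_policy(p_approve: float, t_approve: float = 0.90, t_reject: float = 0.10) -> str:
    if p_approve >= t_approve:
        return "AUTO-APPROVE"
    if p_approve <= t_reject:
        return "AUTO-REJECT"
    return "REVIEW"
```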
Replace “maximize accuracy” with:
- cost(false approval) >> cost(false rejection)
- cost(review) as a workload term
Then choose thresholds that minimize expected cost.
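A sketch of cost-aware threshold selection on a validation split; the cost constants, grid, and the y_val / p_val numpy arrays are placeholder assumptions:

```python
# Pick (t_approve, t_reject) by minimizing expected per-application cost.
import numpy as np

C_FALSE_APPROVE, C_FALSE_REJECT, C_REVIEW = 10.0, 1.0, 0.2   # placeholder business costs

def expected_cost(y_true, p_approve, t_approve, t_reject):
    approve = p_approve >= t_approve
    reject = p_approve <= t_reject
    review = ~(approve | reject)
    return (C_FALSE_APPROVE * (approve & (y_true == 0)).sum()
            + C_FALSE_REJECT * (reject & (y_true == 1)).sum()
            + C_REVIEW * review.sum()) / len(y_true)

grid = np.arange(0.05, 1.00, 0.05)
t_approve_best, t_reject_best = min(
    ((ta, tr) for ta in grid for tr in grid if tr < ta),
    key=lambda ts: expected_cost(y_val, p_val, *ts),
)
```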
Add slice dashboards:
- ECE by gender
- error rate by age band
- approval rate by employment status
- coverage by subgroup
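A sketch of a slice audit, assuming a test-split DataFrame with y_true, p_approve, and the sensitive attribute as columns; it reuses the expected_calibration_error helper sketched in the metrics section:

```python
# Per-group metrics so calibration, coverage, and approval rate can be compared across slices.
import numpy as np
import pandas as pd

def slice_report(df: pd.DataFrame, group_col: str = "gender", threshold: float = 0.85) -> dict:
    out = {}
    for group, g in df.groupby(group_col):
        confidence = np.maximum(g["p_approve"].to_numpy(), 1 - g["p_approve"].to_numpy())
        out[group] = {
            "n": int(len(g)),
            "approval_rate": float((g["p_approve"] >= 0.5).mean()),
            "coverage": float((confidence >= threshold).mean()),
            "ece": expected_calibration_error(g["y_true"].to_numpy(), g["p_approve"].to_numpy()),
        }
    return out
```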
Track weekly:
- score distribution drift
- coverage drift
- approval-rate drift
- calibration drift (ECE moving)
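One common way to quantify these drifts is the Population Stability Index (PSI) on model scores or coverage; this is a general-purpose sketch, not part of the current pipeline, and the 0.2 cut-off is only a rule of thumb:

```python
# PSI between a reference score distribution (e.g., the training snapshot) and the current week.
import numpy as np

def psi(reference, current, n_bins=10, eps=1e-6):
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    ref_idx = np.digitize(reference, edges[1:-1])          # bucket index 0..n_bins-1
    cur_idx = np.digitize(current, edges[1:-1])
    ref_frac = np.bincount(ref_idx, minlength=n_bins) / len(reference) + eps
    cur_frac = np.bincount(cur_idx, minlength=n_bins) / len(current) + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Rule of thumb: PSI > 0.2 on p(approve) is a signal to investigate calibration and retraining.
```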
Newer Streamlit versions prefer:
width="stretch"instead ofuse_container_width=True
If you see deprecation warnings, update your st.plotly_chart(...) and st.image(...) calls accordingly.
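For example, assuming a Streamlit release that accepts the width parameter (the figure here is a placeholder):

```python
import plotly.express as px
import streamlit as st

# Placeholder figure; in the app this would be e.g. the coverage curve.
fig = px.line(x=[0.5, 0.7, 0.9], y=[0.95, 0.97, 0.99],
              labels={"x": "threshold", "y": "auto-accuracy"})
st.plotly_chart(fig, width="stretch")              # newer API, per the note above
# st.plotly_chart(fig, use_container_width=True)   # older, now-deprecated form
```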
Ensure:
- st.set_page_config(layout="wide")
- Use consistent containers/columns
- Use width="stretch" for charts/images in Streamlit
- Dataset: https://www.kaggle.com/datasets/amineipad/loan-approval-dataset
- Tools: pandas, scikit-learn, Streamlit, matplotlib/plotly